Derivation of the Correlation Coefficient
Data Science and A.I. Lecture Series
Problem Statement
Objective: Derive the following expression for the correlation coefficient \( r(X, Y) \), which uses only the variances of \( X \), \( Y \), and \( X - Y \):
\[
r(X, Y) = \frac{\sigma_X^2 + \sigma_Y^2 - \sigma_{X-Y}^2}{2 \sigma_X \sigma_Y}.
\]
Definitions:
- \( \sigma_X^2 \): Variance of \( X \).
- \( \sigma_Y^2 \): Variance of \( Y \).
- \( \sigma_{X-Y}^2 \): Variance of \( Z = X - Y \).
- Covariance between \( X \) and \( Y \): \( \text{Cov}(X, Y) \).
Step 1: Variance of \( Z = X - Y \)
Define \( Z = X - Y \). The variance of \( Z \) is:
\[
\sigma_{X-Y}^2 = \frac{1}{n} \sum_{i=1}^n \left( z_i - \overline{Z} \right)^2.
\]
Where:
- \( z_i = x_i - y_i \): Difference between corresponding values of \( X \) and \( Y \).
- \( \overline{Z} = \overline{X} - \overline{Y} \): Mean of \( Z \), obtained as the difference of the means of \( X \) and \( Y \).
Substitute \( z_i = x_i - y_i \) and \( \overline{Z} = \overline{X} - \overline{Y} \), then regroup using \( z_i - \overline{Z} = (x_i - \overline{X}) - (y_i - \overline{Y}) \):
\[
\sigma_{X-Y}^2 = \frac{1}{n} \sum_{i=1}^n \left\{ (x_i - \overline{X}) - (y_i - \overline{Y}) \right\}^2.
\]
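As a quick numerical illustration of Step 1, the sketch below computes the variance of \( Z \) from the centered deviations of \( X \) and \( Y \). The function name `variance_of_difference` and the sample values are illustrative assumptions, not part of the lecture; the population convention (division by \( n \)) matches the formulas above.

```python
import numpy as np

def variance_of_difference(x, y):
    """Population variance (1/n) of Z = X - Y, built from centered deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()  # x_i minus the mean of X
    dy = y - y.mean()  # y_i minus the mean of Y
    return np.mean((dx - dy) ** 2)

# Hypothetical sample data, chosen only for illustration.
x = [2.0, 4.0, 6.0, 8.0]
y = [1.0, 3.0, 2.0, 6.0]

# Both expressions should agree: the regrouped form and the direct variance of x - y.
print(variance_of_difference(x, y), np.var(np.subtract(x, y)))
```

`np.var` uses the \( 1/n \) normalization by default, so the two printed values should agree up to floating-point rounding.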
Step 2: Expanding the Variance
Expand the squared term inside the summation:
\[
\sigma_{X-Y}^2 = \frac{1}{n} \sum_{i=1}^n \left[ (x_i - \overline{X})^2 + (y_i - \overline{Y})^2 - 2 (x_i - \overline{X})(y_i - \overline{Y}) \right].
\]
This gives three components:
- \( \frac{1}{n} \sum_{i=1}^n (x_i - \overline{X})^2 = \sigma_X^2 \), the variance of \( X \).
- \( \frac{1}{n} \sum_{i=1}^n (y_i - \overline{Y})^2 = \sigma_Y^2 \), the variance of \( Y \).
- \( \frac{1}{n} \sum_{i=1}^n (x_i - \overline{X})(y_i - \overline{Y}) = \text{Cov}(X, Y) \), the covariance between \( X \) and \( Y \).
Substitute these into \( \sigma_{X-Y}^2 \):
\[
\sigma_{X-Y}^2 = \sigma_X^2 + \sigma_Y^2 - 2 \, \text{Cov}(X, Y).
\]
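The identity from Step 2 can be checked numerically. The following is a minimal sketch, assuming NumPy and the population (\( 1/n \)) convention used above; the helper name `variance_identity_sides` and the sample values are hypothetical.

```python
import numpy as np

def variance_identity_sides(x, y):
    """Return (left, right) sides of Var(X - Y) = Var(X) + Var(Y) - 2 Cov(X, Y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    var_x = np.var(x)  # population variance, 1/n
    var_y = np.var(y)
    cov_xy = np.mean((x - x.mean()) * (y - y.mean()))  # population covariance, 1/n
    left = np.var(x - y)
    right = var_x + var_y - 2.0 * cov_xy
    return left, right

# Hypothetical sample data, chosen only for illustration.
left, right = variance_identity_sides([2.0, 4.0, 6.0, 8.0], [1.0, 3.0, 2.0, 6.0])
print(left, right)
```

The same check works with the sample (\( 1/(n-1) \)) convention, provided the variances and the covariance all use it consistently.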
Step 3: Correlation Coefficient
Recall the definition of the correlation coefficient:
\[
r(X, Y) = \frac{\text{Cov}(X, Y)}{\sigma_X \sigma_Y}.
\]
From the variance expansion:
\[
\sigma_{X-Y}^2 = \sigma_X^2 + \sigma_Y^2 - 2 \, \text{Cov}(X, Y).
\]
Rearrange to express \( \text{Cov}(X, Y) \) in terms of \( \sigma_X^2, \sigma_Y^2, \) and \( \sigma_{X-Y}^2 \):
\[
\text{Cov}(X, Y) = \frac{\sigma_X^2 + \sigma_Y^2 - \sigma_{X-Y}^2}{2}.
\]
Substitute into the formula for \( r(X, Y) \), assuming \( \sigma_X \) and \( \sigma_Y \) are both nonzero:
\[
r(X, Y) = \frac{\sigma_X^2 + \sigma_Y^2 - \sigma_{X-Y}^2}{2 \sigma_X \sigma_Y}.
\]
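The final formula translates directly into code. The sketch below is one possible implementation, assuming NumPy; the function name `corr_from_variances` is an illustrative choice, and the comparison against `np.corrcoef` serves only as a sanity check.

```python
import numpy as np

def corr_from_variances(x, y):
    """Correlation coefficient computed from Var(X), Var(Y), and Var(X - Y) only."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    var_x = np.var(x)         # population variance, 1/n
    var_y = np.var(y)
    var_diff = np.var(x - y)  # variance of Z = X - Y
    if var_x == 0.0 or var_y == 0.0:
        # The formula divides by sigma_X * sigma_Y, so constant data is excluded.
        raise ValueError("correlation is undefined when a variance is zero")
    return (var_x + var_y - var_diff) / (2.0 * np.sqrt(var_x * var_y))

# Hypothetical sample data, chosen only for illustration.
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

print(corr_from_variances(x, y))  # from the three variances
print(np.corrcoef(x, y)[0, 1])    # NumPy's built-in Pearson correlation
```

The two printed values should agree up to floating-point rounding, since the normalization factor cancels in the ratio.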
Conclusion
The correlation coefficient \( r(X, Y) \) can therefore be written using only the variances of \( X \), \( Y \), and \( X - Y \), with no explicit covariance term. This form is useful when the variance of the difference is easier to obtain than the covariance itself.