Correlation Analysis: Measuring Relationships Between Variables
Definition
Correlation analysis measures the strength and direction of the linear relationship between two variables. It quantifies how much the variables change together - when one increases, does the other tend to increase (positive correlation), decrease (negative correlation), or show no consistent pattern (no correlation)? The correlation coefficient ranges from -1 (perfect negative correlation) through 0 (no linear correlation) to +1 (perfect positive correlation). The most common measure is Pearson's correlation coefficient, which captures only linear association; its standard significance test additionally assumes that the variables are approximately bivariate normal.
Intuition
Imagine watching two people on a dance floor. If they move together perfectly - when one steps left, the other steps left - they have perfect positive correlation (+1). If they move oppositely - when one steps left, the other steps right - they have perfect negative correlation (-1). If they dance independently with no pattern, correlation is near 0.
Mathematical Formula
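$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$
Step-by-Step Explanation:
- Pearson r numerator: Sum of cross-products of deviations from the means. It is positive when x and y tend to sit on the same side of their means, negative when they sit on opposite sides.
- Pearson r denominator: Square root of the product of the two sums of squared deviations (equivalently, n - 1 times the product of the sample standard deviations). It normalizes r to the -1 to +1 scale.
- R-squared: Square of Pearson r. In simple linear regression it is the proportion of variance in one variable explained by a linear fit on the other.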
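The three quantities described above can be traced by hand on a tiny data set. The sketch below uses made-up numbers (the lists x and y are illustrative, not from any real study) and checks that r squared matches the share of variance explained by a least-squares line.

```python
import math

# Small illustrative data set (made up for this sketch)
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Numerator: sum of cross-products of deviations from the means
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
# Sums of squared deviations for each variable
s_xx = sum((xi - mean_x) ** 2 for xi in x)
s_yy = sum((yi - mean_y) ** 2 for yi in y)

# Denominator: square root of the product of the two sums of squares
r = s_xy / math.sqrt(s_xx * s_yy)

# Least-squares line y = intercept + slope * x
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

# Share of the variance in y explained by the fitted line
ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
explained = 1 - ss_res / s_yy

print(f"r = {r:.4f}, r^2 = {r ** 2:.4f}, explained variance = {explained:.4f}")
# prints: r = 0.7746, r^2 = 0.6000, explained variance = 0.6000
```

Here the numerator is 6 and the two sums of squares are 10 and 6, so r = 6 / sqrt(60), and a straight-line fit accounts for 60% of the variance in y.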
Real-World Use Cases
Portfolio managers calculate correlations between asset returns to diversify investments. Combining assets whose returns have low or negative correlations reduces overall portfolio risk, because their price swings partly offset each other.
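To make the diversification effect concrete, here is a minimal sketch based on the standard two-asset portfolio variance formula. The function name portfolio_volatility and all input values (20% volatility per asset, a 50/50 split) are hypothetical and chosen only for illustration.

```python
import math

def portfolio_volatility(w1, sigma1, sigma2, rho):
    """Volatility of a two-asset portfolio with weights w1 and 1 - w1."""
    w2 = 1 - w1
    variance = (
        (w1 * sigma1) ** 2
        + (w2 * sigma2) ** 2
        + 2 * w1 * w2 * rho * sigma1 * sigma2
    )
    # Guard against tiny negative values caused by floating-point rounding
    return math.sqrt(max(variance, 0.0))

# Hypothetical inputs: two assets with 20% volatility each, held 50/50
for rho in (1.0, 0.5, 0.0, -0.5, -1.0):
    vol = portfolio_volatility(0.5, 0.20, 0.20, rho)
    print(f"correlation {rho:+.1f} -> portfolio volatility {vol:.1%}")
```

With a correlation of +1 the portfolio is exactly as volatile as its components; as the correlation falls, the same two assets combine into a steadily less volatile portfolio.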
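Researchers study correlations between lifestyle factors and health outcomes. A correlation can flag candidate risk factors for further study, but it does not establish that the factor causes the outcome.
Analysts correlate ad spend with sales to measure ROI. A high correlation is consistent with effective advertising, although seasonality or other confounders can produce the same pattern.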
Implementation
Manual Implementation (No Libraries)
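import math

def pearson_correlation(x, y):
    """Return Pearson's r for two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of cross-products of deviations from the means
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the sums of squared deviations
    sum_sq_x = sum((xi - mean_x) ** 2 for xi in x)
    sum_sq_y = sum((yi - mean_y) ** 2 for yi in y)
    denominator = math.sqrt(sum_sq_x * sum_sq_y)
    # r is undefined when either variable is constant; return NaN rather than 0
    return numerator / denominator if denominator != 0 else float("nan")

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 5, 7, 8, 10, 11, 13, 14, 16]
print(f'Pearson r: {pearson_correlation(x, y):.4f}')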
Using Libraries (numpy, scipy)
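import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed so the result is reproducible
x = rng.standard_normal(100)
y = 2 * x + rng.standard_normal(100) * 0.5  # linear signal plus noise
r, p = stats.pearsonr(x, y)  # returns the coefficient and a two-sided p-value
print(f'Pearson r: {r:.4f}, p-value: {p:.3g}')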
When to Use
✅ Appropriate Use Cases:
- Pearson: Use when both variables are continuous and the relationship is approximately linear.
- Spearman: Use when variables are ordinal, or when the relationship is monotonic but not necessarily linear.
- Correlation matrix: Use when exploring pairwise relationships among many variables.
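The choices above can be compared on a small example. The sketch below is library-free and uses made-up data; the helper names pearson_r, ranks, and spearman_rho are hypothetical. It computes Spearman's coefficient as Pearson's r applied to ranks, then builds a small correlation matrix. In practice, scipy.stats.spearmanr and numpy.corrcoef cover the same ground.

```python
import math

def pearson_r(x, y):
    """Pearson's r for two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [xi ** 3 for xi in x]   # monotonic in x, but not linear
z = [6, 3, 5, 1, 4, 2]      # made-up third variable

print(f"Pearson(x, y)  = {pearson_r(x, y):.4f}")
print(f"Spearman(x, y) = {spearman_rho(x, y):.4f}")

# Pairwise correlation matrix as a nested dict
columns = {"x": x, "y": y, "z": z}
matrix = {a: {b: pearson_r(columns[a], columns[b]) for b in columns} for a in columns}
for name, row in matrix.items():
    print(name, [f"{value:+.2f}" for value in row.values()])
```

Because y always increases with x, the ranks agree perfectly and Spearman's coefficient is 1, while Pearson's r falls short of 1 because the points do not lie on a straight line.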
❌ Avoid When:
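- Never treat a correlation as evidence of causation on its own.
- Do not use Pearson for non-linear relationships; r can be near 0 even when the variables are strongly related.
- Do not ignore confounding variables, which can create or hide a correlation between the two variables of interest.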
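The warning about non-linear relationships is easy to demonstrate. In the sketch below (made-up data, with a hypothetical helper pearson_r), y is completely determined by x, yet Pearson's r is zero because the relationship is a symmetric curve rather than a line.

```python
import math

def pearson_r(x, y):
    """Pearson's r for two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]   # y depends entirely on x, but not linearly

r = pearson_r(x, y)
print(f"Pearson r for y = x^2 on a symmetric range: {r:.4f}")
```

The positive and negative halves of the parabola cancel in the numerator, so a scatter plot is a more reliable first check than the coefficient alone.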
Common Pitfalls
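- Correlation does not imply causation: two variables can move together because of a shared confounder, reverse causation, or chance.
- Outliers can distort correlation: a single extreme point can create a strong r where there is none, or mask one that exists.
- Range restriction can hide true correlations: sampling only a narrow slice of one variable shrinks its variance and tends to pull r toward 0.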
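The last two pitfalls can be reproduced with small deterministic data sets. Everything below is made up for illustration, including the helper pearson_r and the alternating "noise" term. The sketch is meant to show the direction of each effect, not its typical size.

```python
import math

def pearson_r(x, y):
    """Pearson's r for two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Outlier: ten points with almost no linear relationship
x = list(range(1, 11))
y = [5, 3, 6, 4, 5, 4, 6, 3, 5, 4]
r_clean = pearson_r(x, y)
# Adding one extreme point dominates both sums of squares
r_outlier = pearson_r(x + [50], y + [50])
print(f"without outlier: {r_clean:+.3f}, with outlier: {r_outlier:+.3f}")

# Range restriction: y follows x plus an alternating +/-4 disturbance
full_x = list(range(20))
full_y = [xi + (4 if xi % 2 == 0 else -4) for xi in full_x]
r_full = pearson_r(full_x, full_y)
# Keep only the observations with x between 8 and 11
kept = [(a, b) for a, b in zip(full_x, full_y) if 8 <= a <= 11]
r_restricted = pearson_r([a for a, _ in kept], [b for _, b in kept])
print(f"full range: {r_full:+.3f}, restricted range: {r_restricted:+.3f}")
```

In the first case a single added point turns a near-zero coefficient into a value close to +1. In the second, the underlying trend is unchanged, but over the narrow window the disturbance outweighs it and the coefficient collapses.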