Correlation Analysis: Measuring Relationships Between Variables

Prerequisites:

Descriptive Statistics: Measures of Central Tendency and Dispersion; basic Python

Definition

Correlation analysis measures the strength and direction of the relationship between two variables. It quantifies how much the variables change together: when one increases, does the other tend to increase (positive correlation), decrease (negative correlation), or show no consistent pattern (no correlation)? The correlation coefficient ranges from -1 (perfect negative correlation) through 0 (no linear correlation) to +1 (perfect positive correlation). The most common measure is Pearson's correlation coefficient, which captures only linear relationships. Computing it does not require normally distributed variables, but the usual significance tests and confidence intervals for r assume approximately bivariate normal data.

Intuition

Imagine watching two people on a dance floor. If they move together perfectly - when one steps left, the other steps left - they have perfect positive correlation (+1). If they move oppositely - when one steps left, the other steps right - they have perfect negative correlation (-1). If they dance independently with no pattern, correlation is near 0.

Mathematical Formula

Pearson Correlation Coefficient (r):

\quad r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}

Covariance:

\quad Cov(X,Y) = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

R-squared:

\quad R^2 = r^2

Step-by-Step Explanation:

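Pearson r numerator: Sum of cross-products of deviations from the means. Dividing it by n-1 gives the sample covariance.
Pearson r denominator: Square root of the product of the two sums of squared deviations. Dividing it by n-1 gives the product of the sample standard deviations, so the ratio is the covariance rescaled to the -1 to +1 range.
R-squared: Square of Pearson r. In a simple linear regression of one variable on the other, it is the proportion of variance explained (multiply by 100 for a percentage).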
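The link between the three formulas can be checked numerically. The following is a minimal sketch using only the standard library; the data values are made up for illustration and are not from the article.

```python
import math
import statistics

# Illustrative data (hypothetical values chosen for this sketch).
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = statistics.fmean(x)
mean_y = statistics.fmean(y)

# Sample covariance with the n - 1 divisor, as in the Cov(X,Y) formula.
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

# Sample standard deviations (statistics.stdev also divides by n - 1).
s_x = statistics.stdev(x)
s_y = statistics.stdev(y)

# Pearson r is the covariance rescaled by both standard deviations.
r = cov_xy / (s_x * s_y)
r_squared = r ** 2

print(f"Cov = {cov_xy:.4f}, r = {r:.4f}, R^2 = {r_squared:.4f}")
```

Because the n-1 factors cancel between covariance and standard deviations, this gives the same r as the sum-based formula above.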

Interactive Demo

Scatter plot: correlated vs. uncorrelated example data

Real-World Use Cases

Finance

Portfolio managers calculate correlations between asset returns to diversify investments. Assets with low or negative correlations reduce overall portfolio risk.
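To see why lower correlation reduces risk, consider the standard two-asset portfolio variance formula, w1^2 s1^2 + w2^2 s2^2 + 2 w1 w2 rho s1 s2. The sketch below is illustrative only: the function name portfolio_volatility, the equal weights, and the 20% volatilities are assumptions made for this example.

```python
import math

def portfolio_volatility(w1, sigma1, sigma2, rho):
    """Volatility of a two-asset portfolio with weights w1 and 1 - w1."""
    w2 = 1.0 - w1
    variance = (
        (w1 * sigma1) ** 2
        + (w2 * sigma2) ** 2
        + 2 * w1 * w2 * rho * sigma1 * sigma2
    )
    # Guard against a tiny negative value caused by floating-point rounding.
    return math.sqrt(max(variance, 0.0))

# Hypothetical inputs: equal weights, both assets with 20% volatility.
for rho in (1.0, 0.5, 0.0, -0.5, -1.0):
    vol = portfolio_volatility(0.5, 0.20, 0.20, rho)
    print(f"rho = {rho:+.1f} -> portfolio volatility = {vol:.4f}")
```

With these inputs the portfolio volatility falls steadily as rho moves from +1 toward -1, which is the diversification effect described above.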

Healthcare

Researchers study correlations between lifestyle factors and health outcomes. A strong correlation flags a candidate risk factor, which then needs controlled studies to confirm.

Marketing

Analysts correlate ad spend with sales as a first check on ROI. A high correlation is consistent with effective advertising, but seasonality or promotions can produce the same pattern.

Implementation

Manual Implementation (No Libraries)

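The Pearson implementation calculates the sum of cross-products of deviations in the numerator and the square root of the product of the sums of squared deviations in the denominator. This equals the covariance divided by the product of the standard deviations, because the n-1 factors cancel.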

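import math

def pearson_correlation(x, y):
    """Return Pearson's r for two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError('x and y must have the same length (at least 2)')
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Numerator: sum of cross-products of deviations from the means
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Denominator: square root of the product of the sums of squared deviations
    sum_sq_x = sum((xi - mean_x)**2 for xi in x)
    sum_sq_y = sum((yi - mean_y)**2 for yi in y)
    denominator = math.sqrt(sum_sq_x * sum_sq_y)
    # r is undefined when either variable is constant; return NaN rather than 0
    return numerator / denominator if denominator != 0 else math.nan

x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2, 4, 5, 7, 8, 10, 11, 13, 14, 16]
print(f'Pearson r: {pearson_correlation(x, y):.4f}')  # Pearson r: 0.9984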

Using Libraries (numpy, scipy)

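import numpy as np
from scipy import stats

# Seeded generator so the example is reproducible
rng = np.random.default_rng(42)
x = rng.standard_normal(100)
y = 2 * x + rng.standard_normal(100) * 0.5  # linear signal plus noise
r, p = stats.pearsonr(x, y)
print(f'Pearson r: {r:.4f}, p-value: {p:.3g}')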

When to Use

✅ Appropriate Use Cases:

Pearson: Use when both variables are continuous and the relationship is approximately linear.
Spearman: Use when the variables are ordinal, or when the relationship is monotonic but not necessarily linear.
Correlation matrix: Use when exploring pairwise relationships among many variables.
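The difference between Pearson and Spearman shows up on data that is monotonic but curved. The sketch below assumes the common definition of Spearman's coefficient as Pearson's r computed on ranks, with ties given their average rank. The helper names pearson, average_ranks, and spearman are made up for this example.

```python
import math

def pearson(x, y):
    """Pearson's r for two equal-length sequences with non-zero variance."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return num / den

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson's r applied to the ranks of x and y."""
    return pearson(average_ranks(x), average_ranks(y))

x = [1, 2, 3, 4, 5, 6]
y = [xi ** 3 for xi in x]  # monotonic but not linear

print(f"Pearson r:    {pearson(x, y):.4f}")
print(f"Spearman rho: {spearman(x, y):.4f}")
```

Spearman reaches 1 here because the ordering is perfectly preserved, while Pearson stays below 1 because the points do not lie on a straight line. In library form, scipy.stats.spearmanr computes the rank correlation and numpy.corrcoef returns a Pearson correlation matrix for several variables.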

❌ Avoid When:

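The goal is to establish causation: correlation alone cannot show that one variable causes the other.
The relationship is non-linear: Pearson's r can be close to 0 even when the variables are strongly related.
Confounding variables are unaccounted for: a third variable may drive both and create a misleading correlation.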
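The non-linear case is easy to reproduce. In the minimal sketch below, y is completely determined by x, yet Pearson's r comes out at zero because the relationship is symmetric rather than linear. The helper name pearson and the data are illustrative assumptions.

```python
import math

def pearson(x, y):
    """Pearson's r for two equal-length sequences with non-zero variance."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return num / den

x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]  # y depends entirely on x, but not linearly

r = pearson(x, y)
print(f"Pearson r for y = x^2 on a symmetric range: {r:.4f}")
```

A scatter plot would reveal the parabola immediately, which is why plotting the data before trusting a coefficient is a sensible habit.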

Common Pitfalls

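Correlation does not imply causation: two variables can move together because of a shared cause, reverse causation, or coincidence.
Outliers can distort correlation: a single extreme point can create a strong r where none exists, or mask one that does.
Range restriction can hide true correlations: sampling only a narrow slice of one variable's range shrinks r toward 0.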
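The outlier and range-restriction pitfalls can both be demonstrated with small hand-built datasets. This is a sketch under assumptions: the data values and the alternating +3/-3 noise pattern are invented so that the example is deterministic, and pearson is a helper defined here.

```python
import math

def pearson(x, y):
    """Pearson's r for two equal-length sequences with non-zero variance."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    den = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return num / den

# Outlier: five points with no linear trend, then one extreme point added.
x_clean = [1, 2, 3, 4, 5]
y_clean = [2, 3, 1, 3, 2]
r_clean = pearson(x_clean, y_clean)
r_outlier = pearson(x_clean + [20], y_clean + [20])
print(f"Without outlier: r = {r_clean:.4f}")
print(f"With outlier:    r = {r_outlier:.4f}")

# Range restriction: y = x plus deterministic noise (+3 on even x, -3 on odd x).
x_full = list(range(20))
y_full = [xi + (3 if xi % 2 == 0 else -3) for xi in x_full]
r_full = pearson(x_full, y_full)

# Keep only the narrow band 8 <= x <= 11 of the same data.
x_narrow = x_full[8:12]
y_narrow = y_full[8:12]
r_narrow = pearson(x_narrow, y_narrow)
print(f"Full range:       r = {r_full:.4f}")
print(f"Restricted range: r = {r_narrow:.4f}")
```

In the first pair, one added point moves r from roughly zero to a strong positive value. In the second, the same underlying relationship looks strong over the full range and nearly absent within the narrow band.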
