Relationship Plots: Visualizing Associations Between Variables
Definition
Relationship plots visualize associations between two or more variables, revealing patterns, trends, clusters, and outliers that numerical summaries alone cannot capture. A correlation coefficient quantifies only the strength of a linear relationship; a plot shows the full picture, including non-linear patterns, heteroscedasticity (unequal spread of y across the range of x), influential outliers, and clusters that indicate subgroups.
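As a rough illustration of why a coefficient alone can mislead, the sketch below compares a straight line with a parabola on synthetic data. The helper name `pearson_r` and the data are assumptions made for this example, not part of any library. The parabola is fully determined by x, yet its Pearson correlation comes out close to zero, while the line's is close to one.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: co-deviation scaled by both spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float(np.sum(dx * dy) / np.sqrt(np.sum(dx**2) * np.sum(dy**2)))

# Assumed synthetic data: x is symmetric around zero
x = np.linspace(-3, 3, 201)
y_linear = 2 * x + 1   # exact straight line
y_curved = x**2        # exact parabola, fully determined by x

print(f"linear:    r = {pearson_r(x, y_linear):.3f}")
print(f"quadratic: r = {pearson_r(x, y_curved):.3f}")
```

A scatter plot of `x` against `y_curved` would show the U-shape immediately, which is the information the coefficient discards.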
Intuition
Imagine you are a detective examining clues. Each variable is a witness. A scatter plot places two witnesses' testimonies side by side, so you can see whether they align, contradict each other, or are unrelated. The plot also shows more than agreement: whether the relationship is straight or curved, and whether there are suspicious outliers.
Mathematical Formula
Step-by-Step Explanation:
- Linear regression: Fits the straight line that minimizes the sum of squared residuals (the vertical distances between observed and fitted y values).
- Slope: Expected change in y per one-unit change in x.
- R-squared: Proportion of the variance in y explained by x.
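In standard notation, the three quantities above for simple (one-predictor) least squares can be written as follows. The symbols are chosen here for illustration: $\bar{x}$ and $\bar{y}$ are sample means, $\hat{y}_i$ is the fitted value for observation $i$, and $\beta_0$, $\beta_1$ are the intercept and slope.

```latex
\hat{y}_i = \beta_0 + \beta_1 x_i

\beta_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\beta_0 = \bar{y} - \beta_1 \bar{x}

R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
```

The slope and intercept expressions are the same ones computed term by term in the manual implementation in this article.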
Real-World Use Cases
Scatter plots reveal the relationship between house size and price, often showing non-linear patterns.
Scatter plots of BMI vs. blood pressure can reveal patient clusters as well as the overall trend.
Correlation heatmaps of asset returns guide portfolio diversification.
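For the asset-return case, a minimal sketch of the matrix behind such a heatmap might look like this. The returns are synthetic and the three "assets" are hypothetical: A and B are assumed to share a common market factor, while C is generated independently.

```python
import numpy as np

rng = np.random.default_rng(0)
n_days = 500

# Assumed synthetic daily returns: A and B share a market factor, C does not
market = rng.normal(0, 0.01, n_days)
returns = np.column_stack([
    market + rng.normal(0, 0.003, n_days),  # asset A
    market + rng.normal(0, 0.003, n_days),  # asset B
    rng.normal(0, 0.01, n_days),            # asset C (independent)
])

# rowvar=False treats each column as one variable, giving a 3x3 matrix
corr = np.corrcoef(returns, rowvar=False)
print(np.round(corr, 2))
```

Under these assumptions A and B come out strongly correlated and C is close to uncorrelated with both, which is the pattern a diversification-minded reader would look for. The matrix can be passed to a heatmap function such as seaborn's `heatmap` for display.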
Implementation
Manual Implementation (No Regression Libraries)
import numpy as np
import matplotlib.pyplot as plt

def linear_regression(x, y):
    # Closed-form least squares for a single predictor
    x_mean = np.mean(x)
    y_mean = np.mean(y)
    # Slope = sum of cross-deviations / sum of squared x-deviations
    slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum((xi - x_mean)**2 for xi in x)
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Synthetic data: true slope 2, noise standard deviation 0.5
np.random.seed(42)
x = np.random.randn(100)
y = 2*x + np.random.randn(100) * 0.5

slope, intercept = linear_regression(x, y)
print(f'y = {slope:.2f}x + {intercept:.2f}')

output_dir = '.'  # assumption: save to the current directory
plt.scatter(x, y, alpha=0.6)
plt.plot(x, slope*x + intercept, 'r-')
plt.savefig(f'{output_dir}/scatter_regression.png')
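The manual example stops at the slope and intercept. As a hedged sketch of how R-squared could be computed for the same kind of data, the block below defines an illustrative helper, `r_squared` (a name chosen for this example). To keep the block runnable on its own, it uses `np.polyfit` in place of the hand-written fit; both compute an ordinary least-squares line.

```python
import numpy as np

def r_squared(x, y, slope, intercept):
    """R^2 = 1 - SS_res / SS_tot for the line y = slope * x + intercept."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    residuals = y - (slope * x + intercept)
    ss_res = np.sum(residuals**2)         # variation left unexplained by the line
    ss_tot = np.sum((y - y.mean())**2)    # total variation in y
    return float(1.0 - ss_res / ss_tot)

# Same synthetic setup as the manual example (true slope 2, noise sd 0.5)
np.random.seed(42)
x = np.random.randn(100)
y = 2 * x + np.random.randn(100) * 0.5

# np.polyfit with degree 1 returns [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
print(f"R^2 = {r_squared(x, y, slope, intercept):.3f}")
```

With noise this small relative to the slope, the value should land close to, but below, 1.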
Using Libraries (numpy, pandas, matplotlib, seaborn)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(42)
x = np.random.randn(100)
# y depends on x (true slope 2) plus unit-variance noise
df = pd.DataFrame({'x': x, 'y': 2*x + np.random.randn(100)})

# Scatter plot with a fitted regression line and confidence band
sns.regplot(data=df, x='x', y='y', scatter_kws={'alpha':0.5})
plt.title('Scatter with Regression')
plt.show()

# Pairwise Pearson correlations, annotated and centered at zero
sns.heatmap(df.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Heatmap')
plt.show()
When to Use
✅ Appropriate Use Cases:
- Scatter plot: When exploring the relationship between two continuous variables
- Hexbin plot: When a dataset is large enough that points overlap and density matters more than individual points
- Bubble chart: When encoding a third variable as marker size
- Heatmap: When examining pairwise relationships among many variables
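To illustrate the scatter versus hexbin choice from the list above, here is a small sketch on assumed synthetic data (50,000 points). The grid size, colormap, and alpha value are arbitrary choices for the example, and the `Agg` backend line is only there so the block runs without a display.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: headless rendering; remove for interactive use
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=n)
y = 2 * x + rng.normal(size=n)

fig, (ax_scatter, ax_hex) = plt.subplots(1, 2, figsize=(10, 4))

# Left: scatter with strong transparency so dense regions appear darker
ax_scatter.scatter(x, y, s=2, alpha=0.05)
ax_scatter.set_title("Scatter (alpha=0.05)")

# Right: hexbin counts the points in each hexagonal cell and colors by count
hb = ax_hex.hexbin(x, y, gridsize=40, cmap="viridis")
fig.colorbar(hb, ax=ax_hex, label="points per bin")
ax_hex.set_title("Hexbin")

counts = hb.get_array()  # one count per hexagon
print(f"densest bin holds {int(counts.max())} points")
# fig.savefig("hexbin_vs_scatter.png")  # uncomment to write the figure to disk
```

The hexbin panel makes the density explicit through the color scale, whereas the scatter panel relies on overlapping transparent markers to suggest it.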
❌ Avoid When:
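- Do not draw scatter plots of more than 10,000 points without transparency or binning; overlapping markers hide the density
- Do not assume a relationship is linear; check the scatter plot for curvature before fitting a straight line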
- Do not confuse correlation with causation
Common Pitfalls
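- Overplotting with large datasets: dense regions saturate and hide the underlying structure
- Ignoring outliers: a single extreme point can change the fitted line noticeably
- Confusing correlation with causation: an association may be driven by a third variable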
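The outlier pitfall can be sketched with a few lines of code. The data here are synthetic, and the added point at (30, 0) is a hypothetical high-leverage outlier chosen to make the effect obvious; `np.polyfit` is used as a convenient least-squares fit.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 10, 30)
y = 2 * x + rng.normal(0, 1, 30)   # true slope is 2

# Fit on the clean data; polyfit with degree 1 returns [slope, intercept]
slope_clean, _ = np.polyfit(x, y, 1)

# Add one hypothetical outlier far to the right and far below the trend
x_out = np.append(x, 30.0)
y_out = np.append(y, 0.0)
slope_out, _ = np.polyfit(x_out, y_out, 1)

print(f"slope without outlier: {slope_clean:.2f}")
print(f"slope with outlier:    {slope_out:.2f}")
```

One point out of 31 is enough to pull the fitted slope well below the true value, which is why outliers should be inspected on the plot rather than silently kept or dropped.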