Linear Regression: Theory, Assumptions, and Diagnostics
Definition
Linear Regression is the foundational algorithm of statistical modeling and machine learning, establishing a linear relationship between a dependent variable and one or more independent variables. The algorithm assumes that the target variable can be expressed as a weighted sum of input features plus an error term. Mathematically, this is expressed as y = Xβ + ε, where y is the target vector, X is the design matrix of features, \(\beta\) represents the coefficients (parameters to learn), and ε captures random noise. Ordinary Least Squares (OLS) is the standard method for estimating coefficients by minimizing the sum of squared residuals. Despite its simplicity, linear regression remains indispensable due to its interpretability, computational efficiency, and role as a baseline for more complex models. Understanding its assumptions - linearity, independence, homoscedasticity, and normality of residuals - is crucial for proper application.
Intuition
Imagine throwing darts at a dartboard and wanting to find the straight line that passes as close as possible to all the holes. Linear regression finds the 'line of best fit' through your data points by minimizing the total squared vertical distance from each point to the line. Think of it like adjusting a seesaw: you're finding the perfect balance point (intercept) and slope that best explains how your input variables relate to your output.
Mathematical Formula
Step-by-Step Explanation:
- β₀ is the intercept: the predicted value when all features equal zero
- βᵢ represents the change in y for a one-unit change in feature xᵢ, holding other features constant
- The OLS formula finds coefficients that minimize the sum of squared residuals
- XᵀX must be invertible (no perfect multicollinearity among features)
- MSE measures average squared prediction error - lower is better
- R² indicates the proportion of variance explained (0 to 1, higher is better)
Real-World Use Cases
Predicting house prices based on square footage, bedrooms, location, and age. Coefficients reveal how much each feature adds to the price in dollars.
Forecasting GDP growth using leading indicators like unemployment rates, consumer confidence, and manufacturing indices.
Estimating customer lifetime value based on acquisition channel, first purchase amount, and engagement metrics.
Predicting patient length of stay using admission vitals, diagnosis codes, and comorbidities for resource planning.
Implementation
Manual Implementation (No Libraries)
import numpy as np
import matplotlib.pyplot as plt
class LinearRegression:
"""
Manual implementation of Linear Regression using Ordinary Least Squares.
Includes fit, predict, and diagnostic methods.
"""
def __init__(self, fit_intercept=True):
self.fit_intercept = fit_intercept
self.coef_ = None
self.intercept_ = None
self.residuals_ = None
self.mse_ = None
self.r_squared_ = None
def fit(self, X, y):
"""Fit linear model using OLS."""
# Add intercept column if needed
if self.fit_intercept:
X_with_intercept = np.column_stack([np.ones(X.shape[0]), X])
else:
X_with_intercept = X
# Closed-form solution: \(\beta\) = (XᵀX)⁻¹Xᵀy
# Use pseudo-inverse for numerical stability
self.beta_ = np.linalg.pinv(X_with_intercept.T @ X_with_intercept) @ X_with_intercept.T @ y
if self.fit_intercept:
self.intercept_ = self.beta_[0]
self.coef_ = self.beta_[1:]
else:
self.intercept_ = 0
self.coef_ = self.beta_
# Calculate predictions and diagnostics
y_pred = self.predict(X)
self.residuals_ = y - y_pred
self.mse_ = np.mean(self.residuals_ ** 2)
ss_res = np.sum(self.residuals_ ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
self.r_squared_ = 1 - (ss_res / ss_tot) if ss_tot > 0 else 0
return self
def predict(self, X):
"""Predict using the linear model."""
if self.coef_ is None:
raise ValueError('Model must be fitted before prediction')
return self.intercept_ + X @ self.coef_
def diagnostics(self):
"""Return model diagnostics."""
return {
'mse': self.mse_,
'rmse': np.sqrt(self.mse_),
'r_squared': self.r_squared_,
'residual_std': np.std(self.residuals_)
}
# Demonstration
if __name__ == '__main__':
np.random.seed(42)
# Generate synthetic data: y = 2 + 3x + noise
X = np.random.randn(100, 1)
y = 2 + 3 * X.ravel() + np.random.randn(100) * 0.5
# Fit model
model = LinearRegression()
model.fit(X, y)
print(f'Intercept: {model.intercept_:.3f} (expected: 2.0)')
print(f'Coefficient: {model.coef_[0]:.3f} (expected: 3.0)')
print(f'R²: {model.r_squared_:.3f}')
print(f'RMSE: {np.sqrt(model.mse_):.3f}')
Using Libraries (scikit-learn, numpy, pandas)
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import pandas as pd
# Load California housing dataset (real-world regression data)
housing = fetch_california_housing()
X, y = housing.data, housing.target
# Split data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Scale features (important for regularized versions)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
# Standard Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)
print('=== Standard Linear Regression ===')
print(f'R² Score: {r2_score(y_test, y_pred):.3f}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}')
# Feature importance (coefficients)
feature_importance = pd.DataFrame({
'feature': housing.feature_names,
'coefficient': lr.coef_
}).sort_values('coefficient', key=abs, ascending=False)
print('
Top Features by Importance:')
print(feature_importance.head())
# Regularized Regression - Ridge (L2 penalty)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
print(f'
=== Ridge Regression (L2) ===')
print(f'R² Score: {ridge.score(X_test_scaled, y_test):.3f}')
# Regularized Regression - Lasso (L1 penalty)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)
print(f'
=== Lasso Regression (L1) ===')
print(f'R² Score: {lasso.score(X_test_scaled, y_test):.3f}')
print(f'Non-zero coefficients: {np.sum(lasso.coef_ != 0)}')
# Cross-validation for robust evaluation
cv_scores = cross_val_score(lr, X, y, cv=5, scoring='r2')
print(f'
=== Cross-Validation ===')
print(f'CV R²: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})')
When to Use
✅ Appropriate Use Cases:
- Baseline modeling: Establish a performance floor before trying complex models
- Interpretability required: Coefficients directly show feature impact
- Linear relationship between features and target (verify with residual plots)
- Small to medium datasets where overfitting is a concern
- Feature selection with Lasso regularization
- Economic or scientific modeling where theoretical relationships are linear
❌ Avoid When:
- Clear non-linear relationships (use polynomial features or non-linear models)
- High multicollinearity without regularization (Ridge recommended)
- Categorical targets (use Logistic Regression instead)
- Complex interactions between features (consider tree-based models)
- Outliers present without robust regression techniques
Common Pitfalls
- Ignoring assumptions: Not checking linearity, homoscedasticity, normality of residuals
- Multicollinearity: Highly correlated features inflate variance and reduce interpretability
- Outliers: Extreme values disproportionately influence OLS estimates
- Extrapolation: Predicting outside the range of training data is unreliable
- Omitted variable bias: Missing relevant correlated features biases coefficients
- Data leakage: Including derived features that incorporate the target
- Not scaling features: Required for regularized versions (Ridge, Lasso, ElasticNet)