Linear Regression: Theory, Assumptions, and Diagnostics

Definition

Linear Regression is the foundational algorithm of statistical modeling and machine learning, establishing a linear relationship between a dependent variable and one or more independent variables. The algorithm assumes that the target variable can be expressed as a weighted sum of input features plus an error term. Mathematically, this is expressed as y = Xβ + ε, where y is the target vector, X is the design matrix of features, \(\beta\) represents the coefficients (parameters to learn), and ε captures random noise. Ordinary Least Squares (OLS) is the standard method for estimating coefficients by minimizing the sum of squared residuals. Despite its simplicity, linear regression remains indispensable due to its interpretability, computational efficiency, and role as a baseline for more complex models. Understanding its assumptions - linearity, independence, homoscedasticity, and normality of residuals - is crucial for proper application.

Intuition

💡

Imagine throwing darts at a dartboard and wanting to find the straight line that passes as close as possible to all the holes. Linear regression finds the 'line of best fit' through your data points by minimizing the total squared vertical distance from each point to the line. Think of it like adjusting a seesaw: you're finding the perfect balance point (intercept) and slope that best explains how your input variables relate to your output.

Mathematical Formula

Model Equation:

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n + \epsilon

Matrix Form:

\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}

Ordinary Least Squares (Closed Form):

\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}

Mean Squared Error (Loss):

MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = \frac{1}{n}\sum_{i=1}^{n}\epsilon_i^2

R-squared (Coefficient of Determination):

R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum(y_i - \hat{y}_i)^2}{\sum(y_i - \bar{y})^2}

Step-by-Step Explanation:

β₀ is the intercept: the predicted value when all features equal zero
βᵢ represents the change in y for a one-unit change in feature xᵢ, holding other features constant
The OLS formula finds coefficients that minimize the sum of squared residuals
XᵀX must be invertible (no perfect multicollinearity among features)
MSE measures average squared prediction error - lower is better
R² indicates the proportion of variance explained (0 to 1, higher is better)

Real-World Use Cases

Real Estate

Predicting house prices based on square footage, bedrooms, location, and age. Coefficients reveal how much each feature adds to the price in dollars.

Economics

Forecasting GDP growth using leading indicators like unemployment rates, consumer confidence, and manufacturing indices.

Marketing

Estimating customer lifetime value based on acquisition channel, first purchase amount, and engagement metrics.

Healthcare

Predicting patient length of stay using admission vitals, diagnosis codes, and comorbidities for resource planning.

Implementation

Manual Implementation (No Libraries)

This implementation demonstrates the closed-form OLS solution using matrix operations. The pseudo-inverse ensures numerical stability. Key diagnostics (MSE, R², residuals) help assess model quality and verify assumptions.

import numpy as np
import matplotlib.pyplot as plt

class LinearRegression:
    """
    Manual implementation of Linear Regression using Ordinary Least Squares.
    Includes fit, predict, and diagnostic methods.
    """
    
    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept
        self.coef_ = None
        self.intercept_ = None
        self.residuals_ = None
        self.mse_ = None
        self.r_squared_ = None
    
    def fit(self, X, y):
        """Fit linear model using OLS."""
        # Add intercept column if needed
        if self.fit_intercept:
            X_with_intercept = np.column_stack([np.ones(X.shape[0]), X])
        else:
            X_with_intercept = X
        
        # Closed-form solution: \(\beta\) = (XᵀX)⁻¹Xᵀy
        # Use pseudo-inverse for numerical stability
        self.beta_ = np.linalg.pinv(X_with_intercept.T @ X_with_intercept) @ X_with_intercept.T @ y
        
        if self.fit_intercept:
            self.intercept_ = self.beta_[0]
            self.coef_ = self.beta_[1:]
        else:
            self.intercept_ = 0
            self.coef_ = self.beta_
        
        # Calculate predictions and diagnostics
        y_pred = self.predict(X)
        self.residuals_ = y - y_pred
        self.mse_ = np.mean(self.residuals_ ** 2)
        ss_res = np.sum(self.residuals_ ** 2)
        ss_tot = np.sum((y - np.mean(y)) ** 2)
        self.r_squared_ = 1 - (ss_res / ss_tot) if ss_tot > 0 else 0
        
        return self
    
    def predict(self, X):
        """Predict using the linear model."""
        if self.coef_ is None:
            raise ValueError('Model must be fitted before prediction')
        return self.intercept_ + X @ self.coef_
    
    def diagnostics(self):
        """Return model diagnostics."""
        return {
            'mse': self.mse_,
            'rmse': np.sqrt(self.mse_),
            'r_squared': self.r_squared_,
            'residual_std': np.std(self.residuals_)
        }

# Demonstration
if __name__ == '__main__':
    np.random.seed(42)
    
    # Generate synthetic data: y = 2 + 3x + noise
    X = np.random.randn(100, 1)
    y = 2 + 3 * X.ravel() + np.random.randn(100) * 0.5
    
    # Fit model
    model = LinearRegression()
    model.fit(X, y)
    
    print(f'Intercept: {model.intercept_:.3f} (expected: 2.0)')
    print(f'Coefficient: {model.coef_[0]:.3f} (expected: 3.0)')
    print(f'R²: {model.r_squared_:.3f}')
    print(f'RMSE: {np.sqrt(model.mse_):.3f}')

Using Libraries (scikit-learn, numpy, pandas)

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import fetch_california_housing
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
import pandas as pd

# Load California housing dataset (real-world regression data)
housing = fetch_california_housing()
X, y = housing.data, housing.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale features (important for regularized versions)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Standard Linear Regression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

print('=== Standard Linear Regression ===')
print(f'R² Score: {r2_score(y_test, y_pred):.3f}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.3f}')

# Feature importance (coefficients)
feature_importance = pd.DataFrame({
    'feature': housing.feature_names,
    'coefficient': lr.coef_
}).sort_values('coefficient', key=abs, ascending=False)
print('
Top Features by Importance:')
print(feature_importance.head())

# Regularized Regression - Ridge (L2 penalty)
ridge = Ridge(alpha=1.0)
ridge.fit(X_train_scaled, y_train)
print(f'
=== Ridge Regression (L2) ===')
print(f'R² Score: {ridge.score(X_test_scaled, y_test):.3f}')

# Regularized Regression - Lasso (L1 penalty)
lasso = Lasso(alpha=0.1)
lasso.fit(X_train_scaled, y_train)
print(f'
=== Lasso Regression (L1) ===')
print(f'R² Score: {lasso.score(X_test_scaled, y_test):.3f}')
print(f'Non-zero coefficients: {np.sum(lasso.coef_ != 0)}')

# Cross-validation for robust evaluation
cv_scores = cross_val_score(lr, X, y, cv=5, scoring='r2')
print(f'
=== Cross-Validation ===')
print(f'CV R²: {cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})')

When to Use

✅ Appropriate Use Cases:

Baseline modeling: Establish a performance floor before trying complex models
Interpretability required: Coefficients directly show feature impact
Linear relationship between features and target (verify with residual plots)
Small to medium datasets where overfitting is a concern
Feature selection with Lasso regularization
Economic or scientific modeling where theoretical relationships are linear

❌ Avoid When:

Clear non-linear relationships (use polynomial features or non-linear models)
High multicollinearity without regularization (Ridge recommended)
Categorical targets (use Logistic Regression instead)
Complex interactions between features (consider tree-based models)
Outliers present without robust regression techniques

Common Pitfalls

Ignoring assumptions: Not checking linearity, homoscedasticity, normality of residuals
Multicollinearity: Highly correlated features inflate variance and reduce interpretability
Outliers: Extreme values disproportionately influence OLS estimates
Extrapolation: Predicting outside the range of training data is unreliable
Omitted variable bias: Missing relevant correlated features biases coefficients
Data leakage: Including derived features that incorporate the target
Not scaling features: Required for regularized versions (Ridge, Lasso, ElasticNet)