Logistic Regression: Classification, Sigmoid, and Odds Ratios

Definition

Logistic regression is a foundational classification algorithm in machine learning, despite a name that suggests regression. It models the probability that an input belongs to a particular class using the logistic (sigmoid) function, which maps any real number to the interval (0, 1). The coefficients are estimated by Maximum Likelihood Estimation (MLE), that is, by finding the parameters that maximize the probability of observing the training labels. For binary classification with the default threshold, the decision boundary lies where the predicted probability equals 0.5, which is a linear separator (a hyperplane) in feature space. Because it is trained by minimizing log-loss, logistic regression usually produces reasonably well-calibrated probabilities when the model is well specified, which makes it useful for applications that need confidence scores. The model is linear in the log-odds (logit), ln(p / (1 - p)), which enables interpretation through odds ratios: the multiplicative change in the odds for a one-unit increase in a feature, holding the other features fixed.

Intuition

Imagine a dimmer switch that gradually transitions from off to on. Logistic regression is like finding the position where the switch flips, but instead of a sharp cutoff, it gives you a smooth probability curve. Think of it as drawing a line through your data that doesn't just predict categories, but tells you how confident it is - 'I'm 90% sure this is spam' versus 'This could go either way at 51%'.

Mathematical Formula

Sigmoid (Logistic) Function:

\sigma(z) = \frac{1}{1 + e^{-z}}

Probability of Positive Class:

P(y=1|x) = \frac{1}{1 + e^{-(\beta_0 + \beta^T x)}}

Log-Odds (Logit):

\ln\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_d x_d, \quad p = P(y=1|x)

Cross-Entropy Loss (Binary):

L(\beta) = -\frac{1}{n}\sum_{i=1}^{n} [y_i \ln(p_i) + (1-y_i) \ln(1-p_i)]
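
The gradient that gradient descent needs follows from this loss by the chain rule. As a sketch, writing \(z_i = \beta_0 + \beta^T x_i\) and \(p_i = \sigma(z_i)\) (the index notation \(x_{ij}\) for feature j of sample i is introduced here for illustration):

```latex
\frac{d\sigma}{dz} = \sigma(z)\,\bigl(1 - \sigma(z)\bigr)

\frac{\partial L}{\partial z_i}
  = -\frac{1}{n}\left[\frac{y_i}{p_i} - \frac{1 - y_i}{1 - p_i}\right] p_i (1 - p_i)
  = \frac{1}{n}\,(p_i - y_i)

\frac{\partial L}{\partial \beta_j} = \frac{1}{n}\sum_{i=1}^{n} (p_i - y_i)\, x_{ij},
\qquad
\frac{\partial L}{\partial \beta_0} = \frac{1}{n}\sum_{i=1}^{n} (p_i - y_i)
```

The sigmoid derivative cancels the denominators from the log terms, which is why the gradient reduces to the prediction error \(p_i - y_i\) weighted by the feature value.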

Odds Ratio for feature j:

OR_j = e^{\beta_j}
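
To see where this comes from, exponentiate the logit equation to get the odds, then compare the odds before and after increasing feature j by one unit with every other feature held fixed. The shorthand \(\mathrm{odds}(x)\) and the unit vector \(e_j\) are notation assumed here for the sketch:

```latex
\mathrm{odds}(x) = \frac{p}{1-p} = e^{\beta_0 + \beta^T x}

\frac{\mathrm{odds}(x + e_j)}{\mathrm{odds}(x)}
  = \frac{e^{\beta_0 + \beta^T x + \beta_j}}{e^{\beta_0 + \beta^T x}}
  = e^{\beta_j}
```

The ratio does not depend on x, so the same multiplicative factor applies at every starting value of the feature.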

Decision Boundary (where p=0.5):

\beta_0 + \beta^T x = 0

Step-by-Step Explanation:

The sigmoid function squishes any value into a probability between 0 and 1
Log-odds (logit) is linear in features, making it interpretable like linear regression
Cross-entropy loss penalizes confident wrong predictions more heavily than uncertain ones
Coefficients \(\beta\) represent the change in log-odds per one-unit change in the feature, holding the other features fixed
Odds ratio \(e^\beta\) tells us how much the odds multiply for a one-unit feature increase
The decision boundary is linear - a hyperplane in feature space
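
The points above can be checked numerically. The following is a minimal sketch using only the standard library; the coefficients `beta0` and `beta1` are made-up values chosen for illustration, not taken from any fitted model.

```python
import math

def sigmoid(z):
    """Map a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability; the inverse of the sigmoid."""
    return math.log(p / (1.0 - p))

# Hypothetical one-feature model: log-odds = beta0 + beta1 * x
beta0, beta1 = -1.0, 0.7

def odds(x):
    p = sigmoid(beta0 + beta1 * x)
    return p / (1.0 - p)

# The logit undoes the sigmoid, so the log-odds are linear in x.
z = 0.3
roundtrip = logit(sigmoid(z))

# A one-unit increase multiplies the odds by exp(beta1),
# regardless of the starting value of x.
odds_ratio = math.exp(beta1)
ratio_at_0 = odds(1.0) / odds(0.0)
ratio_at_5 = odds(6.0) / odds(5.0)

# Cross-entropy for a true label y = 1: a confident wrong prediction
# (p = 0.01) costs far more than an uncertain one (p = 0.4).
loss_confident_wrong = -math.log(0.01)
loss_uncertain = -math.log(0.4)

print(f'sigmoid(0) = {sigmoid(0.0)}')
print(f'logit(sigmoid({z})) = {roundtrip:.6f}')
print(f'exp(beta1) = {odds_ratio:.6f}')
print(f'odds ratio from x=0 to 1: {ratio_at_0:.6f}')
print(f'odds ratio from x=5 to 6: {ratio_at_5:.6f}')
print(f'loss at p=0.01: {loss_confident_wrong:.3f}, at p=0.4: {loss_uncertain:.3f}')
```

The value `sigmoid(0) = 0.5` is the decision boundary case: a linear score of zero corresponds to a predicted probability of exactly one half.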

Real-World Use Cases

Healthcare

Predicting diabetes risk based on age, BMI, blood pressure, and glucose levels. Odds ratios show which factors most increase risk.

Finance

Credit default prediction using income, debt-to-income ratio, and credit history. Output probabilities inform interest rate pricing.

Marketing

Customer churn prediction identifying subscribers likely to cancel. Probabilities prioritize retention efforts.

Security

Fraud detection flagging suspicious transactions. Predicted probabilities let the alert threshold be tuned to trade false alarms against missed fraud.

Implementation

Manual Implementation (No Libraries)

This implementation from scratch shows how gradient descent optimizes the cross-entropy loss. The sigmoid converts linear predictions to probabilities, and gradients are computed via chain rule. Odds ratios provide interpretable feature importance.

import numpy as np

class LogisticRegression:
    """
    Manual implementation of Logistic Regression using gradient descent.
    Demonstrates the core mathematics: sigmoid, log-loss, and optimization.
    """
    
    def __init__(self, learning_rate=0.1, max_iter=1000, tol=1e-6):
        self.lr = learning_rate
        self.max_iter = max_iter
        self.tol = tol
        self.weights = None
        self.bias = None
        self.loss_history = []
    
    def _sigmoid(self, z):
        """Numerically stable sigmoid function."""
        # Clip z to prevent overflow
        z = np.clip(z, -500, 500)
        return 1 / (1 + np.exp(-z))
    
    def _compute_loss(self, y_true, y_pred):
        """Binary cross-entropy loss."""
        # Add epsilon to prevent log(0)
        epsilon = 1e-15
        y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    
    def fit(self, X, y):
        """Train using gradient descent."""
        n_samples, n_features = X.shape
        
        # Initialize parameters (reset the history so repeated fit() calls start clean)
        self.weights = np.zeros(n_features)
        self.bias = 0.0
        self.loss_history = []
        
        # Gradient descent
        for i in range(self.max_iter):
            # Forward pass
            linear_model = np.dot(X, self.weights) + self.bias
            y_pred = self._sigmoid(linear_model)
            
            # Compute loss
            loss = self._compute_loss(y, y_pred)
            self.loss_history.append(loss)
            
            # Backward pass (compute gradients)
            dw = (1 / n_samples) * np.dot(X.T, (y_pred - y))
            db = (1 / n_samples) * np.sum(y_pred - y)
            
            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
            
            # Check convergence
            if i > 0 and abs(self.loss_history[-2] - loss) < self.tol:
                break
        
        return self
    
    def predict_proba(self, X):
        """Predict class probabilities."""
        linear_model = np.dot(X, self.weights) + self.bias
        return self._sigmoid(linear_model)
    
    def predict(self, X):
        """Predict class labels (0 or 1)."""
        return (self.predict_proba(X) >= 0.5).astype(int)
    
    def get_odds_ratios(self):
        """Return odds ratios (exp of coefficients)."""
        return np.exp(self.weights)

# Demonstration
if __name__ == '__main__':
    np.random.seed(42)
    
    # Generate synthetic binary classification data
    n_samples = 200
    X = np.random.randn(n_samples, 2)
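    # True decision boundary: 2*x1 + 3*x2 > 0
    # Labels are noise-free, so the classes are perfectly separable and the
    # fitted weights (and odds ratios) keep growing with more iterations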
    y = (2 * X[:, 0] + 3 * X[:, 1] > 0).astype(int)
    
    # Split data
    split = int(0.8 * n_samples)
    X_train, X_test = X[:split], X[split:]
    y_train, y_test = y[:split], y[split:]
    
    # Train model
    model = LogisticRegression(learning_rate=0.5, max_iter=1000)
    model.fit(X_train, y_train)
    
    # Evaluate
    y_pred = model.predict(X_test)
    accuracy = np.mean(y_pred == y_test)
    
    print(f'Accuracy: {accuracy:.3f}')
    print(f'Weights: {model.weights}')
    print(f'Bias: {model.bias:.3f}')
    print(f'Odds Ratios: {model.get_odds_ratios()}')
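
One way to gain confidence in the gradient expressions used in `fit` is a finite-difference check. The sketch below is independent of the class above and uses only the standard library; the tiny dataset, the parameter values, and the helper names (`loss`, `analytic_grad`, `numeric_grad`) are all made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, xi):
    """Predicted probability for one sample."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)

def loss(w, b, X, y):
    """Mean binary cross-entropy."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = predict(w, b, xi)
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return -total / len(X)

def analytic_grad(w, b, X, y):
    """Same formulas as dw and db in fit: mean of (p - y) * x and of (p - y)."""
    n = len(X)
    dw = [0.0] * len(w)
    db = 0.0
    for xi, yi in zip(X, y):
        err = predict(w, b, xi) - yi
        for j, xj in enumerate(xi):
            dw[j] += err * xj / n
        db += err / n
    return dw, db

def numeric_grad(w, b, X, y, h=1e-6):
    """Central finite differences of the loss."""
    dw = []
    for j in range(len(w)):
        up = w[:j] + [w[j] + h] + w[j + 1:]
        down = w[:j] + [w[j] - h] + w[j + 1:]
        dw.append((loss(up, b, X, y) - loss(down, b, X, y)) / (2 * h))
    db = (loss(w, b + h, X, y) - loss(w, b - h, X, y)) / (2 * h)
    return dw, db

# Hypothetical data and parameters
X = [[0.5, -1.2], [1.5, 0.3], [-0.7, 0.8], [-1.1, -0.4]]
y = [1, 1, 0, 0]
w, b = [0.2, -0.1], 0.05

dw_a, db_a = analytic_grad(w, b, X, y)
dw_n, db_n = numeric_grad(w, b, X, y)
max_diff = max(abs(a - n) for a, n in zip(dw_a + [db_a], dw_n + [db_n]))
print(f'max |analytic - numeric| = {max_diff:.2e}')
```

If the analytic gradient were wrong (for example, a flipped sign on `y_pred - y`), the difference would be on the order of the gradient itself rather than close to zero.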

Using Libraries (scikit-learn, numpy, pandas, matplotlib)

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_breast_cancer, make_classification
from sklearn.metrics import (
    accuracy_score, classification_report, confusion_matrix,
    roc_auc_score, roc_curve, precision_recall_curve
)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load real-world dataset (Breast Cancer Wisconsin)
data = load_breast_cancer()
X, y = data.data, data.target

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Scale features (fit on training data only; the penalty assumes comparable scales)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train Logistic Regression
# liblinear solver supports L1 and L2, saga supports all penalties
clf = LogisticRegression(
    penalty='l2',           # L2 regularization
    C=1.0,                  # Inverse of regularization strength
    solver='lbfgs',         # Optimization algorithm
    max_iter=1000,
    random_state=42
)
clf.fit(X_train_scaled, y_train)

# Predictions
y_pred = clf.predict(X_test_scaled)
y_prob = clf.predict_proba(X_test_scaled)[:, 1]

# Evaluation
print('=== Classification Performance ===')
print(f'Accuracy: {accuracy_score(y_test, y_pred):.3f}')
print(f'ROC-AUC: {roc_auc_score(y_test, y_prob):.3f}')
print('\nConfusion Matrix:')
print(confusion_matrix(y_test, y_pred))
print('\nClassification Report:')
print(classification_report(y_test, y_pred, target_names=data.target_names))

# Feature importance via coefficients
# (features are standardized, so each odds ratio is per one-standard-deviation increase)
feature_df = pd.DataFrame({
    'feature': data.feature_names,
    'coefficient': clf.coef_[0],
    'abs_coefficient': np.abs(clf.coef_[0]),
    'odds_ratio': np.exp(clf.coef_[0])
}).sort_values('abs_coefficient', ascending=False)

print('\n=== Top 10 Most Important Features (by |coefficient|) ===')
print(feature_df.head(10)[['feature', 'coefficient', 'odds_ratio']])

# Cross-validation
cv_scores = cross_val_score(clf, X_train_scaled, y_train, cv=5, scoring='roc_auc')
print('\n=== Cross-Validation ROC-AUC ===')
print(f'{cv_scores.mean():.3f} (+/- {cv_scores.std() * 2:.3f})')

# Hyperparameter tuning
param_grid = {'C': [0.001, 0.01, 0.1, 1, 10, 100], 'penalty': ['l1', 'l2']}
# Note: saga solver supports both L1 and L2
grid_search = GridSearchCV(
    LogisticRegression(solver='saga', max_iter=2000, random_state=42),
    param_grid,
    cv=5,
    scoring='roc_auc'
)
grid_search.fit(X_train_scaled, y_train)
print('\n=== Best Parameters ===')
print(grid_search.best_params_)
print(f'Best CV ROC-AUC: {grid_search.best_score_:.3f}')
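
Because the features in the code above were standardized with `StandardScaler`, each exponentiated coefficient is an odds ratio per one standard deviation of the feature, not per raw unit. The sketch below shows the conversion with made-up numbers; `coef_std` and `feature_std` are hypothetical values standing in for an entry of `clf.coef_[0]` and the matching entry of the scaler's `scale_` attribute.

```python
import math

# Hypothetical values for one feature (not taken from the dataset above)
coef_std = 0.9      # coefficient learned on the standardized feature
feature_std = 4.0   # standard deviation of the raw feature in the training set

# Odds ratio for a one-standard-deviation increase
or_per_sd = math.exp(coef_std)

# Standardizing divides the raw feature by its standard deviation, so the
# coefficient on the raw scale is coef_std / feature_std.
or_per_unit = math.exp(coef_std / feature_std)

print(f'Odds ratio per standard deviation: {or_per_sd:.3f}')
print(f'Odds ratio per raw unit:           {or_per_unit:.3f}')
```

Reporting which of the two is meant matters when odds ratios are compared across features or against published values, since the per-unit figure depends on the measurement units.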

When to Use

✅ Appropriate Use Cases:

Binary classification with need for probability estimates
Interpretable model requirements (coefficients show feature impact on log-odds)
Baseline for classification tasks before trying complex models
When you need odds ratios (healthcare, social sciences)
Classes that are approximately linearly separable in feature space (perfectly separable data needs regularization to keep coefficients finite)
Well-calibrated probability outputs needed (medical diagnosis, risk scoring)

❌ Avoid When:

Complex non-linear decision boundaries (use SVM with kernels or tree-based models)
High-dimensional sparse data with many irrelevant features, unless L1 regularization is used to prune them
Regression problems (despite the name, this is for classification)
Severe class imbalance without proper weighting or resampling
When features have complex interactions (neural networks or ensembles may work better)

Common Pitfalls

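Not scaling features: The regularization penalty treats all coefficients alike, so features should be on a comparable scale
Complete separation: Perfectly separable classes cause unregularized coefficient estimates to diverge
Multicollinearity: Highly correlated features inflate standard errors and make individual coefficients unstable
Class imbalance: Skewed classes bias predictions toward the majority class
Ignoring probability calibration: Predicted probabilities can be miscalibrated (for example under strong regularization or after resampling), so check them against observed frequencies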
Linear assumption: Model assumes linear relationship in log-odds space
Convergence warnings: Increase max_iter or check for perfect separation
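
The complete-separation pitfall can be reproduced in a few lines. The following is a minimal sketch, assuming a one-feature model with no intercept and no regularization; the data and the helper `fit_weight` are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_weight(xs, ys, n_iter, lr=0.5):
    """Unregularized 1-D logistic regression (no intercept) by gradient descent."""
    w = 0.0
    for _ in range(n_iter):
        grad = sum((sigmoid(w * x) - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= lr * grad
    return w

xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
separable = [0, 0, 0, 1, 1, 1]     # x > 0 predicts the label perfectly
overlapping = [0, 0, 1, 0, 1, 1]   # two labels near zero are swapped

w_sep = {n: fit_weight(xs, separable, n) for n in (100, 1000, 10000)}
w_ovl = {n: fit_weight(xs, overlapping, n) for n in (100, 1000, 10000)}

for n in (100, 1000, 10000):
    print(f'iterations={n:>5}  separable w={w_sep[n]:.3f}  overlapping w={w_ovl[n]:.3f}')
```

On the separable labels the weight keeps increasing with more iterations, because a larger weight always raises the likelihood; on the overlapping labels it settles at a finite value. Adding an L2 penalty, as scikit-learn does by default, is one common way to keep the separable case finite.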