Gradient Descent

Intermediate Optimization
~6 min read Optimization
Prerequisites:

Definition

Gradient Descent is the fundamental optimization algorithm used to minimize differentiable loss functions in machine learning. It iteratively adjusts model parameters in the direction opposite to the gradient of the loss function with respect to the parameters. The core principle is simple yet powerful: compute the gradient (the direction of steepest ascent), then move in the opposite direction by a step size called the learning rate. This process continues until convergence to a local minimum is reached. Gradient descent forms the backbone of modern deep learning optimization, with variants like Stochastic Gradient Descent and Adam building upon this foundational concept. The algorithm is applicable whenever the loss function is differentiable and provides a systematic way to optimize millions or even billions of parameters in neural networks.

Intuition

💡

Imagine standing on a foggy mountainside and wanting to reach the valley floor. You cannot see the entire landscape, but you can feel the slope beneath your feet. Gradient descent works exactly like this: at each position, you determine which way is downhill (the negative gradient) and take a step in that direction. The size of your step is the learning rate—too small and you descend slowly; too large and you might overshoot the valley or even climb back up. In machine learning, the mountain represents the loss landscape where elevation corresponds to error, and your position represents the model parameters. The valleys are regions of low loss where the model makes accurate predictions. Batch gradient descent uses the entire dataset to compute the gradient at each step, giving you the truest sense of the overall slope but requiring significant computation per step.

Mathematical Formula

\[ \theta_{t+1} = \theta_t - \eta abla_\theta L(\theta_t) \]

Step-by-Step Explanation:

  1. Step 1: Initialize parameters \( h\(\eta_0\)\) randomly or with a specific strategy
  2. Step 2: Compute the gradient \nabla_\theta L(\theta_t) - the vector of partial derivatives of the loss with respect to each parameter
  3. Step 3: Scale the gradient by learning rate \(\(\eta\)\) to control step size
  4. Step 4: Update parameters by subtracting the scaled gradient from current parameters
  5. Step 5: Repeat until convergence criteria met (gradient magnitude < threshold, max iterations reached, or loss stops decreasing)

Real-World Use Cases

Linear Regression

Finding optimal weights that minimize the sum of squared errors between predicted and actual house prices based on features like square footage, bedrooms, and location.

Logistic Regression

Training a binary classifier for spam detection by optimizing weights to minimize cross-entropy loss on labeled email data.

Neural Networks

Training the weights and biases of a feedforward neural network for image classification, where gradient descent propagates error signals backward through the network.

Recommender Systems

Optimizing matrix factorization models for collaborative filtering to predict user ratings for movies they haven't watched.

Natural Language Processing

Training word embeddings using techniques like Word2Vec or GloVe by minimizing reconstruction loss over large text corpora.

Implementation

Manual Implementation (No Libraries)

The implementation computes the exact gradient using all training data (batch mode). At each iteration, it calculates predictions, computes the MSE loss, derives the gradient analytically, and updates parameters. The gradient for linear regression has a closed form, making this implementation efficient for moderate dataset sizes.
import numpy as np

def gradient_descent(X, y, learning_rate=0.01, n_iterations=1000, tolerance=1e-6):
    """
    Batch Gradient Descent for linear regression.
    
    Args:
        X: Feature matrix (n_samples, n_features)
        y: Target vector (n_samples,)
        learning_rate: Step size for parameter updates
        n_iterations: Maximum number of iterations
        tolerance: Convergence threshold for gradient magnitude
    
    Returns:
        theta: Optimized parameters
        loss_history: Loss at each iteration
    """
    n_samples, n_features = X.shape
    
    # Initialize parameters randomly
    theta = np.random.randn(n_features) * 0.01
    loss_history = []
    
    for iteration in range(n_iterations):
        # Compute predictions
        y_pred = X @ theta
        
        # Compute loss (Mean Squared Error)
        loss = np.mean((y_pred - y) ** 2)
        loss_history.append(loss)
        
        # Compute gradient: (2/n) * X^T * (X*theta - y)
        gradient = (2 / n_samples) * X.T @ (y_pred - y)
        
        # Check convergence
        gradient_norm = np.linalg.norm(gradient)
        if gradient_norm < tolerance:
            print(f'Converged at iteration {iteration}')
            break
        
        # Update parameters
        theta = theta - learning_rate * gradient
    
    return theta, loss_history

# Example usage
np.random.seed(42)
X = np.random.randn(100, 3)
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + np.random.randn(100) * 0.1

theta_opt, losses = gradient_descent(X, y, learning_rate=0.1, n_iterations=1000)
print(f'Optimized parameters: {theta_opt}')
print(f'Final loss: {losses[-1]:.6f}')

Using Libraries (torch.optim.SGD, tensorflow.keras.optimizers.SGD)

import torch
import torch.nn as nn

# Define model, loss, and optimizer
model = nn.Linear(3, 1)  # Linear regression with 3 features
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Training loop
X = torch.randn(100, 3)
y = torch.randn(100, 1)

losses = []
for epoch in range(1000):
    # Forward pass
    outputs = model(X)
    loss = criterion(outputs, y)
    
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    
    # Parameter update
    optimizer.step()
    
    losses.append(loss.item())

print(f'Final loss: {losses[-1]:.6f}')

# TensorFlow/Keras equivalent
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(3,))])
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1),
              loss='mse')

history = model.fit(X.numpy(), y.numpy(), epochs=1000, verbose=0)
print(f'Final loss: {history.history["loss"][-1]:.6f}')

When to Use

✅ Appropriate Use Cases:

  • When the dataset is small to medium-sized (fits in memory)
  • When deterministic, stable convergence is preferred over speed
  • For convex optimization problems where global minimum is guaranteed
  • When you need the true gradient for theoretical analysis
  • As a baseline to compare more complex optimizers against
  • When training simple models like linear/logistic regression
  • For educational purposes to understand optimization fundamentals

❌ Avoid When:

  • When working with large datasets that don't fit in memory
  • When training deep neural networks (use adaptive methods instead)
  • When fast iteration and experimentation are priorities
  • When the loss landscape is highly non-convex with many saddle points
  • When real-time training updates are needed (streaming data)
  • When memory is constrained (batch gradient requires storing all data)

Common Pitfalls

  • {'pitfall': 'Learning rate too high', 'description': 'Causes divergence where loss increases instead of decreasing. The optimizer overshoots the minimum and bounces around or diverges to infinity.', 'solution': 'Start with small learning rate (0.001-0.01), use learning rate scheduling, or implement learning rate decay.'}
  • {'pitfall': 'Learning rate too low', 'description': 'Causes extremely slow convergence, requiring many iterations to reach the minimum. May get stuck in flat regions.', 'solution': 'Use larger initial learning rate with decay, or implement adaptive learning rate methods.'}
  • {'pitfall': 'Convergence to local minimum', 'description': 'In non-convex landscapes, gradient descent may converge to a local minimum instead of the global minimum.', 'solution': 'Use multiple random initializations, add momentum, or use stochastic methods that can escape local minima.'}
  • {'pitfall': 'Feature scaling issues', 'description': 'Features with different scales cause the loss landscape to be elongated, making convergence slow along certain directions.', 'solution': 'Always standardize or normalize features (zero mean, unit variance) before applying gradient descent.'}
  • {'pitfall': 'Saddle points in high dimensions', 'description': 'In high-dimensional spaces, saddle points are more common than local minima and can trap gradient descent.', 'solution': 'Add momentum or use second-order methods that can escape saddle points more effectively.'}