AdaGrad, RMSprop, and Adam

Definition

AdaGrad, RMSprop, and Adam represent the evolution of adaptive learning rate methods that automatically adjust the learning rate for each parameter based on the historical gradients. Unlike SGD with a global learning rate, these methods maintain per-parameter learning rates that adapt to the geometry of the loss landscape. AdaGrad (2011) was the pioneer, adapting learning rates based on the sum of squared historical gradients—giving smaller updates to frequently occurring features. RMSprop (2012) improved upon AdaGrad by using an exponentially decaying average instead of cumulative sum, preventing the learning rate from becoming too small. Adam (2014) combined the best of both worlds: adaptive learning rates from RMSprop plus momentum, making it the most popular optimizer in deep learning today. These methods are particularly effective for sparse gradients and non-stationary objectives.

Intuition

Imagine walking through a landscape where some directions are steep and well-defined while others are flat and uncertain. Traditional SGD takes steps of the same size in all directions. Adaptive methods are like having smart boots that automatically adjust step size: in directions where the slope has been consistently steep (like ravines), take smaller, careful steps to avoid overshooting; in directions where the terrain has been flat or sparse (like plateaus), take larger, bolder steps to cover more ground quickly. AdaGrad remembers every slope you've ever seen, making it conservative over time. RMSprop has a 'forgetting factor'—it only remembers recent slopes, staying adaptive throughout training. Adam adds memory of your velocity (like momentum), so you keep rolling in directions that have been consistently downhill while still adapting step sizes. It's like having both smart boots and a rolling start.
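The difference between "remembering every slope" and "forgetting" can be made concrete with a minimal sketch. It assumes a single parameter and a hypothetical constant gradient of 1.0 for 1000 steps; the helper names `effective_lr_adagrad` and `effective_lr_rmsprop` are illustrative and not part of any library.

```python
import math

def effective_lr_adagrad(grads, lr=0.1, eps=1e-8):
    """Return the per-step effective learning rate lr / (sqrt(G_t) + eps)."""
    G, rates = 0.0, []
    for g in grads:
        G += g * g  # cumulative sum: never forgets old slopes
        rates.append(lr / (math.sqrt(G) + eps))
    return rates

def effective_lr_rmsprop(grads, lr=0.1, beta=0.9, eps=1e-8):
    """Return the per-step effective learning rate lr / (sqrt(v_t) + eps)."""
    v, rates = 0.0, []
    for g in grads:
        v = beta * v + (1 - beta) * g * g  # decaying average: forgets old slopes
        rates.append(lr / (math.sqrt(v) + eps))
    return rates

grads = [1.0] * 1000  # hypothetical constant slope (assumption for illustration)
ada = effective_lr_adagrad(grads)
rms = effective_lr_rmsprop(grads)
print(f"AdaGrad: step 1 = {ada[0]:.4f}, step 1000 = {ada[-1]:.4f}")
print(f"RMSprop: step 1 = {rms[0]:.4f}, step 1000 = {rms[-1]:.4f}")
```

Under these assumptions AdaGrad's effective rate decays like lr / sqrt(t), while RMSprop's settles near lr / |g| and stays there.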

Mathematical Formula

\text{AdaGrad:} \quad G_t = G_{t-1} + g_t^2, \quad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} g_t

\text{RMSprop:} \quad v_t = \beta v_{t-1} + (1-\beta) g_t^2, \quad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon} g_t

\text{Adam:} \quad m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2

\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \quad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
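A short derivation may help explain where the factor 1 - \beta_1^t in Adam's bias correction comes from. It assumes m_0 = 0 and, as a simplification, that the gradients share a constant expected value \mathbb{E}[g]; the same argument applies to v_t with \beta_2 and g_t^2.

```latex
m_t = (1-\beta_1) \sum_{i=1}^{t} \beta_1^{t-i} g_i
\quad \Rightarrow \quad
\mathbb{E}[m_t] = \mathbb{E}[g] \, (1-\beta_1) \sum_{i=1}^{t} \beta_1^{t-i}
                = \mathbb{E}[g] \, (1-\beta_1^t)

% First step (t = 1): \hat{m}_1 = g_1 and \hat{v}_1 = g_1^2, so
\theta_2 = \theta_1 - \frac{\eta \, g_1}{|g_1| + \epsilon}
         \approx \theta_1 - \eta \, \mathrm{sign}(g_1)
```

Dividing m_t by 1 - \beta_1^t therefore removes the shrinkage toward zero, and the very first Adam step has magnitude close to \eta regardless of the gradient's scale.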

Step-by-Step Explanation:

AdaGrad Step 1: Accumulate squared gradients G_t = G_{t-1} + g_t^2 for each parameter
AdaGrad Step 2: Adapt learning rate inversely proportional to sqrt of accumulated squared gradients
RMSprop Step 1: Maintain exponential moving average of squared gradients v_t
RMSprop Step 2: Beta (typically 0.9) controls decay rate—higher means longer memory
RMSprop Step 3: Use sqrt of moving average to adapt per-parameter learning rates
Adam Step 1: Maintain two EMAs: m_t (first moment, the mean of the gradients) and v_t (second raw moment, the uncentered variance of the gradients)
Adam Step 2: Apply bias correction to counteract zero initialization: divide by (1-beta^t)
Adam Step 3: Update using adaptive learning rate on bias-corrected momentum
Epsilon (typically 1e-8): Small constant for numerical stability to avoid division by zero
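The steps above can be sketched as single-step update functions for one scalar parameter. This is a minimal illustration, not a library API: the function names and the starting values (theta = 1.0, gradient = 4.0) are assumptions chosen for easy arithmetic.

```python
import math

def adagrad_step(theta, g, G, lr=0.01, eps=1e-8):
    G = G + g * g                                   # Step 1: accumulate g^2
    theta = theta - lr / (math.sqrt(G) + eps) * g   # Step 2: scaled update
    return theta, G

def rmsprop_step(theta, g, v, lr=0.001, beta=0.9, eps=1e-8):
    v = beta * v + (1 - beta) * g * g               # EMA of g^2
    theta = theta - lr / (math.sqrt(v) + eps) * g
    return theta, v

def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * g                 # first moment
    v = beta2 * v + (1 - beta2) * g * g             # second raw moment
    m_hat = m / (1 - beta1 ** t)                    # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# One step from theta = 1.0 with a hypothetical gradient of 4.0
theta_a, G = adagrad_step(1.0, 4.0, G=0.0)
theta_r, v_r = rmsprop_step(1.0, 4.0, v=0.0)
theta_m, m, v = adam_step(1.0, 4.0, m=0.0, v=0.0, t=1)
print(theta_a, theta_r, theta_m)
```

In this setup the first AdaGrad and Adam steps both have magnitude equal to their learning rate, because the gradient is divided by its own absolute value. RMSprop without bias correction takes a larger first step, since v starts near zero.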

Real-World Use Cases

Natural Language Processing

Training transformer models (BERT, GPT), where embedding gradients are sparse: only the rows for tokens present in a batch receive updates. Per-parameter adaptation, introduced by AdaGrad and inherited by Adam, helps rare word embeddings that receive infrequent updates.
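As a rough illustration of why rare features benefit, the sketch below runs AdaGrad's accumulator over a hypothetical gradient stream with two parameters: one updated every step (a common word) and one updated once every 100 steps (a rare word). The stream and the helper `adagrad_effective_lrs` are assumptions made for this example.

```python
import math

def adagrad_effective_lrs(grad_stream, lr=0.1, eps=1e-8):
    """Accumulate squared gradients per parameter and return the final
    effective learning rate lr / (sqrt(G) + eps) for each one."""
    G = [0.0] * len(grad_stream[0])
    for grad in grad_stream:
        G = [Gi + gi * gi for Gi, gi in zip(G, grad)]
    return [lr / (math.sqrt(Gi) + eps) for Gi in G]

# Parameter 0 gets a gradient every step; parameter 1 only every 100th step.
stream = [[1.0, 1.0 if step % 100 == 0 else 0.0] for step in range(1000)]
common_lr, rare_lr = adagrad_effective_lrs(stream)
print(f"common: {common_lr:.4f}, rare: {rare_lr:.4f}")
```

The rarely updated parameter keeps a larger effective learning rate, so each of its infrequent updates moves it further.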

Recommendation Systems

Training matrix factorization models with sparse user-item interactions. RMSprop/Adam handle varying gradient frequencies across different user/item latent factors.

Computer Vision

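Training CNNs such as ResNet and EfficientNet, where Adam is a common starting point. Note that the original ResNet results were obtained with SGD plus momentum and the original EfficientNet results with RMSprop. Automatic learning rate adaptation reduces the need for extensive hyperparameter tuning.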

Reinforcement Learning

Training policy and value networks where gradients are noisy due to environment stochasticity. Adam's momentum smoothing and adaptive rates stabilize training.

Transfer Learning

Fine-tuning pretrained models where different layers need different update speeds. Adam automatically handles varying gradient magnitudes across layers.

Multi-Task Learning

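Training models with multiple loss components that may have vastly different gradient scales. Adaptive methods normalize update magnitudes per parameter, which helps when task-specific parameters see very different gradient scales. On shared parameters the task with larger gradients can still dominate, so explicit loss weighting may still be needed.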
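The per-parameter normalization can be checked with a small sketch: two hypothetical task-specific parameters receive gradient streams that differ only by a factor of 1000, and Adam produces nearly identical updates for both. The gradient values and the helper `adam_updates` are assumptions for illustration.

```python
import math

def adam_updates(grads, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the sequence of Adam updates for a scalar gradient stream."""
    m = v = 0.0
    updates = []
    for t, g in enumerate(grads, start=1):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        updates.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return updates

task_a = [0.5, -0.2, 0.4, 0.3, -0.1]    # hypothetical small-scale gradients
task_b = [1000.0 * g for g in task_a]   # same pattern, 1000x larger

for ua, ub in zip(adam_updates(task_a), adam_updates(task_b)):
    print(f"{ua:+.6f}  {ub:+.6f}")
```

The two columns agree up to the small effect of epsilon, whereas plain SGD updates would differ by the full factor of 1000.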

Implementation

Manual Implementation (No Libraries)

Three implementations showing the progression: AdaGrad accumulates all squared gradients (so the effective learning rate can shrink toward zero), RMSprop uses an exponential moving average (which prevents that shrinkage), and Adam adds momentum and bias correction. All three adapt learning rates per parameter based on gradient history.

import numpy as np

def adagrad_optimizer(X, y, lr=0.01, epsilon=1e-8, n_epochs=100, batch_size=32):
    """AdaGrad optimizer."""
    n_samples, n_features = X.shape
    theta = np.random.randn(n_features) * 0.01
    accumulated_squared = np.zeros(n_features)
    loss_history = []
    
    for epoch in range(n_epochs):
        indices = np.random.permutation(n_samples)
        epoch_loss = 0
        
        for i in range(0, n_samples, batch_size):
            X_batch = X[indices[i:i+batch_size]]
            y_batch = y[indices[i:i+batch_size]]
            
            y_pred = X_batch @ theta
            gradient = (2 / len(X_batch)) * X_batch.T @ (y_pred - y_batch)
            
            # Accumulate squared gradients
            accumulated_squared += gradient ** 2
            
            # Adaptive update
            adapted_lr = lr / (np.sqrt(accumulated_squared) + epsilon)
            theta = theta - adapted_lr * gradient
            
            epoch_loss += np.mean((y_pred - y_batch) ** 2)
        
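        # Average over the actual number of batches (the last one may be partial)
        n_batches = int(np.ceil(n_samples / batch_size))
        loss_history.append(epoch_loss / n_batches)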
    
    return theta, loss_history

def rmsprop_optimizer(X, y, lr=0.001, beta=0.9, epsilon=1e-8, 
                      n_epochs=100, batch_size=32):
    """RMSprop optimizer."""
    n_samples, n_features = X.shape
    theta = np.random.randn(n_features) * 0.01
    moving_avg_squared = np.zeros(n_features)
    loss_history = []
    
    for epoch in range(n_epochs):
        indices = np.random.permutation(n_samples)
        epoch_loss = 0
        
        for i in range(0, n_samples, batch_size):
            X_batch = X[indices[i:i+batch_size]]
            y_batch = y[indices[i:i+batch_size]]
            
            y_pred = X_batch @ theta
            gradient = (2 / len(X_batch)) * X_batch.T @ (y_pred - y_batch)
            
            # Exponential moving average of squared gradients
            moving_avg_squared = beta * moving_avg_squared + (1 - beta) * (gradient ** 2)
            
            # Adaptive update
            adapted_lr = lr / (np.sqrt(moving_avg_squared) + epsilon)
            theta = theta - adapted_lr * gradient
            
            epoch_loss += np.mean((y_pred - y_batch) ** 2)
        
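        # Average over the actual number of batches (the last one may be partial)
        n_batches = int(np.ceil(n_samples / batch_size))
        loss_history.append(epoch_loss / n_batches)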
    
    return theta, loss_history

def adam_optimizer(X, y, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8,
                   n_epochs=100, batch_size=32):
    """Adam optimizer with bias correction."""
    n_samples, n_features = X.shape
    theta = np.random.randn(n_features) * 0.01
    m = np.zeros(n_features)  # First moment
    v = np.zeros(n_features)  # Second moment
    loss_history = []
    t = 0  # Global timestep; it must persist across epochs for bias correction
    
    for epoch in range(n_epochs):
        indices = np.random.permutation(n_samples)
        epoch_loss = 0
        
        for i in range(0, n_samples, batch_size):
            t += 1
            X_batch = X[indices[i:i+batch_size]]
            y_batch = y[indices[i:i+batch_size]]
            
            y_pred = X_batch @ theta
            gradient = (2 / len(X_batch)) * X_batch.T @ (y_pred - y_batch)
            
            # Update biased first moment estimate
            m = beta1 * m + (1 - beta1) * gradient
            # Update biased second raw moment estimate
            v = beta2 * v + (1 - beta2) * (gradient ** 2)
            
            # Compute bias-corrected estimates
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
            
            # Update parameters
            theta = theta - lr * m_hat / (np.sqrt(v_hat) + epsilon)
            
            epoch_loss += np.mean((y_pred - y_batch) ** 2)
        
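        # Average over the actual number of batches (the last one may be partial)
        n_batches = int(np.ceil(n_samples / batch_size))
        loss_history.append(epoch_loss / n_batches)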
    
    return theta, loss_history

# Comparison
np.random.seed(42)
X = np.random.randn(1000, 10)
true_theta = np.random.randn(10) * 2
y = X @ true_theta + np.random.randn(1000) * 0.5

print('=== AdaGrad ===')
theta_adagrad, losses_adagrad = adagrad_optimizer(X, y, lr=0.1, n_epochs=50)
print(f'Final loss: {losses_adagrad[-1]:.6f}')

print('\n=== RMSprop ===')
theta_rmsprop, losses_rmsprop = rmsprop_optimizer(X, y, lr=0.01, n_epochs=50)
print(f'Final loss: {losses_rmsprop[-1]:.6f}')

print('\n=== Adam ===')
theta_adam, losses_adam = adam_optimizer(X, y, lr=0.01, n_epochs=50)
print(f'Final loss: {losses_adam[-1]:.6f}')

Using Libraries (torch.optim.Adagrad, torch.optim.RMSprop, torch.optim.Adam, tensorflow.keras.optimizers.Adam)

import torch
import torch.nn as nn

# Setup
X = torch.randn(1000, 10)
y = torch.randn(1000, 1)
dataset = torch.utils.data.TensorDataset(X, y)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)

optimizers = {
    'AdaGrad': lambda m: torch.optim.Adagrad(m.parameters(), lr=0.1),
    'RMSprop': lambda m: torch.optim.RMSprop(m.parameters(), lr=0.001, alpha=0.9),
    'Adam': lambda m: torch.optim.Adam(m.parameters(), lr=0.001, 
                                       betas=(0.9, 0.999))
}

criterion = nn.MSELoss()

for name, opt_fn in optimizers.items():
    print(f'\n=== PyTorch {name} ===')
    model = nn.Linear(10, 1)
    optimizer = opt_fn(model)
    
    for epoch in range(50):
        epoch_loss = 0
        for batch_X, batch_y in dataloader:
            optimizer.zero_grad()
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        
        if epoch % 10 == 0:
            print(f'Epoch {epoch}: Loss = {epoch_loss/len(dataloader):.6f}')

# TensorFlow
import tensorflow as tf

print('\n=== TensorFlow Adam ===')
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='mse'
)
history = model.fit(X.numpy(), y.numpy(), epochs=50, verbose=0)
print(f'Final loss: {history.history["loss"][-1]:.6f}')

When to Use

✅ Appropriate Use Cases:

AdaGrad: For sparse data with rare features (NLP with rare words, recommendation systems)
AdaGrad: When you want aggressive learning rate decay for frequently updated parameters
RMSprop: For non-stationary objectives where you need continuous adaptation
RMSprop: Recurrent neural networks, sequence modeling tasks
Adam: General-purpose deep learning (most popular default)
Adam: When you want minimal hyperparameter tuning (works well with defaults)
Adam: Large models where manual LR tuning per layer is impractical
Any adaptive method: When gradients vary significantly across parameters or time

❌ Avoid When:

AdaGrad: For dense gradients or when training many epochs (LR becomes too small)
AdaGrad: Non-convex deep learning (RMSprop or Adam preferred)
RMSprop: When you need the theoretical guarantees of AdaGrad for convex problems
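Adam (plain, with L2 regularization): When training transformer models from scratch; AdamW with learning rate warmup is the usual choice there, and SGD+momentum generally performs worse on these models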
Adam: For some computer vision tasks where carefully tuned SGD+momentum outperforms
Adam: When generalization is critical (may overfit compared to properly tuned SGD)
Any adaptive method: When memory is very limited (AdaGrad and RMSprop store one extra value per parameter, Adam stores two)
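The memory overhead can be estimated with simple arithmetic. The sketch below counts the extra state arrays each optimizer keeps beyond the parameters and gradients; the 100M-parameter model size and float32 state are assumptions for illustration, and real frameworks may add further buffers.

```python
# Extra state arrays kept per parameter (beyond the parameter and its gradient)
STATE_ARRAYS = {"SGD": 0, "SGD+momentum": 1, "AdaGrad": 1, "RMSprop": 1, "Adam": 2}

def optimizer_state_bytes(n_params, optimizer, bytes_per_value=4):
    """Bytes of optimizer state for a model with n_params parameters."""
    return STATE_ARRAYS[optimizer] * n_params * bytes_per_value

n_params = 100_000_000  # hypothetical 100M-parameter model, float32 state
for name in STATE_ARRAYS:
    mib = optimizer_state_bytes(n_params, name) / 2**20
    print(f"{name:>13}: {mib:8.1f} MiB of optimizer state")
```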

Common Pitfalls

AdaGrad learning rate collapse: Accumulated squared gradients grow indefinitely, causing the learning rate to shrink to near zero and stopping learning prematurely. Solution: Use RMSprop or Adam instead for deep learning; AdaGrad is mainly suited to sparse convex problems.
Adam's epsilon sensitivity: The default epsilon=1e-8 may be too small for some applications, causing numerical instability. Solution: Try a larger epsilon (for example 1e-4 or 1e-3) for tasks with small gradients, such as NLP or some RL applications.
Adam with weight decay confusion: Standard L2 regularization doesn't work as intended with Adam, because the penalty gradient is rescaled by the adaptive learning rates. Solution: Use AdamW, which decouples weight decay from the gradient update.
Poor generalization with Adam: Adam sometimes finds sharp minima that don't generalize as well as the flatter minima SGD tends to find. Solution: Use learning rate scheduling with Adam, or switch to SGD+momentum after an initial phase with Adam.
Forgetting bias correction: Early iterations of Adam have estimates biased toward zero because m and v are initialized to zero. Solution: Always include the bias correction terms; most library implementations do this automatically.
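The weight decay pitfall can be illustrated with a minimal scalar sketch. It contrasts Adam with the L2 penalty folded into the gradient against a decoupled, AdamW-style step. The function names, the weight value 10.0, and the decay coefficient 0.01 are assumptions for this example, and the loss gradient is set to zero so that only the regularizer moves the weight.

```python
import math

def adam_l2_step(theta, g, m, v, t, lr=0.001, wd=0.01,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with L2 regularization: wd * theta is added to the gradient,
    so the penalty is rescaled by the adaptive denominator."""
    g = g + wd * theta
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

def adamw_step(theta, g, m, v, t, lr=0.001, wd=0.01,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """Decoupled weight decay: shrink theta directly, outside the adaptive update."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat, v_hat = m / (1 - beta1 ** t), v / (1 - beta2 ** t)
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * wd * theta, m, v

theta = 10.0
l2, _, _ = adam_l2_step(theta, g=0.0, m=0.0, v=0.0, t=1)
dw, _, _ = adamw_step(theta, g=0.0, m=0.0, v=0.0, t=1)
print(f"L2-in-gradient shrink: {theta - l2:.6f}")
print(f"Decoupled shrink:      {theta - dw:.6f}")
```

With L2 folded into the gradient, the first step has magnitude close to lr no matter how large the weight or the decay coefficient is. The decoupled step shrinks the weight in proportion to lr * wd * theta, which is the behavior weight decay is meant to have.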