AdaGrad, RMSprop, and Adam
Definition
AdaGrad, RMSprop, and Adam are successive adaptive learning rate methods that adjust the step size of each parameter based on its gradient history. Unlike SGD with a single global learning rate, these methods maintain per-parameter learning rates that adapt to the geometry of the loss landscape. AdaGrad (2011) came first: it scales each parameter's learning rate by the inverse square root of the sum of its squared historical gradients, so parameters tied to frequently occurring features get smaller updates and those tied to rare features get larger ones. RMSprop (2012) replaced the cumulative sum with an exponentially decaying average, which prevents the learning rate from shrinking toward zero. Adam (2014) combined RMSprop's adaptive learning rates with momentum and added bias correction, and it has become one of the most widely used optimizers in deep learning. These methods are particularly effective for sparse gradients and non-stationary objectives.
Intuition
Imagine walking through a landscape where some directions are steep and well-defined while others are flat and uncertain. Traditional SGD takes steps of the same size in all directions. Adaptive methods are like having smart boots that automatically adjust step size: in directions where the slope has been consistently steep (like ravines), take smaller, careful steps to avoid overshooting; in directions where the terrain has been flat or sparse (like plateaus), take larger, bolder steps to cover more ground quickly. AdaGrad remembers every slope you've ever seen, making it conservative over time. RMSprop has a 'forgetting factor'—it only remembers recent slopes, staying adaptive throughout training. Adam adds memory of your velocity (like momentum), so you keep rolling in directions that have been consistently downhill while still adapting step sizes. It's like having both smart boots and a rolling start.
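To make the difference in "memory" concrete, here is a minimal sketch in pure Python. It assumes a single parameter that sees a constant gradient of 1.0, and the hyperparameters and the function name `effective_steps` are illustrative choices, not part of any library.

```python
import math

def effective_steps(n_steps=100, grad=1.0, lr=0.1, beta=0.9, eps=1e-8):
    """Return per-step effective step sizes for AdaGrad and RMSprop.

    Assumes one parameter that receives the same gradient on every step.
    """
    g_sum = 0.0   # AdaGrad: cumulative sum of squared gradients
    g_avg = 0.0   # RMSprop: exponential moving average of squared gradients
    adagrad, rmsprop = [], []
    for _ in range(n_steps):
        g_sum += grad ** 2
        g_avg = beta * g_avg + (1 - beta) * grad ** 2
        adagrad.append(lr / (math.sqrt(g_sum) + eps))
        rmsprop.append(lr / (math.sqrt(g_avg) + eps))
    return adagrad, rmsprop

adagrad, rmsprop = effective_steps()
print(f"AdaGrad step 1: {adagrad[0]:.4f}, step 100: {adagrad[-1]:.4f}")
print(f"RMSprop step 1: {rmsprop[0]:.4f}, step 100: {rmsprop[-1]:.4f}")
```

Under these assumptions AdaGrad's step keeps shrinking roughly as 1/sqrt(t), while RMSprop's settles near lr / |gradient| once its moving average has warmed up.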
Mathematical Formula
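The update rules that the steps below describe can be written as follows. This is a reconstruction in standard notation, not a quotation from the original papers: $g_t$ is the gradient at step $t$, $\eta$ the learning rate, $\theta$ the parameters, and all operations are element-wise. The symbols $\beta$, $\beta_1$, $\beta_2$, and $\epsilon$ follow the names used in the steps and in the code in this entry.

```latex
\begin{align*}
\text{AdaGrad:}\quad
  & G_t = G_{t-1} + g_t^2, \\
  & \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon}\, g_t \\[6pt]
\text{RMSprop:}\quad
  & v_t = \beta\, v_{t-1} + (1 - \beta)\, g_t^2, \\
  & \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t \\[6pt]
\text{Adam:}\quad
  & m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t, \\
  & v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2, \\
  & \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \qquad
    \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \\
  & \theta_{t+1} = \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}
\end{align*}
```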
Step-by-Step Explanation:
- AdaGrad Step 1: Accumulate squared gradients G_t = G_{t-1} + g_t^2 for each parameter
- AdaGrad Step 2: Adapt learning rate inversely proportional to sqrt of accumulated squared gradients
- RMSprop Step 1: Maintain exponential moving average of squared gradients v_t
- RMSprop Step 2: Beta (typically 0.9) controls decay rate—higher means longer memory
- RMSprop Step 3: Use sqrt of moving average to adapt per-parameter learning rates
- Adam Step 1: Maintain two EMAs: m_t (first moment, the gradient mean) and v_t (second moment, the uncentered gradient variance)
- Adam Step 2: Apply bias correction to counteract zero initialization: divide by (1-beta^t)
- Adam Step 3: Update using adaptive learning rate on bias-corrected momentum
- Epsilon (typically 1e-8): Small constant for numerical stability to avoid division by zero
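A small numeric sketch of why the bias correction in Adam Step 2 matters. It assumes a constant gradient of 1.0 and the typical values beta1 = 0.9 and beta2 = 0.999; the helper name `adam_moments` is illustrative.

```python
def adam_moments(grad=1.0, beta1=0.9, beta2=0.999, n_steps=3):
    """Track raw and bias-corrected Adam moments for a constant gradient."""
    m, v = 0.0, 0.0  # both moments start at zero, which biases early estimates
    rows = []
    for t in range(1, n_steps + 1):
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
        rows.append((t, m, v, m_hat, v_hat))
    return rows

for t, m, v, m_hat, v_hat in adam_moments():
    print(f"t={t}: m={m:.4f} v={v:.6f} m_hat={m_hat:.4f} v_hat={v_hat:.4f}")
```

With a constant gradient the raw averages start far below the true value of 1.0, while the corrected estimates recover it from the first step. Without correction, the first update would be scaled by m / sqrt(v), roughly 3.16 in this setup, instead of 1.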
Real-World Use Cases
Training transformer models (BERT, GPT), where embedding rows for rare tokens receive gradients only occasionally. The per-parameter adaptation introduced by AdaGrad, which Adam inherits, helps rare word embeddings that receive infrequent updates.
Training matrix factorization models with sparse user-item interactions. RMSprop/Adam handle varying gradient frequencies across different user/item latent factors.
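Training CNNs such as ResNet and EfficientNet, where Adam is a common first choice because automatic learning rate adaptation reduces the need for extensive hyperparameter tuning, although carefully tuned SGD with momentum often matches or beats it on vision benchmarks.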
Training policy and value networks where gradients are noisy due to environment stochasticity. Adam's momentum smoothing and adaptive rates stabilize training.
Fine-tuning pretrained models where different layers need different update speeds. Adam automatically handles varying gradient magnitudes across layers.
Training models with multiple loss components that may have vastly different gradient scales. Adaptive methods balance updates across tasks automatically.
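The rare-feature effect that several of these use cases rely on can be sketched with AdaGrad's accumulator alone. The setup is an assumption made for illustration: two parameters, one receiving a unit gradient on every step and one only on every tenth step. The function name `adagrad_effective_lr` is hypothetical.

```python
import math

def adagrad_effective_lr(n_steps=1000, rare_every=10, lr=0.1, eps=1e-8):
    """Effective AdaGrad learning rates for a frequent and a rare feature."""
    acc_frequent = 0.0
    acc_rare = 0.0
    for step in range(n_steps):
        acc_frequent += 1.0 ** 2      # unit gradient seen on every step
        if step % rare_every == 0:
            acc_rare += 1.0 ** 2      # unit gradient seen on every 10th step
    lr_frequent = lr / (math.sqrt(acc_frequent) + eps)
    lr_rare = lr / (math.sqrt(acc_rare) + eps)
    return lr_frequent, lr_rare

lr_frequent, lr_rare = adagrad_effective_lr()
print(f"frequent: {lr_frequent:.5f}, rare: {lr_rare:.5f}")
```

Because the rare parameter accumulates ten times less squared gradient, its effective learning rate ends up larger by a factor of sqrt(10), which is the sense in which adaptive methods favor infrequently updated parameters.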
Implementation
Manual Implementation (No Libraries)
import numpy as np


def adagrad_optimizer(X, y, lr=0.01, epsilon=1e-8, n_epochs=100, batch_size=32):
    """AdaGrad optimizer for linear regression with an MSE loss."""
    n_samples, n_features = X.shape
    theta = np.random.randn(n_features) * 0.01
    accumulated_squared = np.zeros(n_features)
    loss_history = []
    # Number of mini-batches per epoch, counting a final partial batch
    n_batches = int(np.ceil(n_samples / batch_size))
    for epoch in range(n_epochs):
        indices = np.random.permutation(n_samples)
        epoch_loss = 0
        for i in range(0, n_samples, batch_size):
            X_batch = X[indices[i:i+batch_size]]
            y_batch = y[indices[i:i+batch_size]]
            y_pred = X_batch @ theta
            gradient = (2 / len(X_batch)) * X_batch.T @ (y_pred - y_batch)
            # Accumulate squared gradients
            accumulated_squared += gradient ** 2
            # Adaptive update
            adapted_lr = lr / (np.sqrt(accumulated_squared) + epsilon)
            theta = theta - adapted_lr * gradient
            epoch_loss += np.mean((y_pred - y_batch) ** 2)
        loss_history.append(epoch_loss / n_batches)
    return theta, loss_history


def rmsprop_optimizer(X, y, lr=0.001, beta=0.9, epsilon=1e-8,
                      n_epochs=100, batch_size=32):
    """RMSprop optimizer for linear regression with an MSE loss."""
    n_samples, n_features = X.shape
    theta = np.random.randn(n_features) * 0.01
    moving_avg_squared = np.zeros(n_features)
    loss_history = []
    # Number of mini-batches per epoch, counting a final partial batch
    n_batches = int(np.ceil(n_samples / batch_size))
    for epoch in range(n_epochs):
        indices = np.random.permutation(n_samples)
        epoch_loss = 0
        for i in range(0, n_samples, batch_size):
            X_batch = X[indices[i:i+batch_size]]
            y_batch = y[indices[i:i+batch_size]]
            y_pred = X_batch @ theta
            gradient = (2 / len(X_batch)) * X_batch.T @ (y_pred - y_batch)
            # Exponential moving average of squared gradients
            moving_avg_squared = beta * moving_avg_squared + (1 - beta) * (gradient ** 2)
            # Adaptive update
            adapted_lr = lr / (np.sqrt(moving_avg_squared) + epsilon)
            theta = theta - adapted_lr * gradient
            epoch_loss += np.mean((y_pred - y_batch) ** 2)
        loss_history.append(epoch_loss / n_batches)
    return theta, loss_history


def adam_optimizer(X, y, lr=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8,
                   n_epochs=100, batch_size=32):
    """Adam optimizer with bias correction for linear regression (MSE loss)."""
    n_samples, n_features = X.shape
    theta = np.random.randn(n_features) * 0.01
    m = np.zeros(n_features)  # First moment
    v = np.zeros(n_features)  # Second moment
    loss_history = []
    # Number of mini-batches per epoch, counting a final partial batch
    n_batches = int(np.ceil(n_samples / batch_size))
    # Global step counter: it must persist across epochs, otherwise the
    # bias correction is re-applied to moments that are already warmed up
    t = 0
    for epoch in range(n_epochs):
        indices = np.random.permutation(n_samples)
        epoch_loss = 0
        for i in range(0, n_samples, batch_size):
            t += 1
            X_batch = X[indices[i:i+batch_size]]
            y_batch = y[indices[i:i+batch_size]]
            y_pred = X_batch @ theta
            gradient = (2 / len(X_batch)) * X_batch.T @ (y_pred - y_batch)
            # Update biased first moment estimate
            m = beta1 * m + (1 - beta1) * gradient
            # Update biased second raw moment estimate
            v = beta2 * v + (1 - beta2) * (gradient ** 2)
            # Compute bias-corrected estimates
            m_hat = m / (1 - beta1 ** t)
            v_hat = v / (1 - beta2 ** t)
            # Update parameters
            theta = theta - lr * m_hat / (np.sqrt(v_hat) + epsilon)
            epoch_loss += np.mean((y_pred - y_batch) ** 2)
        loss_history.append(epoch_loss / n_batches)
    return theta, loss_history


# Comparison
np.random.seed(42)
X = np.random.randn(1000, 10)
true_theta = np.random.randn(10) * 2
y = X @ true_theta + np.random.randn(1000) * 0.5
print('=== AdaGrad ===')
theta_adagrad, losses_adagrad = adagrad_optimizer(X, y, lr=0.1, n_epochs=50)
print(f'Final loss: {losses_adagrad[-1]:.6f}')
print('\n=== RMSprop ===')
theta_rmsprop, losses_rmsprop = rmsprop_optimizer(X, y, lr=0.01, n_epochs=50)
print(f'Final loss: {losses_rmsprop[-1]:.6f}')
print('\n=== Adam ===')
theta_adam, losses_adam = adam_optimizer(X, y, lr=0.01, n_epochs=50)
print(f'Final loss: {losses_adam[-1]:.6f}')
Using Libraries (torch.optim.Adagrad, torch.optim.RMSprop, torch.optim.Adam, tensorflow.keras.optimizers.Adam)
import torch
import torch.nn as nn
# Setup
X = torch.randn(1000, 10)
y = torch.randn(1000, 1)
dataset = torch.utils.data.TensorDataset(X, y)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
optimizers = {
    'AdaGrad': lambda m: torch.optim.Adagrad(m.parameters(), lr=0.1),
    'RMSprop': lambda m: torch.optim.RMSprop(m.parameters(), lr=0.001, alpha=0.9),
    'Adam': lambda m: torch.optim.Adam(m.parameters(), lr=0.001,
                                       betas=(0.9, 0.999)),
}
criterion = nn.MSELoss()
for name, opt_fn in optimizers.items():
    print(f'\n=== PyTorch {name} ===')
    model = nn.Linear(10, 1)
    optimizer = opt_fn(model)
    for epoch in range(50):
        epoch_loss = 0
        for batch_X, batch_y in dataloader:
            optimizer.zero_grad()
            outputs = model(batch_X)
            loss = criterion(outputs, batch_y)
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if epoch % 10 == 0:
            print(f'Epoch {epoch}: Loss = {epoch_loss/len(dataloader):.6f}')
# TensorFlow
import tensorflow as tf
print('\n=== TensorFlow Adam ===')
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss='mse'
)
history = model.fit(X.numpy(), y.numpy(), epochs=50, verbose=0)
print(f'Final loss: {history.history["loss"][-1]:.6f}')
When to Use
✅ Appropriate Use Cases:
- AdaGrad: For sparse data with rare features (NLP with rare words, recommendation systems)
- AdaGrad: When you want aggressive learning rate decay for frequently updated parameters
- RMSprop: For non-stationary objectives where you need continuous adaptation
- RMSprop: Recurrent neural networks, sequence modeling tasks
- Adam: General-purpose deep learning (most popular default)
- Adam: When you want minimal hyperparameter tuning (works well with defaults)
- Adam: Large models where manual LR tuning per layer is impractical
- Any adaptive method: When gradients vary significantly across parameters or time
❌ Avoid When:
- AdaGrad: For dense gradients or when training many epochs (LR becomes too small)
- AdaGrad: Non-convex deep learning (RMSprop or Adam preferred)
- RMSprop: When you need the theoretical guarantees of AdaGrad for convex problems
- Adam: Plain Adam with L2 weight decay when training transformer models from scratch (AdamW is the usual choice there; SGD+momentum tends to work poorly on transformers)
- Adam: For some computer vision tasks where carefully tuned SGD+momentum outperforms
- Adam: When generalization is critical (may generalize worse than properly tuned SGD on some tasks)
- Any adaptive method: When memory is very limited (the optimizer stores one or two extra state values per parameter)
Common Pitfalls
- AdaGrad learning rate collapse: Accumulated squared gradients grow indefinitely, causing learning rate to shrink to near zero, stopping learning prematurely. Solution: Use RMSprop or Adam instead for deep learning; AdaGrad is mainly for sparse convex problems.
- Adam's epsilon sensitivity: Default epsilon=1e-8 may be too small for some applications, causing numerical instability. Solution: Increase epsilon to 1e-4 or 1e-3 for tasks with small gradients, like NLP or some RL applications.
- Adam with weight decay confusion: Standard L2 regularization doesn't work correctly with Adam's adaptive learning rates. Solution: Use AdamW which decouples weight decay from gradient updates, or use proper weight decay implementation.
- Poor generalization with Adam: Adam sometimes finds sharp minima that don't generalize as well as SGD's flatter minima. Solution: Use learning rate scheduling with Adam, or switch to SGD+momentum after warmup with Adam.
- Forgetting bias correction: Early iterations of Adam have biased estimates due to zero initialization. Solution: Always include bias correction terms; most library implementations do this automatically.
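The weight decay pitfall above can be illustrated with a single scalar Adam step. This is a minimal sketch rather than any library's implementation: the function `adam_step` and its `decoupled` flag are hypothetical, and the task gradient is set to zero so that only the decay term acts on the weight.

```python
import math

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=False):
    """One scalar Adam step with either L2-style or decoupled weight decay."""
    if not decoupled:
        # L2 regularization: the decay term joins the gradient and is then
        # rescaled by the adaptive denominator like everything else.
        grad = grad + weight_decay * theta
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta_new = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        # Decoupled decay (AdamW-style): shrink the weight directly,
        # outside the adaptive rescaling.
        theta_new -= lr * weight_decay * theta
    return theta_new, m, v

for wd in (0.01, 0.1):
    l2_theta, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=wd)
    dec_theta, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, weight_decay=wd,
                                decoupled=True)
    print(f"wd={wd}: L2 -> {l2_theta:.6f}, decoupled -> {dec_theta:.6f}")
```

In this setup the L2 variant moves the weight by about one full learning rate regardless of the `weight_decay` value, because the adaptive normalization cancels the decay's magnitude, whereas the decoupled variant shrinks the weight in proportion to `weight_decay`.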