Quasi-Newton Methods (L-BFGS)
Definition
Quasi-Newton methods are a family of optimization algorithms that approximate Newton's method without explicitly computing or storing the Hessian matrix. Instead of the expensive O(n²) Hessian storage and O(n³) inversion, these methods build an approximation of the inverse Hessian (or Hessian itself) using only gradient information from previous iterations. The BFGS algorithm (Broyden-Fletcher-Goldfarb-Shanno) is the most popular quasi-Newton method, maintaining a dense approximation with O(n²) storage. L-BFGS (Limited-memory BFGS) is its scalable variant that uses only O(mn) memory where m is a small constant (typically 10-20), making it suitable for problems with thousands to millions of parameters. These methods achieve superlinear convergence—faster than gradient descent but without the prohibitive costs of Newton's method—making them the method of choice for medium-scale optimization problems.
Intuition
Imagine you're navigating a complex landscape and want to know which way to step, but calculating the full curvature map (Hessian) is impossibly expensive. Quasi-Newton methods are like being a smart hiker who remembers how the slope changed between your previous positions. By remembering just a few past locations and gradients (like 'at the ridge, slope was steep east; now in valley, slope is gentle'), you can infer the shape of the terrain without mapping every detail. L-BFGS is the ultra-lightweight version—instead of remembering the entire history, you only keep the last 10-20 positions, like having a short-term memory that forgets old information to save space. This approximation gets you nearly the same benefit as knowing the full curvature but with minimal memory. The 'secant equation' ensures your approximate curvature matches what you've actually observed on your path.
Mathematical Formula
Step-by-Step Explanation:
- Secant condition: Approximate Hessian B should map step s_k to gradient change y_k
- s_k (step): Difference between current and previous parameter values
- y_k (gradient difference): Change in gradient between iterations
- BFGS update: Low-rank update to Hessian approximation satisfying secant condition
- First term: Removes old curvature information along step direction
- Second term: Adds new curvature information from gradient change
- L-BFGS: Instead of storing \(B_k\) explicitly, stores last m pairs (s_i, y_i)
- Two-loop recursion: Computes H_k * g efficiently using stored vector pairs
- ρ_i = 1/(\(y_i^T s_i\)): Scaling factor for each stored pair
Real-World Use Cases
Training large-scale logistic regression with L-BFGS converges faster than SGD for datasets up to millions of samples, commonly used in LIBLINEAR and scikit-learn.
Training multinomial logistic regression for document classification with thousands of classes (like news categorization) where L-BFGS efficiently handles the parameter space.
Training CRFs for sequence labeling (NER, POS tagging) where the objective is convex but high-dimensional; L-BFGS is standard in CRF++ and similar tools.
Non-negative matrix factorization and collaborative filtering where alternating least squares or direct optimization with L-BFGS is preferred over SGD.
Optimizing hyperparameters of ML models where each evaluation is expensive; L-BFGS efficiently explores the search space.
Structural optimization, aerodynamic design, and other engineering problems with expensive simulations where gradient information is available but Hessian is not.
Implementation
Manual Implementation (No Libraries)
import numpy as np
def lbfgs_optimizer(grad_func, theta0, m=10, max_iter=100, tol=1e-5, c1=1e-4, c2=0.9):
"""
L-BFGS optimizer with two-loop recursion and Wolfe line search.
Args:
grad_func: Function returning (loss, gradient) tuple
theta0: Initial parameters
m: Memory size (number of vector pairs to store)
max_iter: Maximum iterations
tol: Convergence tolerance
c1, c2: Wolfe condition constants
"""
theta = theta0.copy()
loss, grad = grad_func(theta)
# Storage for L-BFGS
s_list = [] # Step differences
y_list = [] # Gradient differences
rho_list = [] # Scaling factors
history = [(loss, np.linalg.norm(grad))]
for k in range(max_iter):
# Two-loop recursion to compute search direction
q = grad.copy()
alphas = []
# First loop (reverse)
for i in range(len(s_list) - 1, -1, -1):
rho = rho_list[i]
s = s_list[i]
y = y_list[i]
alpha = rho * np.dot(s, q)
alphas.append(alpha)
q = q - alpha * y
# Initial Hessian approximation (scaled identity)
if len(s_list) > 0:
\(\gamma\) = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
r = \(\gamma\) * q
else:
r = q # Steepest descent for first iteration
# Second loop (forward)
for i in range(len(s_list)):
rho = rho_list[i]
s = s_list[i]
y = y_list[i]
alpha = alphas[len(alphas) - 1 - i]
beta = rho * np.dot(y, r)
r = r + s * (alpha - beta)
# r is now the approximate inverse Hessian times gradient
p = -r # Search direction
# Wolfe line search
alpha = 1.0
loss_new, grad_new = grad_func(theta + alpha * p)
# Simple backtracking if Wolfe not satisfied
while loss_new > loss + c1 * alpha * np.dot(grad, p) and alpha > 1e-10:
alpha *= 0.5
loss_new, grad_new = grad_func(theta + alpha * p)
# Update parameters
theta_new = theta + alpha * p
s = theta_new - theta
y = grad_new - grad
# Update storage (maintain limited memory)
if len(s_list) >= m:
s_list.pop(0)
y_list.pop(0)
rho_list.pop(0)
ys = np.dot(y, s)
if ys > 1e-10: # Only update if curvature condition satisfied
s_list.append(s)
y_list.append(y)
rho_list.append(1.0 / ys)
theta = theta_new
loss = loss_new
grad = grad_new
history.append((loss, np.linalg.norm(grad)))
# Check convergence
if np.linalg.norm(grad) < tol:
print(f'Converged at iteration {k+1}')
break
if (k + 1) % 10 == 0:
print(f'Iter {k+1}: Loss={loss:.6f}, |grad|={np.linalg.norm(grad):.6f}')
return theta, history
# Test on Rosenbrock function (classic optimization test)
def rosenbrock(x):
"""Rosenbrock function (banana function)."""
return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
def rosenbrock_grad(x):
"""Gradient of Rosenbrock function."""
loss = rosenbrock(x)
dfdx0 = -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2)
dfdx1 = 200 * (x[1] - x[0]**2)
return loss, np.array([dfdx0, dfdx1])
print('=== L-BFGS on Rosenbrock Function ===')
x0 = np.array([-1.0, 2.0]) # Starting far from optimum at (1, 1)
x_opt, history = lbfgs_optimizer(rosenbrock_grad, x0, m=10, max_iter=100)
print(f'
Optimal x: {x_opt}')
print(f'Function value: {rosenbrock(x_opt):.10f}')
print(f'True optimum: (1, 1), f=0')
# Quadratic function test
def quadratic(x):
A = np.array([[4, 1], [1, 3]])
b = np.array([-1, 2])
loss = 0.5 * x @ A @ x + b @ x
grad = A @ x + b
return loss, grad
print('
=== L-BFGS on Quadratic Function ===')
x0 = np.array([0.0, 0.0])
x_opt, history = lbfgs_optimizer(quadratic, x0, m=5, max_iter=50)
print(f'Optimal x: {x_opt}')
print(f'Expected: {-np.linalg.inv(np.array([[4,1],[1,3]])) @ np.array([-1, 2])}')
Using Libraries (scipy.optimize.minimize (L-BFGS-B), sklearn.linear_model.LogisticRegression, torch.optim.LBFGS)
import numpy as np
from scipy.optimize import minimize
# scipy provides excellent L-BFGS-B implementation
def rosenbrock(x):
return (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
def rosenbrock_grad(x):
dfdx0 = -2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2)
dfdx1 = 200 * (x[1] - x[0]**2)
return np.array([dfdx0, dfdx1])
# L-BFGS-B (B = Bounded, supports box constraints)
result = minimize(rosenbrock, x0=[-1, 2], method='L-BFGS-B',
jac=rosenbrock_grad,
options={'maxiter': 100, 'disp': True})
print(f'
scipy L-BFGS-B result: {result.x}')
print(f'Function value: {result.fun:.10f}')
print(f'Iterations: {result.nit}, Function evals: {result.nfev}')
# For machine learning: sklearn uses L-BFGS for some models
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=20, n_classes=2, random_state=42)
# Use L-BFGS solver
lr = LogisticRegression(solver='lbfgs', max_iter=1000, verbose=1)
lr.fit(X, y)
print(f'
Logistic Regression converged in {lr.n_iter_[0]} iterations')
# PyTorch L-BFGS (limited support)
import torch
def pytorch_lbfgs_example():
"""PyTorch L-BFGS optimizer example."""
x = torch.tensor([-1.0, 2.0], requires_grad=True)
# L-BFGS requires closure that recomputes loss
def closure():
optimizer.zero_grad()
loss = (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
loss.backward()
return loss
optimizer = torch.optim.LBFGS([x], max_iter=20, line_search_fn='strong_wolfe')
for i in range(10):
loss = optimizer.step(closure)
if i % 2 == 0:
print(f'Iter {i}: x={x.detach().numpy()}, loss={loss.item():.6f}')
return x
print('
=== PyTorch L-BFGS ===')
x_pytorch = pytorch_lbfgs_example()
When to Use
✅ Appropriate Use Cases:
- For medium-scale problems (100 to 100,000+ parameters)
- When the objective function is smooth and gradients are available
- Logistic regression, multinomial regression, and convex GLMs
- Problems where memory permits O(mn) storage but not O(n²)
- When faster convergence than gradient descent is needed
- Full-batch training when entire dataset fits in memory
- Non-linear least squares problems
- When line search is feasible (not mini-batch setting)
❌ Avoid When:
- For stochastic/mini-batch training (line search doesn't work well with noisy gradients)
- Deep neural networks (millions of parameters, non-convex, stochastic)
- Online learning where data arrives sequentially
- When function evaluations are cheap but gradient computation is expensive
- Very large-scale problems (n > 10^7) where even O(mn) memory is too much
- Problems with non-smooth objectives (L1 regularization, ReLU without smoothing)
- When you need real-time updates with streaming data
Common Pitfalls
- {'pitfall': 'Using L-BFGS with mini-batches', 'description': 'L-BFGS assumes deterministic gradients; stochastic gradients break the curvature approximation.', 'solution': 'Use SGD, Adam, or other stochastic optimizers. L-BFGS is for full-batch optimization.'}
- {'pitfall': 'Memory limit exceeded', 'description': 'For very high-dimensional problems, even O(mn) storage can be prohibitive.', 'solution': 'Reduce memory parameter m, or use first-order methods like SGD or conjugate gradient.'}
- {'pitfall': 'Poor line search performance', 'description': 'Inexact line search can lead to poor convergence or excessive function evaluations.', 'solution': 'Tune Wolfe condition parameters or use strong Wolfe line search. Monitor line search iterations.'}
- {'pitfall': 'Negative curvature ignored', 'description': 'Skipping updates when y^T s <= 0 discards valid information in non-convex problems.', 'solution': 'Use damped BFGS variants or SR1 update for indefinite Hessians.'}
- {'pitfall': 'Incorrect gradient computation', 'description': 'L-BFGS is sensitive to gradient accuracy; numerical errors can cause divergence.', 'solution': 'Verify gradients with finite differences, use automatic differentiation.'}