Newton's Method
Definition
Newton's method is a second-order optimization algorithm that uses both the gradient (first derivative) and the Hessian matrix (second derivative) to find the minimum of a function. Unlike first-order methods like gradient descent that only consider the local slope, Newton's method approximates the loss function as a quadratic using Taylor expansion and jumps directly to the minimum of that approximation. This provides significantly faster convergence, achieving quadratic convergence rates near the optimum compared to the linear rate of gradient descent. The method computes the optimal step by multiplying the gradient by the inverse Hessian, effectively rescaling the update direction to account for the local curvature of the loss landscape. Newton's method is particularly powerful for convex optimization and problems where the Hessian can be computed efficiently.
Intuition
Imagine you're blindfolded in a valley and want to reach the lowest point. Gradient descent tells you which way is downhill right now, like feeling the ground at your feet and stepping in the steepest direction. Newton's method is like having a magical ability to sense not just the slope but also how the slope is changing (curvature) in all directions. If you're on a gentle slope that's about to become steep, Newton's method knows to take a bigger step. If you're on steep terrain that's flattening out, it knows to take a smaller step. The Hessian inverse acts like a 'curvature-aware compass' that tells you exactly how far to step in each direction, not just which way to go. In a quadratic bowl, Newton's method jumps directly to the bottom in one step, while gradient descent takes many small steps zigzagging down.
Mathematical Formula
Step-by-Step Explanation:
- Step 1: Compute gradient vector g = ∇L(θ_t) at current parameters
- Step 2: Compute Hessian matrix H, where H_ij = ∂²L/∂θ_i∂θ_j (second derivatives)
- Step 3: Solve linear system H * Δθ = g for update direction Δθ (or compute H^{-1} * g directly)
- Step 4: Update parameters: θ_{t+1} = θ_t - H^{-1} * g (for convex problems, subtract; adjust sign for maximization)
- Step 5: Damped version uses step size α (found via line search) for non-quadratic functions
- Step 6: Regularized version adds λI to H for numerical stability when H is not positive definite
- Key insight: H^{-1} rescales gradient direction based on local curvature, providing optimal step sizes
Real-World Use Cases
Training logistic regression with Newton's method (IRLS - Iteratively Reweighted Least Squares) converges in 5-10 iterations compared to hundreds for gradient descent, especially effective for small-to-medium datasets.
Mean-variance portfolio optimization where the objective is quadratic; Newton's method finds the optimal asset allocation that maximizes return for given risk in few iterations.
Optimizing hyperparameters of Gaussian Processes requires maximizing marginal likelihood; Newton's method efficiently handles the non-convex optimization.
Finite element analysis optimization where equilibrium equations are solved; Newton-Raphson method is standard for nonlinear structural analysis.
Inverse kinematics problems where robot joint angles must be computed to achieve desired end-effector positions; Newton's method provides fast convergence.
Process optimization and parameter estimation in chemical reaction networks where accurate convergence is critical.
Implementation
Manual Implementation (No Libraries)
import numpy as np
from scipy.optimize import minimize
def newtons_method(grad_func, hess_func, theta0, tol=1e-6, max_iter=100):
"""
Pure Newton's method for optimization.
Args:
grad_func: Function that computes gradient ∇L(θ)
hess_func: Function that computes Hessian H(θ)
theta0: Initial parameter vector
tol: Convergence tolerance
max_iter: Maximum iterations
Returns:
theta: Optimized parameters
history: List of (theta, loss) tuples
"""
theta = theta0.copy()
history = []
for i in range(max_iter):
# Compute gradient and Hessian
grad = grad_func(theta)
hess = hess_func(theta)
# Check convergence
grad_norm = np.linalg.norm(grad)
if grad_norm < tol:
print(f'Converged at iteration {i}')
break
# Solve H * delta = grad for update direction
# Using Cholesky decomposition for stability
try:
L = np.linalg.cholesky(hess)
# Solve L * y = grad, then L^T * delta = y
y = np.linalg.solve(L, grad)
delta = np.linalg.solve(L.T, y)
except np.linalg.LinAlgError:
# Hessian not positive definite, add regularization
print(f'Iteration {i}: Hessian not PD, adding regularization')
hess_reg = hess + 0.1 * np.eye(len(theta))
delta = np.linalg.solve(hess_reg, grad)
# Update parameters
theta = theta - delta
history.append((theta.copy(), grad_norm))
return theta, history
# Example: Quadratic function f(x) = 0.5 * x^T A x + b^T x + c
# Minimum at x = -A^{-1} b
A = np.array([[4, 1], [1, 3]])
b = np.array([-1, 2])
def grad_quad(x):
return A @ x + b
def hess_quad(x):
return A # Constant for quadratic
print('=== Newton on Quadratic Function ===')
x0 = np.array([0.0, 0.0])
x_opt, history = newtons_method(grad_quad, hess_quad, x0)
print(f'Optimal x: {x_opt}')
print(f'True optimum: {-np.linalg.inv(A) @ b}')
print(f'Converged in {len(history)} iterations')
# Example: Logistic Regression
def sigmoid(z):
return 1 / (1 + np.exp(-np.clip(z, -250, 250)))
def logistic_newton(X, y, max_iter=10):
"""Newton's method for logistic regression (IRLS)."""
n_samples, n_features = X.shape
theta = np.zeros(n_features)
for i in range(max_iter):
# Predictions
z = X @ theta
p = sigmoid(z)
# Gradient
grad = X.T @ (p - y)
# Hessian: X^T W X where W is diagonal with p*(1-p)
W = np.diag(p * (1 - p))
hess = X.T @ W @ X
# Newton update
try:
delta = np.linalg.solve(hess, grad)
except np.linalg.LinAlgError:
hess_reg = hess + 1e-5 * np.eye(n_features)
delta = np.linalg.solve(hess_reg, grad)
theta = theta - delta
# Check convergence
if np.linalg.norm(grad) < 1e-6:
print(f'Logistic regression converged at iteration {i+1}')
break
return theta
# Generate synthetic logistic regression data
np.random.seed(42)
n_samples, n_features = 100, 3
X = np.random.randn(n_samples, n_features)
true_theta = np.array([1.5, -1.0, 0.5])
probs = sigmoid(X @ true_theta)
y = (np.random.rand(n_samples) < probs).astype(float)
print('
=== Logistic Regression with Newton ===')
theta_lr = logistic_newton(X, y)
print(f'Estimated parameters: {theta_lr}')
print(f'True parameters: {true_theta}')
Using Libraries (scipy.optimize.minimize (Newton-CG, trust-ncg), torch.autograd (for Hessian computation))
import numpy as np
from scipy.optimize import minimize
# scipy.optimize provides Newton-CG and trust-ncg methods
def objective(x):
return x[0]**2 + 2*x[1]**2 + 0.5*x[0]*x[1] - x[0] + 2*x[1]
def gradient(x):
return np.array([2*x[0] + 0.5*x[1] - 1, 4*x[1] + 0.5*x[0] + 2])
def hessian(x):
return np.array([[2, 0.5], [0.5, 4]])
# Newton-CG method
result = minimize(objective, x0=[0, 0], method='Newton-CG',
jac=gradient, hess=hessian,
options={'xtol': 1e-8, 'disp': True})
print(f'Optimum: {result.x}')
print(f'Function value: {result.fun}')
# Trust-region Newton
result_trust = minimize(objective, x0=[0, 0], method='trust-ncg',
jac=gradient, hess=hessian)
print(f'
Trust-region result: {result_trust.x}')
# For deep learning (rarely used directly)
# PyTorch Hessian computation (for small models)
import torch
def compute_hessian_pytorch(model, loss_fn, data, target):
"""Compute Hessian for small neural network."""
params = [p for p in model.parameters() if p.requires_grad]
flat_params = torch.cat([p.view(-1) for p in params])
# Compute gradient
loss = loss_fn(model(data), target)
grads = torch.autograd.grad(loss, params, create_graph=True)
flat_grad = torch.cat([g.view(-1) for g in grads])
# Compute Hessian
n_params = flat_params.numel()
hessian = torch.zeros(n_params, n_params)
for i in range(n_params):
grad2 = torch.autograd.grad(flat_grad[i], params, retain_graph=True)
hessian[i] = torch.cat([g.view(-1) for g in grad2])
return hessian
When to Use
✅ Appropriate Use Cases:
- When the problem is convex and medium-sized (hundreds to thousands of parameters)
- For quadratic or nearly-quadratic objective functions
- When Hessian can be computed efficiently or has special structure
- Logistic regression and GLMs (via IRLS) where Hessian has closed form
- When extremely fast convergence is needed and memory permits
- For problems where curvature information significantly improves optimization
- Gaussian process hyperparameter optimization
- Engineering optimization problems with expensive function evaluations
❌ Avoid When:
- For deep neural networks (millions of parameters—Hessian is intractable)
- When Hessian computation is too expensive (O(n²) memory, O(n³) inversion)
- For large-scale machine learning problems (use SGD or Adam instead)
- When the objective is non-convex without safeguards (can converge to saddle points)
- For online/streaming learning where data arrives incrementally
- When Hessian is not positive definite and causes numerical issues
- For problems where gradient computation is already expensive (Hessian even worse)
Common Pitfalls
- {'pitfall': 'Computational cost', 'description': 'Computing and inverting the Hessian costs O(n²) memory and O(n³) time, prohibitive for large n.', 'solution': 'Use quasi-Newton methods (L-BFGS) that approximate Hessian, or conjugate gradient for large systems.'}
- {'pitfall': 'Non-positive definite Hessian', 'description': 'In non-convex regions, Hessian may have negative eigenvalues, causing updates to move uphill.', 'solution': 'Add regularization (H + λI), use trust-region methods, or switch to gradient descent in problematic regions.'}
- {'pitfall': 'Convergence to saddle points', 'description': "Newton's method treats saddle points attractively since gradient is zero, unlike gradient descent which repels from saddles.", 'solution': 'Check Hessian eigenvalues; if negative, use gradient direction or add sufficient regularization.'}
- {'pitfall': 'Inexact Hessian approximation', 'description': 'Using finite differences or approximations can lead to poor convergence.', 'solution': 'Use analytical Hessian when possible; if approximating, ensure sufficient accuracy.'}
- {'pitfall': 'Poor performance far from optimum', 'description': "Newton's method assumes quadratic approximation which may be poor far from minimum.", 'solution': 'Use damped Newton with line search, or trust-region methods that limit step size.'}