Newton's Method

Advanced Optimization
~9 min read Optimization
Prerequisites:

Definition

Newton's method is a second-order optimization algorithm that uses both the gradient (first derivative) and the Hessian matrix (second derivative) to find the minimum of a function. Unlike first-order methods like gradient descent that only consider the local slope, Newton's method approximates the loss function as a quadratic using Taylor expansion and jumps directly to the minimum of that approximation. This provides significantly faster convergence, achieving quadratic convergence rates near the optimum compared to the linear rate of gradient descent. The method computes the optimal step by multiplying the gradient by the inverse Hessian, effectively rescaling the update direction to account for the local curvature of the loss landscape. Newton's method is particularly powerful for convex optimization and problems where the Hessian can be computed efficiently.

Intuition

💡

Imagine you're blindfolded in a valley and want to reach the lowest point. Gradient descent tells you which way is downhill right now, like feeling the ground at your feet and stepping in the steepest direction. Newton's method is like having a magical ability to sense not just the slope but also how the slope is changing (curvature) in all directions. If you're on a gentle slope that's about to become steep, Newton's method knows to take a bigger step. If you're on steep terrain that's flattening out, it knows to take a smaller step. The Hessian inverse acts like a 'curvature-aware compass' that tells you exactly how far to step in each direction, not just which way to go. In a quadratic bowl, Newton's method jumps directly to the bottom in one step, while gradient descent takes many small steps zigzagging down.

Mathematical Formula

\[ \text{Update Rule:} \quad \theta_{t+1} = \theta_t - H^{-1}(\theta_t) abla L(\theta_t) \text{Where:} \quad H_{ij} = \frac{\partial^2 L}{\partial \theta_i \partial \theta_j} \text{Damped Newton:} \quad \theta_{t+1} = \theta_t - \alpha H^{-1}(\theta_t) abla L(\theta_t) \text{Regularized:} \quad \theta_{t+1} = \theta_t - (H + \lambda I)^{-1} abla L(\theta_t) \]

Step-by-Step Explanation:

  1. Step 1: Compute gradient vector g = ∇L(θ_t) at current parameters
  2. Step 2: Compute Hessian matrix H, where H_ij = ∂²L/∂θ_i∂θ_j (second derivatives)
  3. Step 3: Solve linear system H * Δθ = g for update direction Δθ (or compute H^{-1} * g directly)
  4. Step 4: Update parameters: θ_{t+1} = θ_t - H^{-1} * g (for convex problems, subtract; adjust sign for maximization)
  5. Step 5: Damped version uses step size α (found via line search) for non-quadratic functions
  6. Step 6: Regularized version adds λI to H for numerical stability when H is not positive definite
  7. Key insight: H^{-1} rescales gradient direction based on local curvature, providing optimal step sizes

Real-World Use Cases

Logistic Regression

Training logistic regression with Newton's method (IRLS - Iteratively Reweighted Least Squares) converges in 5-10 iterations compared to hundreds for gradient descent, especially effective for small-to-medium datasets.

Portfolio Optimization

Mean-variance portfolio optimization where the objective is quadratic; Newton's method finds the optimal asset allocation that maximizes return for given risk in few iterations.

Gaussian Process Regression

Optimizing hyperparameters of Gaussian Processes requires maximizing marginal likelihood; Newton's method efficiently handles the non-convex optimization.

Structural Engineering

Finite element analysis optimization where equilibrium equations are solved; Newton-Raphson method is standard for nonlinear structural analysis.

Robotics

Inverse kinematics problems where robot joint angles must be computed to achieve desired end-effector positions; Newton's method provides fast convergence.

Chemical Engineering

Process optimization and parameter estimation in chemical reaction networks where accurate convergence is critical.

Implementation

Manual Implementation (No Libraries)

The implementation shows Newton's method for general optimization and specifically for logistic regression (IRLS). For the quadratic function, it converges in 1 iteration since Newton's method is exact for quadratics. For logistic regression, the Hessian is computed as X^T W X where W contains the predicted probabilities. Cholesky decomposition provides numerical stability. Regularization handles cases where Hessian is not positive definite.
import numpy as np
from scipy.optimize import minimize

def newtons_method(grad_func, hess_func, theta0, tol=1e-6, max_iter=100):
    """
    Pure Newton's method for optimization.
    
    Args:
        grad_func: Function that computes gradient ∇L(θ)
        hess_func: Function that computes Hessian H(θ)
        theta0: Initial parameter vector
        tol: Convergence tolerance
        max_iter: Maximum iterations
    
    Returns:
        theta: Optimized parameters
        history: List of (theta, loss) tuples
    """
    theta = theta0.copy()
    history = []
    
    for i in range(max_iter):
        # Compute gradient and Hessian
        grad = grad_func(theta)
        hess = hess_func(theta)
        
        # Check convergence
        grad_norm = np.linalg.norm(grad)
        if grad_norm < tol:
            print(f'Converged at iteration {i}')
            break
        
        # Solve H * delta = grad for update direction
        # Using Cholesky decomposition for stability
        try:
            L = np.linalg.cholesky(hess)
            # Solve L * y = grad, then L^T * delta = y
            y = np.linalg.solve(L, grad)
            delta = np.linalg.solve(L.T, y)
        except np.linalg.LinAlgError:
            # Hessian not positive definite, add regularization
            print(f'Iteration {i}: Hessian not PD, adding regularization')
            hess_reg = hess + 0.1 * np.eye(len(theta))
            delta = np.linalg.solve(hess_reg, grad)
        
        # Update parameters
        theta = theta - delta
        history.append((theta.copy(), grad_norm))
    
    return theta, history

# Example: Quadratic function f(x) = 0.5 * x^T A x + b^T x + c
# Minimum at x = -A^{-1} b
A = np.array([[4, 1], [1, 3]])
b = np.array([-1, 2])

def grad_quad(x):
    return A @ x + b

def hess_quad(x):
    return A  # Constant for quadratic

print('=== Newton on Quadratic Function ===')
x0 = np.array([0.0, 0.0])
x_opt, history = newtons_method(grad_quad, hess_quad, x0)
print(f'Optimal x: {x_opt}')
print(f'True optimum: {-np.linalg.inv(A) @ b}')
print(f'Converged in {len(history)} iterations')

# Example: Logistic Regression
def sigmoid(z):
    return 1 / (1 + np.exp(-np.clip(z, -250, 250)))

def logistic_newton(X, y, max_iter=10):
    """Newton's method for logistic regression (IRLS)."""
    n_samples, n_features = X.shape
    theta = np.zeros(n_features)
    
    for i in range(max_iter):
        # Predictions
        z = X @ theta
        p = sigmoid(z)
        
        # Gradient
        grad = X.T @ (p - y)
        
        # Hessian: X^T W X where W is diagonal with p*(1-p)
        W = np.diag(p * (1 - p))
        hess = X.T @ W @ X
        
        # Newton update
        try:
            delta = np.linalg.solve(hess, grad)
        except np.linalg.LinAlgError:
            hess_reg = hess + 1e-5 * np.eye(n_features)
            delta = np.linalg.solve(hess_reg, grad)
        
        theta = theta - delta
        
        # Check convergence
        if np.linalg.norm(grad) < 1e-6:
            print(f'Logistic regression converged at iteration {i+1}')
            break
    
    return theta

# Generate synthetic logistic regression data
np.random.seed(42)
n_samples, n_features = 100, 3
X = np.random.randn(n_samples, n_features)
true_theta = np.array([1.5, -1.0, 0.5])
probs = sigmoid(X @ true_theta)
y = (np.random.rand(n_samples) < probs).astype(float)

print('
=== Logistic Regression with Newton ===')
theta_lr = logistic_newton(X, y)
print(f'Estimated parameters: {theta_lr}')
print(f'True parameters: {true_theta}')

Using Libraries (scipy.optimize.minimize (Newton-CG, trust-ncg), torch.autograd (for Hessian computation))

import numpy as np
from scipy.optimize import minimize

# scipy.optimize provides Newton-CG and trust-ncg methods
def objective(x):
    return x[0]**2 + 2*x[1]**2 + 0.5*x[0]*x[1] - x[0] + 2*x[1]

def gradient(x):
    return np.array([2*x[0] + 0.5*x[1] - 1, 4*x[1] + 0.5*x[0] + 2])

def hessian(x):
    return np.array([[2, 0.5], [0.5, 4]])

# Newton-CG method
result = minimize(objective, x0=[0, 0], method='Newton-CG',
                  jac=gradient, hess=hessian,
                  options={'xtol': 1e-8, 'disp': True})
print(f'Optimum: {result.x}')
print(f'Function value: {result.fun}')

# Trust-region Newton
result_trust = minimize(objective, x0=[0, 0], method='trust-ncg',
                        jac=gradient, hess=hessian)
print(f'
Trust-region result: {result_trust.x}')

# For deep learning (rarely used directly)
# PyTorch Hessian computation (for small models)
import torch

def compute_hessian_pytorch(model, loss_fn, data, target):
    """Compute Hessian for small neural network."""
    params = [p for p in model.parameters() if p.requires_grad]
    flat_params = torch.cat([p.view(-1) for p in params])
    
    # Compute gradient
    loss = loss_fn(model(data), target)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.view(-1) for g in grads])
    
    # Compute Hessian
    n_params = flat_params.numel()
    hessian = torch.zeros(n_params, n_params)
    
    for i in range(n_params):
        grad2 = torch.autograd.grad(flat_grad[i], params, retain_graph=True)
        hessian[i] = torch.cat([g.view(-1) for g in grad2])
    
    return hessian

When to Use

✅ Appropriate Use Cases:

  • When the problem is convex and medium-sized (hundreds to thousands of parameters)
  • For quadratic or nearly-quadratic objective functions
  • When Hessian can be computed efficiently or has special structure
  • Logistic regression and GLMs (via IRLS) where Hessian has closed form
  • When extremely fast convergence is needed and memory permits
  • For problems where curvature information significantly improves optimization
  • Gaussian process hyperparameter optimization
  • Engineering optimization problems with expensive function evaluations

❌ Avoid When:

  • For deep neural networks (millions of parameters—Hessian is intractable)
  • When Hessian computation is too expensive (O(n²) memory, O(n³) inversion)
  • For large-scale machine learning problems (use SGD or Adam instead)
  • When the objective is non-convex without safeguards (can converge to saddle points)
  • For online/streaming learning where data arrives incrementally
  • When Hessian is not positive definite and causes numerical issues
  • For problems where gradient computation is already expensive (Hessian even worse)

Common Pitfalls

  • {'pitfall': 'Computational cost', 'description': 'Computing and inverting the Hessian costs O(n²) memory and O(n³) time, prohibitive for large n.', 'solution': 'Use quasi-Newton methods (L-BFGS) that approximate Hessian, or conjugate gradient for large systems.'}
  • {'pitfall': 'Non-positive definite Hessian', 'description': 'In non-convex regions, Hessian may have negative eigenvalues, causing updates to move uphill.', 'solution': 'Add regularization (H + λI), use trust-region methods, or switch to gradient descent in problematic regions.'}
  • {'pitfall': 'Convergence to saddle points', 'description': "Newton's method treats saddle points attractively since gradient is zero, unlike gradient descent which repels from saddles.", 'solution': 'Check Hessian eigenvalues; if negative, use gradient direction or add sufficient regularization.'}
  • {'pitfall': 'Inexact Hessian approximation', 'description': 'Using finite differences or approximations can lead to poor convergence.', 'solution': 'Use analytical Hessian when possible; if approximating, ensure sufficient accuracy.'}
  • {'pitfall': 'Poor performance far from optimum', 'description': "Newton's method assumes quadratic approximation which may be poor far from minimum.", 'solution': 'Use damped Newton with line search, or trust-region methods that limit step size.'}