Hyperparameter Optimization with Optuna

Definition

Optuna is an open-source hyperparameter optimization framework designed for machine learning. By default it uses the Tree-structured Parzen Estimator (TPE), a Bayesian optimization method, to search the hyperparameter space, and it typically finds good configurations in fewer trials than grid or random search. Other available samplers include CMA-ES and random search, and pruners can stop unpromising trials early to save computation time. The define-by-run API constructs the search space dynamically inside the objective function, which makes it flexible for complex pipelines with conditional parameters. Optuna provides integrations for popular ML frameworks such as scikit-learn, PyTorch, TensorFlow, XGBoost, and LightGBM. It also supports distributed optimization across multiple machines through a shared storage backend and includes visualization tools for analyzing optimization results and hyperparameter importance.

Intuition

Imagine you're trying to find the highest point on a mountain range while blindfolded. Grid search walks every possible path (slow), random search wanders randomly (wasteful), but Optuna is like an experienced guide who remembers every path taken and uses that knowledge to intelligently choose the next direction. It quickly learns which regions are promising and focuses the search there, finding peaks faster while ignoring flat areas.

Mathematical Formula

TPE Algorithm:

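$$P(\theta \mid y) = \begin{cases} l(\theta) & \text{if } y < y^* \\ g(\theta) & \text{if } y \ge y^* \end{cases}$$

Expected Improvement:

$$\mathrm{EI}(\theta) = \mathbb{E}\left[\max(0,\; y^* - y)\right]$$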

Step-by-Step Explanation:

θ: hyperparameter configuration
y: objective function value (loss/score)
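y*: threshold on the objective, set at a quantile of the observed values, separating good from bad trials
l(θ): density of good configurations (objective below y*, assuming the objective is minimized)
g(θ): density of bad configurations (objective at or above y*)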
EI(θ): expected improvement, guiding next trial selection
TPE models P(θ|y) separately for good and bad regions
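
The link between the two formulas above can be made explicit. The following is a sketch of the standard TPE result, assuming a minimized objective and writing γ = P(y < y*) for the quantile that defines y* (γ is notation introduced here; it does not appear in the text above):

```latex
\mathrm{EI}_{y^*}(\theta)
  = \int_{-\infty}^{y^*} (y^* - y)\, p(y \mid \theta)\, dy
  \;\propto\; \left( \gamma + \frac{g(\theta)}{l(\theta)}\,(1 - \gamma) \right)^{-1}
```

Under these assumptions, expected improvement grows as the ratio l(θ)/g(θ) grows, which is why TPE can rank candidate configurations by that ratio instead of evaluating the integral directly.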

Real-World Use Cases

Computer Vision

CNN architecture tuning: Optimize learning rate, batch size, dropout rates, and convolutional filter counts to maximize image classification accuracy while minimizing training time through early stopping.

Natural Language Processing

Transformer fine-tuning: Search optimal learning rate schedules, warmup steps, and layer-wise decay rates for BERT fine-tuning on domain-specific text classification tasks.

Recommendation Systems

Collaborative filtering optimization: Tune embedding dimensions, regularization coefficients, and learning rates for matrix factorization models to improve recommendation relevance metrics.

Financial Forecasting

Time series model tuning: Optimize window sizes, LSTM units, and regularization for stock price prediction while pruning trials that show poor convergence early.

Implementation

Manual Search Baselines (scikit-learn)

The number of grid-search combinations grows exponentially with the number of parameters. Optuna's Bayesian approach instead builds a probabilistic model of the objective function from past trials and uses it to choose the next configuration to try.

# Manual grid search - inefficient
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

# Load data and hold out a test set (assumption: the Iris dataset,
# matching the Optuna example later in this section)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Define grid of hyperparameters
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Grid search tries ALL combinations (3 × 4 × 3 × 3 = 108),
# each fitted 5 times for cross-validation (540 fits)
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(
    rf, param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Time-consuming - no intelligent selection
grid_search.fit(X_train, y_train)
print(f'Best: {grid_search.best_params_}')
print(f'Score: {grid_search.best_score_:.4f}')

# Random search - cheaper, but still ignores past results
random_search = RandomizedSearchCV(
    rf, param_grid, n_iter=20,  # Only 20 random samples
    cv=5, scoring='accuracy', random_state=42
)
random_search.fit(X_train, y_train)
print(f'Best: {random_search.best_params_}')
print(f'Score: {random_search.best_score_:.4f}')

# Optuna's TPE approach:
# 1. Sample configurations
# 2. Separate into good/bad groups based on performance
# 3. Model l(θ) and g(θ) distributions
# 4. Draw candidates from l(θ) and keep the one where l(θ)/g(θ) is highest
# 5. Repeat, getting smarter each iteration
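
The five steps in the comments above can be sketched without any libraries. This is a deliberately simplified, one-dimensional toy and not Optuna's implementation: the objective, the fixed kernel bandwidth, and the parameter names (`gamma`, `n_startup`, `n_candidates`) are all assumptions made for illustration, and Optuna's sampler uses different bandwidth heuristics and priors.

```python
import math
import random


def objective(x):
    # Hypothetical toy objective to minimize; its minimum is at x = 2.
    return (x - 2.0) ** 2


def kde(x, points, bandwidth):
    # Gaussian kernel density estimate at x built from the given points.
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm


def tpe_minimize(func, low, high, n_trials=60, n_startup=10,
                 gamma=0.25, n_candidates=24, seed=0):
    rng = random.Random(seed)
    bandwidth = (high - low) / 10.0  # fixed bandwidth, chosen for simplicity
    history = []  # list of (x, y) pairs, one per trial
    for i in range(n_trials):
        if i < n_startup:
            # Step 1: start with random configurations.
            x = rng.uniform(low, high)
        else:
            # Step 2: split past trials into good (best gamma fraction) and bad.
            ranked = sorted(history, key=lambda t: t[1])
            n_good = max(1, math.ceil(gamma * len(ranked)))
            good = [p for p, _ in ranked[:n_good]]
            bad = [p for p, _ in ranked[n_good:]]
            # Steps 3-4: draw candidates from l (a KDE over good points)
            # and keep the candidate with the highest l/g ratio.
            candidates = [
                min(high, max(low, rng.gauss(rng.choice(good), bandwidth)))
                for _ in range(n_candidates)
            ]
            x = max(
                candidates,
                key=lambda c: kde(c, good, bandwidth) / (kde(c, bad, bandwidth) + 1e-12),
            )
        # Step 5: evaluate and record, so later iterations use this result.
        history.append((x, func(x)))
    best = min(history, key=lambda t: t[1])
    return best, history


(best_x, best_y), history = tpe_minimize(objective, -10.0, 10.0)
print(f"best x = {best_x:.3f}, objective = {best_y:.4f}, trials = {len(history)}")
```

The split into `good` and `bad` corresponds to the threshold y*, and the two `kde` calls stand in for l(θ) and g(θ).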

Using Libraries (optuna, scikit-learn)

# Optuna hyperparameter optimization
import optuna
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_iris

# Load data
data = load_iris()
X, y = data.data, data.target

# Define objective function
def objective(trial):
    # Define search space
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    max_depth = trial.suggest_int('max_depth', 3, 15)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 20)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 10)
    max_features = trial.suggest_categorical('max_features', ['sqrt', 'log2', None])
    
    # Model with sampled hyperparameters
    clf = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        max_features=max_features,
        random_state=42
    )
    
    # Cross-validation score
    score = cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean()
    
    return score

# Create study and optimize
study = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.TPESampler(seed=42)
)

# Run optimization with 100 trials
study.optimize(objective, n_trials=100, show_progress_bar=True)

# Best results
print(f'Best score: {study.best_value:.4f}')
print(f'Best params: {study.best_params}')

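# Visualizations (require plotly; each call returns a figure object)
optuna.visualization.plot_optimization_history(study).show()
optuna.visualization.plot_param_importances(study).show()
optuna.visualization.plot_slice(study).show()

# Export trial results to CSV (requires pandas)
study.trials_dataframe().to_csv('optuna_results.csv', index=False)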

# Pruning example - stop bad trials early
def objective_with_pruning(trial):
    # ... model setup ...
    
    for step in range(100):  # Training epochs
        # ... training step ...
        accuracy = model.validate()
        
        # Report intermediate result
        trial.report(accuracy, step)
        
        # Prune if unpromising
        if trial.should_prune():
            raise optuna.TrialPruned()
    
    return accuracy

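# Use pruner (MedianPruner stops a trial whose intermediate value is
# worse than the median of earlier trials at the same step)
study = optuna.create_study(
    direction='maximize',
    pruner=optuna.pruners.MedianPruner()
)
study.optimize(objective_with_pruning, n_trials=100)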

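# Distributed optimization (requires a PostgreSQL driver such as psycopg2)
study = optuna.create_study(
    study_name='distributed_optimization',
    storage='postgresql://user:pass@localhost/optuna',
    direction='maximize',
    load_if_exists=True  # lets every worker attach to the same study
)
# Each worker runs this same code and then calls study.optimize(...);
# trials are shared through the database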

When to Use

✅ Appropriate Use Cases:

Training models with many hyperparameters (>3)
When grid search is computationally infeasible
Limited computational budget for hyperparameter search
Need to optimize complex pipelines with conditional parameters
Distributed optimization across multiple machines
Early stopping to save computation on poor configurations
When hyperparameter importance analysis is needed
Neural architecture search (NAS) and AutoML
Model training where the gain from tuning justifies the extra search cost

❌ Avoid When:

Very small hyperparameter spaces (use grid search)
When you need identical results on every run but cannot fix the sampler seed or run trials sequentially (TPE is stochastic, and parallel runs are not reproducible)
Extremely fast models where overhead exceeds benefit
Simple problems with well-known default parameters
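Budgets of roughly ten trials or fewer (the TPE sampler's startup trials are random by default, so it has little advantage over random search)
When hyperparameters interact strongly (the default TPE sampler models each parameter independently; TPESampler(multivariate=True) relaxes this)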

Common Pitfalls

Using too few trials to converge (as a rough guide, fewer than 50 for simple search spaces or fewer than 100 for complex ones)
Defining search spaces too wide (wastes samples on bad regions)
Not setting random seeds for reproducibility
Ignoring pruner configuration (wastes time on bad trials)
Optimizing on training set without proper cross-validation
Not accounting for computational cost in objective
Using default TPE parameters for all problems
Forgetting to save and load studies for long optimizations
Not validating final model on held-out test set
Optimizing too many parameters simultaneously (curse of dimensionality)