Hyperparameter Optimization with Optuna
Definition
Optuna is an open-source hyperparameter optimization framework for machine learning. Its default sampler is the Tree-structured Parzen Estimator (TPE), a Bayesian-optimization-style method that uses the results of earlier trials to choose the next configuration, so it typically finds a good configuration in fewer trials than grid or random search. Other samplers include CMA-ES and random search, and pruners can stop unpromising trials early to save computation. The define-by-run API builds the search space dynamically inside the objective function, which makes conditional parameters and complex pipelines straightforward to express. Optuna is framework-agnostic and is commonly used with scikit-learn, PyTorch, TensorFlow, XGBoost, and LightGBM. It also supports distributed optimization across multiple machines through shared storage, and includes visualization tools for analyzing optimization history and hyperparameter importance.
Intuition
Imagine you're trying to find the highest point on a mountain range while blindfolded. Grid search walks every possible path (slow), random search wanders randomly (wasteful), but Optuna is like an experienced guide who remembers every path taken and uses that knowledge to intelligently choose the next direction. It quickly learns which regions are promising and focuses the search there, finding peaks faster while ignoring flat areas.
Mathematical Formula
Step-by-Step Explanation:
- θ: hyperparameter configuration
- y: objective function value (loss/score)
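- y*: threshold, chosen as a quantile of the observed y values, that separates good trials from bad ones
- l(θ): density of good configurations (y below y* when minimizing)
- g(θ): density of bad configurations (the remaining trials)
- EI(θ): expected improvement, which guides the choice of the next trial
- TPE models p(θ|y) with two densities, l(θ) for good trials and g(θ) for bad ones, rather than modeling p(y|θ) directly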
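In terms of the symbols above, the standard TPE formulation can be written as follows. This is a sketch for the minimization case. The symbol γ (the fraction of trials treated as good) is not defined in the list above and is introduced here as an assumption.

```latex
\[
p(\theta \mid y) =
\begin{cases}
  l(\theta) & \text{if } y < y^{*} \\
  g(\theta) & \text{if } y \ge y^{*}
\end{cases}
\qquad\text{with}\qquad \gamma = P(y < y^{*})
\]

\[
\mathrm{EI}_{y^{*}}(\theta)
  = \int_{-\infty}^{y^{*}} (y^{*} - y)\, p(y \mid \theta)\, dy
  \;\propto\; \left( \gamma + (1 - \gamma)\, \frac{g(\theta)}{l(\theta)} \right)^{-1}
\]
```

Expected improvement is therefore largest where the ratio l(θ)/g(θ) is largest, which is why the next trial is chosen among candidates that are likely under the good density and unlikely under the bad one.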
Real-World Use Cases
CNN architecture tuning: Optimize learning rate, batch size, dropout rates, and convolutional filter counts to maximize image classification accuracy while minimizing training time through early stopping.
Transformer fine-tuning: Search optimal learning rate schedules, warmup steps, and layer-wise decay rates for BERT fine-tuning on domain-specific text classification tasks.
Collaborative filtering optimization: Tune embedding dimensions, regularization coefficients, and learning rates for matrix factorization models to improve recommendation relevance metrics.
Time series model tuning: Optimize window sizes, LSTM units, and regularization for stock price prediction while pruning trials that show poor convergence early.
Implementation
Baseline: Grid and Random Search (scikit-learn)
# Grid search baseline: exhaustive, so cost multiplies with every added parameter
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split

# Example data (assumption: Iris, matching the Optuna example below)
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Define grid of hyperparameters
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4]
}

# Grid search tries ALL combinations (3 × 4 × 3 × 3 = 108), i.e. 540 fits with cv=5
rf = RandomForestClassifier(random_state=42)
grid_search = GridSearchCV(
    rf, param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)

# Time-consuming: every combination is evaluated, with no use of earlier results
grid_search.fit(X_train, y_train)
print(f'Best: {grid_search.best_params_}')
print(f'Score: {grid_search.best_score_:.4f}')

# Random search: cheaper, but it still ignores earlier results
random_search = RandomizedSearchCV(
    rf, param_grid, n_iter=20,  # only 20 of the 108 combinations
    cv=5, scoring='accuracy', random_state=42
)
random_search.fit(X_train, y_train)
print(f'Best: {random_search.best_params_}')
print(f'Score: {random_search.best_score_:.4f}')
# Optuna's TPE approach:
# 1. Sample configurations
# 2. Separate into good/bad groups based on performance
# 3. Model l(θ) and g(θ) distributions
# 4. Sample next θ from region where l(θ)/g(θ) is highest
# 5. Repeat, getting smarter each iteration
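The five steps above can be sketched in a few lines of NumPy. This is a toy, one-dimensional illustration under simplifying assumptions (fixed kernel bandwidth, minimization, a single continuous parameter), not Optuna's actual implementation. The function names `kde_1d` and `toy_tpe_minimize` and all default values are hypothetical.

```python
import numpy as np

def kde_1d(points, x, bandwidth):
    """Gaussian kernel density estimate of `points`, evaluated at each value in `x`."""
    diffs = (x[:, None] - points[None, :]) / bandwidth
    norm = len(points) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * diffs ** 2).sum(axis=1) / norm

def toy_tpe_minimize(f, low, high, n_trials=60, n_startup=10,
                     gamma=0.25, n_candidates=64, seed=0):
    rng = np.random.default_rng(seed)
    bandwidth = 0.1 * (high - low)  # fixed bandwidth: a simplifying assumption
    # Step 1: start with purely random configurations
    xs = [float(x) for x in rng.uniform(low, high, size=n_startup)]
    ys = [f(x) for x in xs]
    for _ in range(n_trials - n_startup):
        x_arr = np.array(xs)
        order = np.argsort(ys)
        # Step 2: the best gamma fraction of trials is "good", the rest "bad"
        n_good = max(1, int(np.ceil(gamma * len(xs))))
        good, bad = x_arr[order[:n_good]], x_arr[order[n_good:]]
        # Steps 3-4: draw candidates from l (the KDE over good points)
        # and keep the one with the highest ratio l / g
        centers = rng.choice(good, size=n_candidates)
        noise = rng.normal(0.0, bandwidth, size=n_candidates)
        candidates = np.clip(centers + noise, low, high)
        l_vals = kde_1d(good, candidates, bandwidth)
        g_vals = kde_1d(bad, candidates, bandwidth)
        x_next = float(candidates[np.argmax(l_vals / (g_vals + 1e-12))])
        # Step 5: evaluate, record, and repeat with the enlarged history
        xs.append(x_next)
        ys.append(f(x_next))
    return xs, ys

xs, ys = toy_tpe_minimize(lambda x: (x - 2.0) ** 2, low=-5.0, high=5.0)
best = int(np.argmin(ys))
print(f"best x = {xs[best]:.3f}, best y = {ys[best]:.4f}")
```

Candidates are drawn from the good density and ranked by l/g, so sampling concentrates where good trials cluster and bad trials are sparse.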
Using Libraries (optuna, scikit-learn)
# Optuna hyperparameter optimization
import optuna
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Load data
X, y = load_iris(return_X_y=True)

# Define objective function: returns the value Optuna should maximize
def objective(trial):
    # Define search space
    n_estimators = trial.suggest_int('n_estimators', 50, 300)
    max_depth = trial.suggest_int('max_depth', 3, 15)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 20)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 10)
    max_features = trial.suggest_categorical('max_features', ['sqrt', 'log2', None])
    # Model with sampled hyperparameters
    clf = RandomForestClassifier(
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        max_features=max_features,
        random_state=42
    )
    # Mean 5-fold cross-validation accuracy
    return cross_val_score(clf, X, y, cv=5, scoring='accuracy').mean()

# Create study and optimize (the seeded sampler makes sequential runs repeatable)
study = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.TPESampler(seed=42)
)
# Run optimization with 100 trials
study.optimize(objective, n_trials=100, show_progress_bar=True)

# Best results
print(f'Best score: {study.best_value:.4f}')
print(f'Best params: {study.best_params}')

# Visualizations (require plotly); each call returns a figure, so call .show() outside notebooks
optuna.visualization.plot_optimization_history(study).show()
optuna.visualization.plot_param_importances(study).show()
optuna.visualization.plot_slice(study).show()

# Export the trial history (requires pandas); this saves results, not a resumable study
study.trials_dataframe().to_csv('optuna_results.csv', index=False)
# Pruning example - stop bad trials early
# (skeleton: model setup and the training step are elided, so `model` is a placeholder)
def objective_with_pruning(trial):
    # ... model setup ...
    for step in range(100):  # Training epochs
        # ... training step ...
        accuracy = model.validate()
        # Report intermediate result
        trial.report(accuracy, step)
        # Prune if unpromising
        if trial.should_prune():
            raise optuna.TrialPruned()
    return accuracy

# Use pruner
study = optuna.create_study(
    direction='maximize',
    pruner=optuna.pruners.MedianPruner()
)
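# Distributed optimization: every worker runs this same code against shared storage
# (a PostgreSQL driver such as psycopg2 must be installed)
study = optuna.create_study(
    study_name='distributed_optimization',
    storage='postgresql://user:pass@localhost/optuna',
    direction='maximize',
    load_if_exists=True  # join the existing study instead of raising an error
)
# Multiple workers can now call study.optimize(...) simultaneously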
When to Use
✅ Appropriate Use Cases:
- Training models with many hyperparameters (>3)
- When grid search is computationally infeasible
- Limited computational budget for hyperparameter search
- Need to optimize complex pipelines with conditional parameters
- Distributed optimization across multiple machines
- Early stopping to save computation on poor configurations
- When hyperparameter importance analysis is needed
- Neural architecture search (NAS) and AutoML
- Model training where small metric gains justify the extra tuning compute
❌ Avoid When:
- Very small hyperparameter spaces (use grid search)
- When bit-for-bit determinism is required and trials run in parallel (a seeded sampler gives repeatable results only when trials run sequentially)
- Extremely fast models where overhead exceeds benefit
- Simple problems with well-known default parameters
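- When the budget allows only a handful of trials (TPE starts with random sampling, 10 trials by default, so it offers little advantage over random search)
- When hyperparameters interact strongly, unless multivariate TPE is enabled (by default TPE models each parameter independently)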
Common Pitfalls
- Using too few trials for the sampler to converge (as a rough rule of thumb, at least 50 for simple spaces and 100 or more for complex ones)
- Defining search spaces too wide (wastes samples on bad regions)
- Not setting random seeds for reproducibility
- Ignoring pruner configuration (wastes time on bad trials)
- Optimizing on training set without proper cross-validation
- Not accounting for computational cost in objective
- Using default TPE parameters for all problems
- Forgetting to save and load studies for long optimizations
- Not validating final model on held-out test set
- Optimizing too many parameters simultaneously (curse of dimensionality)