Hyperparameter Tuning in Python — Deep Dive
The Search Space Problem
A gradient-boosted tree model like XGBoost has parameters including max_depth, learning_rate, n_estimators, min_child_weight, subsample, colsample_bytree, and reg_alpha. If you tested just 5 values for each of these 7 parameters, grid search would need 5^7 = 78,125 model fits. At 30 seconds per fit, that is 27 days of compute. Clearly, brute force does not scale.
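The arithmetic above is easy to reproduce and to adapt to your own grid. The short sketch below recomputes it; the `cv_folds` value is an added assumption (the figure in the text counts one fit per combination), included to show how cross-validation multiplies the cost.

```python
# Back-of-the-envelope cost of an exhaustive grid.
values_per_param = 5      # candidate values tried for each hyperparameter
n_params = 7              # number of hyperparameters being tuned
seconds_per_fit = 30      # assumed average time for a single model fit
cv_folds = 5              # assumption: 5-fold CV, not part of the figure in the text

n_combinations = values_per_param ** n_params
days_single_fit = n_combinations * seconds_per_fit / 86_400  # 86,400 seconds per day
days_with_cv = days_single_fit * cv_folds

print(f"{n_combinations:,} combinations")
print(f"{days_single_fit:.1f} days at one fit per combination")
print(f"{days_with_cv:.1f} days with {cv_folds}-fold cross-validation")
```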
Grid Search with Scikit-Learn
For small search spaces, GridSearchCV remains the simplest option:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier
# 2 x 3 x 2 = 12 combinations, each fitted once per CV fold.
param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1],
}
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="f1",
    n_jobs=-1,   # use all available CPU cores
    verbose=1,
)
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best F1: {grid.best_score_:.4f}")
The results are stored in grid.cv_results_, a dictionary containing scores for every parameter combination — useful for plotting heatmaps of performance across two parameters.
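As a rough illustration of the heatmap idea, the sketch below arranges mean CV scores from `cv_results_` into a two-parameter matrix. The synthetic data, the tiny grid, and the names `X_demo`, `y_demo`, and `demo_grid` are assumptions chosen so the example runs quickly on its own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; replace with your own X_train, y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

depths = [2, 3]
rates = [0.05, 0.1]
demo_grid = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=42),
    {"max_depth": depths, "learning_rate": rates},
    cv=3,
    scoring="f1",
)
demo_grid.fit(X_demo, y_demo)

# Arrange mean CV scores as a (max_depth x learning_rate) matrix.
heat = np.full((len(depths), len(rates)), np.nan)
for params, score in zip(
    demo_grid.cv_results_["params"], demo_grid.cv_results_["mean_test_score"]
):
    row = depths.index(params["max_depth"])
    col = rates.index(params["learning_rate"])
    heat[row, col] = score

print(heat)  # pass this matrix to a plotting function to draw the heatmap
```

Each row of the matrix corresponds to one `max_depth` value and each column to one `learning_rate` value, so a flat row or column suggests that parameter matters little in the tested range.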
Random Search
RandomizedSearchCV samples from distributions rather than enumerating a grid:
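from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform, randint, uniform

param_distributions = {
    "n_estimators": randint(50, 500),          # integers in [50, 499]
    "max_depth": randint(2, 15),               # integers in [2, 14]
    "learning_rate": loguniform(0.001, 0.3),   # log scale, matching how learning rates behave
    "subsample": uniform(0.5, 0.5),            # uniform(loc, scale) covers [0.5, 1.0]
    # colsample_bytree is an XGBoost parameter; for GradientBoostingClassifier
    # the nearest analogue is max_features (fraction of features per split).
    "max_features": uniform(0.5, 0.5),
}
random_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions,
    n_iter=100,        # number of sampled configurations
    cv=5,
    scoring="f1",
    n_jobs=-1,
    random_state=42,
)
random_search.fit(X_train, y_train)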
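Bergstra and Bengio (2012) showed that random search is more efficient than grid search when only a few hyperparameters strongly affect performance, which is common in practice. A simple probability argument explains the often-quoted 60-trial rule: each random trial has a 5 percent chance of landing in the top 5 percent of the search space, so 60 independent trials give a 1 - 0.95^60 ≈ 95 percent chance that at least one does. This holds regardless of how many dimensions the space has.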
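The 60-trial figure follows from a short probability calculation that is easy to check. The sketch below assumes trials are sampled independently and uniformly over the search space; the function names are illustrative.

```python
import math

def prob_hit_top_fraction(n_trials: int, top_fraction: float = 0.05) -> float:
    """Chance that at least one of n independent random trials lands in the top fraction."""
    return 1.0 - (1.0 - top_fraction) ** n_trials

def trials_needed(confidence: float = 0.95, top_fraction: float = 0.05) -> int:
    """Smallest number of trials whose hit probability reaches the confidence level."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - top_fraction))

p60 = prob_hit_top_fraction(60)
n95 = trials_needed()
print(f"P(at least one top-5% config in 60 trials) = {p60:.3f}")
print(f"Trials needed for 95% confidence: {n95}")
```

Note that "top 5 percent" refers to rank within the sampled space, so the guarantee says nothing about how far that configuration is from the true optimum.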
Bayesian Optimization with Optuna
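Optuna is a widely used hyperparameter optimization library for Python. Its default sampler is the Tree-structured Parzen Estimator (TPE), a Bayesian-style method that uses the results of past trials to decide where to sample next: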
import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 2, 15),
        "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        # min_child_weight is an XGBoost parameter; min_samples_leaf is the
        # nearest GradientBoostingClassifier equivalent.
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
    }
    model = GradientBoostingClassifier(**params, random_state=42)
    # Mean F1 across 5 folds is the value Optuna maximizes.
    return cross_val_score(model, X_train, y_train, cv=5, scoring="f1").mean()

study = optuna.create_study(direction="maximize")
# Stop after 100 trials or one hour, whichever comes first.
study.optimize(objective, n_trials=100, timeout=3600)
print(f"Best F1: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")
Key Optuna features:
- Pruning: Uses MedianPruner to stop unpromising trials early based on intermediate results.
- Distributed search: Multiple workers can share a study via a database backend.
- Visualization: Built-in plots for parameter importance, optimization history, and parallel coordinates.
optuna.visualization.plot_param_importances(study)
optuna.visualization.plot_optimization_history(study)
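To make the pruning idea concrete, here is a simplified, library-free sketch of a median stopping rule for a maximized metric. It is not Optuna's implementation; in Optuna the objective reports intermediate values with `trial.report` and checks `trial.should_prune`. The score curves and the `n_warmup_steps` default are made-up values for illustration.

```python
from statistics import median

def should_prune(current_value, step, completed_curves, n_warmup_steps=2):
    """Median rule for a maximized metric: prune when the current trial is
    below the median of earlier trials' values at the same step."""
    if step < n_warmup_steps:
        return False  # give every trial a few steps before judging it
    peers = [curve[step] for curve in completed_curves if len(curve) > step]
    if not peers:
        return False  # nothing to compare against yet
    return current_value < median(peers)

# Hypothetical intermediate validation scores from three finished trials.
history = [
    [0.60, 0.70, 0.78, 0.82],
    [0.55, 0.68, 0.75, 0.80],
    [0.50, 0.62, 0.70, 0.74],
]
weak_trial = [0.40, 0.45, 0.50, 0.52]
strong_trial = [0.58, 0.72, 0.80, 0.85]

def first_pruned_step(curve):
    """Return the first step at which the curve would be pruned, or None."""
    for step, value in enumerate(curve):
        if should_prune(value, step, history):
            return step
    return None

print(first_pruned_step(weak_trial))
print(first_pruned_step(strong_trial))
```

The weak trial is cut as soon as the warm-up ends, while the strong trial runs to completion, which is where the compute savings come from.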
Hyperband and Successive Halving
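Scikit-learn implements successive halving, the building block of Hyperband, as HalvingGridSearchCV and HalvingRandomSearchCV. Both are still experimental, so the enabling import must come first: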
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingRandomSearchCV
halving = HalvingRandomSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions,
    n_candidates=200,
    factor=3,
    cv=5,
    scoring="f1",
    random_state=42,
)
halving.fit(X_train, y_train)
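With factor=3, each round keeps roughly the top third of candidates and triples the resource budget (by default, the number of training samples) for the survivors. Starting with 200 candidates, round 1 spends the minimum budget on all 200, round 2 spends 3× on 67, round 3 spends 9× on 23, and so on until the budget is exhausted or only a handful of candidates remain, from which the best is selected.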
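The schedule arithmetic can be sketched in a few lines. This version assumes the surviving count is rounded up each round and that there is enough data to run every round; `halving_schedule` is an illustrative helper, not a scikit-learn function.

```python
import math

def halving_schedule(n_candidates: int, factor: int = 3):
    """Return (candidates, resource_multiplier) for each evaluated round."""
    schedule = []
    multiplier = 1
    while n_candidates > 1:
        schedule.append((n_candidates, multiplier))
        n_candidates = math.ceil(n_candidates / factor)  # keep about 1/factor
        multiplier *= factor                             # survivors get more resources
    return schedule

schedule = halving_schedule(200, factor=3)
for round_no, (n, mult) in enumerate(schedule, start=1):
    print(f"round {round_no}: {n} candidates at {mult}x the starting budget")
```

Most of the total compute goes to a small number of strong candidates, which only pays off when performance at a low budget is a reasonable predictor of performance at the full budget.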
Nested Cross-Validation for Unbiased Estimates
If you tune and evaluate on the same CV folds, your reported score is optimistic. Nested CV adds an outer evaluation loop:
from sklearn.model_selection import cross_val_score, StratifiedKFold
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
tuned_model = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=inner_cv,
    scoring="f1",
    n_jobs=-1,
)
nested_scores = cross_val_score(tuned_model, X, y, cv=outer_cv, scoring="f1")
print(f"Nested CV F1: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
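Why the non-nested score is optimistic can be seen in a toy simulation. The sketch below assumes every configuration has the same true score and that each CV estimate is that score plus Gaussian noise; all constants are made up for illustration.

```python
import random
import statistics

random.seed(0)

TRUE_SCORE = 0.80     # every configuration is equally good in this toy setup
NOISE_STD = 0.02      # assumed noise in a single cross-validation estimate
N_CONFIGS = 20        # configurations compared during tuning
N_REPEATS = 2000      # how many times the whole tuning run is simulated

best_scores = []
for _ in range(N_REPEATS):
    cv_estimates = [random.gauss(TRUE_SCORE, NOISE_STD) for _ in range(N_CONFIGS)]
    best_scores.append(max(cv_estimates))   # what a non-nested search would report

reported = statistics.mean(best_scores)
optimism = reported - TRUE_SCORE
print(f"true score {TRUE_SCORE:.3f}, average reported best {reported:.3f}")
print(f"optimism from picking the maximum: {optimism:.3f}")
```

Picking the maximum of several noisy estimates inflates the reported number even though no configuration is actually better, and the outer loop of nested CV exists to remove exactly this effect.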
Early Stopping
For iterative models, early stopping halts training when validation performance stops improving, acting as implicit hyperparameter tuning for the number of iterations:
import xgboost as xgb
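model = xgb.XGBClassifier(
    n_estimators=1000,          # upper bound; early stopping picks the actual count
    learning_rate=0.05,
    max_depth=5,
    early_stopping_rounds=50,   # stop after 50 rounds without improvement
    eval_metric="logloss",
)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False,
)
# best_iteration is the 0-based round with the best validation score,
# not the round at which training stopped.
print(f"Best iteration: {model.best_iteration}")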
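With early stopping you set n_estimators to a generous upper bound instead of tuning it as a separate hyperparameter, which saves significant compute. Keep the validation set used for stopping separate from the final test set.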
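The stopping rule itself is simple. Below is a minimal, library-free sketch of patience-based early stopping on a sequence of validation losses; the loss values are hypothetical, and real libraries add details such as tolerance thresholds and multiple metrics.

```python
def early_stop(val_losses, patience):
    """Return (best_iteration, stop_iteration) for a sequence of validation losses.

    Training stops once `patience` consecutive iterations fail to improve
    on the best loss seen so far. Iterations are 0-based.
    """
    best_loss = float("inf")
    best_iteration = -1
    for iteration, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_iteration = loss, iteration
        elif iteration - best_iteration >= patience:
            return best_iteration, iteration
    return best_iteration, len(val_losses) - 1  # never triggered: ran to the end

# Hypothetical validation log-loss per boosting round.
losses = [0.70, 0.62, 0.57, 0.55, 0.56, 0.555, 0.56, 0.58, 0.54]
best_it, stop_it = early_stop(losses, patience=3)
print(f"best iteration {best_it}, stopped at iteration {stop_it}")
```

In this example the later improvement to 0.54 is never reached, which shows the tradeoff: a small patience saves compute but can stop before a late gain.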
Multi-Objective Optimization
Sometimes you want to optimize for both accuracy and inference speed, or precision and recall. Optuna supports multi-objective optimization:
import time

def multi_objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 10, 500),
        "max_depth": trial.suggest_int("max_depth", 2, 10),
    }
    model = GradientBoostingClassifier(**params, random_state=42)
    score = cross_val_score(model, X_train, y_train, cv=3, scoring="f1").mean()
    # Second objective: wall-clock time of one fit on the full training set.
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start
    return score, train_time

# Maximize F1 and minimize training time, in the order the objective returns them.
study = optuna.create_study(directions=["maximize", "minimize"])
study.optimize(multi_objective, n_trials=50)
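The result is a Pareto front rather than a single best trial: the set of configurations for which neither objective can be improved without worsening the other. Optuna exposes these trials as study.best_trials.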
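For intuition, a Pareto front over two objectives can be computed by hand. The sketch below uses hypothetical (F1, training seconds) pairs, maximizing the first and minimizing the second; `pareto_front` is an illustrative helper, not part of Optuna.

```python
def pareto_front(points):
    """Return the non-dominated (score, seconds) pairs.

    Score is maximized and seconds is minimized. A point is dominated when
    another point is at least as good on both objectives and strictly
    better on one.
    """
    front = []
    for score, seconds in points:
        dominated = any(
            (s >= score and t <= seconds) and (s > score or t < seconds)
            for s, t in points
        )
        if not dominated:
            front.append((score, seconds))
    return sorted(front)

# Hypothetical (F1, training seconds) results from five trials.
results = [(0.90, 12.0), (0.88, 4.0), (0.85, 5.0), (0.91, 30.0), (0.80, 1.0)]
front = pareto_front(results)
print(front)
```

Here the (0.85, 5.0) trial drops out because (0.88, 4.0) is both more accurate and faster; choosing among the remaining points is a business decision, not an optimization one.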
Production Workflow
A typical tuning pipeline for a production model:
- Baseline: Train with defaults, record the score.
- Broad search: 100 random trials to identify promising regions.
- Focused search: 50 Bayesian trials in the narrowed space.
- Nested CV: Get an unbiased estimate of the tuned model.
- Final train: Retrain on all data with the best hyperparameters.
- Log everything: Store parameters, scores, and artifacts in MLflow or a similar tracking tool.
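If a tracking server is not available yet, the logging step can start as a plain file. The sketch below is a minimal stand-in that appends each result to a JSON Lines file; the helper names, record fields, and file name are assumptions for illustration and not an MLflow API.

```python
import json
import tempfile
import time
from pathlib import Path

def log_trial(log_path, params, score, stage):
    """Append one tuning result to a JSON Lines file (one JSON object per line)."""
    record = {"timestamp": time.time(), "stage": stage, "params": params, "score": score}
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

def best_logged_trial(log_path):
    """Read the log back and return the record with the highest score."""
    with Path(log_path).open(encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    return max(records, key=lambda r: r["score"])

# Usage with made-up scores, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    log_file = Path(tmp) / "tuning_log.jsonl"
    log_trial(log_file, {"max_depth": 3}, 0.81, stage="baseline")
    log_trial(log_file, {"max_depth": 5}, 0.84, stage="broad_search")
    best = best_logged_trial(log_file)

print(best["stage"], best["score"])
```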
Tradeoffs Summary
| Method | Trials Needed | Finds Optimum? | Parallelizable | Complexity |
|---|---|---|---|---|
| Grid search | k^n | In the grid, yes | Fully | Low |
| Random search | 50-200 | Approximately | Fully | Low |
| Bayesian (Optuna) | 30-100 | Usually close | With DB backend | Medium |
| Hyperband | 50-200 | If early perf predicts final | Partially | Medium |
Common Pitfalls
- Tuning on the test set: Never. Use a separate validation set or CV.
- Too narrow a search range: If all sampled values are at the boundary, widen the range.
- Ignoring interactions: learning_rate and n_estimators are coupled: low learning rates need more estimators.
- Not logging experiments: Without logs, you cannot reproduce or compare results.
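The boundary check from the list above is easy to automate. The sketch below flags parameters whose best value sits near either end of its search range; the ranges, best values, and 5 percent tolerance are hypothetical.

```python
def params_at_boundary(best_params, search_ranges, tolerance=0.05):
    """List parameters whose best value lies within `tolerance` (as a fraction
    of the range width) of either end of its search range."""
    flagged = []
    for name, value in best_params.items():
        low, high = search_ranges[name]
        margin = tolerance * (high - low)
        if value <= low + margin or value >= high - margin:
            flagged.append(name)
    return flagged

# Hypothetical search ranges and best parameters from a finished study.
search_ranges = {"max_depth": (2, 15), "subsample": (0.5, 1.0), "n_estimators": (50, 500)}
best_params = {"max_depth": 15, "subsample": 0.8, "n_estimators": 260}
flagged = params_at_boundary(best_params, search_ranges)
print(flagged)
```

A flagged parameter is a hint to widen the range and search again, unless the bound is a hard limit of the model, such as a fraction that cannot exceed 1.0.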
One thing to remember: Hyperparameter tuning is an investment with diminishing returns. Most of the improvement usually comes from the first small share of the budget, so set a budget up front, lock in your best settings when the gains flatten, and move on.