Hyperparameter Tuning in Python — Deep Dive

The Search Space Problem

A gradient-boosted tree model like XGBoost has parameters including max_depth, learning_rate, n_estimators, min_child_weight, subsample, colsample_bytree, and reg_alpha. If you tested just 5 values for each of these 7 parameters, grid search would need 5^7 = 78,125 model fits. At 30 seconds per fit, that is 27 days of compute. Clearly, brute force does not scale.
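
The arithmetic above is easy to reproduce for any grid. The following is a minimal sketch using the figures from the paragraph (5 values, 7 parameters, 30 seconds per fit). The 5-fold multiplier at the end is an added assumption, included only to show how cross-validation inflates the count further.

```python
# Grid-search cost: every combination of values is fit once.
values_per_param = 5
n_params = 7
seconds_per_fit = 30

n_fits = values_per_param ** n_params
days = n_fits * seconds_per_fit / 86_400  # 86,400 seconds in a day

print(f"{n_fits:,} fits, about {days:.1f} days")

# Assumption for illustration: 5-fold cross-validation refits each
# combination once per fold, multiplying the total by 5.
cv_folds = 5
total_fits_with_cv = n_fits * cv_folds
print(f"With {cv_folds}-fold CV: {total_fits_with_cv:,} fits")
```

Changing values_per_param or n_params shows how quickly the cost grows: each extra parameter multiplies the total by the number of values tried for it.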

Grid Search with Scikit-Learn

For small search spaces, GridSearchCV remains the simplest option:

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1],
}

grid = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring="f1",
    n_jobs=-1,
    verbose=1,
)
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best F1: {grid.best_score_:.4f}")

The results are stored in grid.cv_results_, a dictionary containing scores for every parameter combination — useful for plotting heatmaps of performance across two parameters.
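
As a rough sketch of how cv_results_ can be reshaped for such a heatmap, the example below fits a deliberately tiny grid on synthetic data and pivots the mean scores into a table. The data from make_classification, the reduced grid, and the names X_demo, y_demo, demo_grid, and heatmap_table are assumptions for illustration, not part of the original workflow.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption); substitute your own X_train, y_train.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=42)

demo_grid = GridSearchCV(
    GradientBoostingClassifier(n_estimators=20, random_state=42),
    {"max_depth": [2, 3], "learning_rate": [0.05, 0.1]},
    cv=3,
    scoring="f1",
)
demo_grid.fit(X_demo, y_demo)

# cv_results_ is a dict of arrays; a DataFrame makes it easy to slice.
results = pd.DataFrame(demo_grid.cv_results_)

# Rows: max_depth, columns: learning_rate, cells: mean cross-validated F1.
heatmap_table = results.pivot_table(
    index="param_max_depth",
    columns="param_learning_rate",
    values="mean_test_score",
)
print(heatmap_table)
```

The resulting table can be passed to any plotting library that draws heatmaps from a two-dimensional array.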

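Random Search with Scikit-Learn

RandomizedSearchCV samples from distributions rather than enumerating a grid, so the number of fits is set by n_iter instead of growing with every parameter you add: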

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint

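param_distributions = {
    "n_estimators": randint(50, 500),
    "max_depth": randint(2, 15),
    "learning_rate": uniform(0.001, 0.3),  # uniform(loc, scale): [0.001, 0.301]
    "subsample": uniform(0.5, 0.5),        # [0.5, 1.0]
    # Fraction of features considered at each split; scikit-learn's
    # counterpart to XGBoost's colsample_* options.
    "max_features": uniform(0.5, 0.5),     # [0.5, 1.0]
}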

random_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions,
    n_iter=100,
    cv=5,
    scoring="f1",
    n_jobs=-1,
    random_state=42,
)
random_search.fit(X_train, y_train)

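Bergstra and Bengio (2012) showed that random search outperforms grid search when only a few of the hyperparameters really matter, because random sampling tries many more distinct values along each important dimension. A widely quoted rule of thumb follows from basic probability: each random trial has a 5 percent chance of landing in the top 5 percent of the search space, so with 60 independent trials the chance of at least one hit is 1 − 0.95^60 ≈ 95 percent. This makes random search remarkably efficient for its simplicity.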
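
The 60-trial figure can be checked in a few lines. This sketch assumes trials are independent uniform draws and that "top 5 percent" is measured as a fraction of the search space. The function names are made up for this example.

```python
import math

def prob_hit_top_fraction(n_trials: int, top_fraction: float = 0.05) -> float:
    """Chance that at least one of n independent draws lands in the top fraction."""
    return 1.0 - (1.0 - top_fraction) ** n_trials

def trials_needed(confidence: float = 0.95, top_fraction: float = 0.05) -> int:
    """Smallest n for which prob_hit_top_fraction(n) reaches the confidence level."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - top_fraction))

print(f"{prob_hit_top_fraction(60):.3f}")  # 0.954
print(trials_needed())                      # 59
```

Note that the guarantee is about rank within the sampled space, not about closeness to the true optimum: if the ranges are poorly chosen, the top 5 percent may still be mediocre.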

Bayesian Optimization with Optuna

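Optuna is one of the most widely used hyperparameter optimization libraries in Python. Its default sampler is the Tree-structured Parzen Estimator (TPE), a Bayesian optimization method that uses the results of earlier trials to decide where to sample next: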

import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 500),
        "max_depth": trial.suggest_int("max_depth", 2, 15),
        "learning_rate": trial.suggest_float("learning_rate", 0.001, 0.3, log=True),
        "subsample": trial.suggest_float("subsample", 0.5, 1.0),
        # min_samples_leaf is scikit-learn's leaf-size constraint
        # (min_child_weight is the XGBoost parameter).
        "min_samples_leaf": trial.suggest_int("min_samples_leaf", 1, 10),
    }
    
    model = GradientBoostingClassifier(**params, random_state=42)
    score = cross_val_score(model, X_train, y_train, cv=5, scoring="f1").mean()
    return score

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100, timeout=3600)

print(f"Best F1: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

Key Optuna features:

  • Pruning: Uses MedianPruner to stop unpromising trials early based on intermediate results.
  • Distributed search: Multiple workers can share a study via a database backend.
  • Visualization: Built-in plots for parameter importance, optimization history, and parallel coordinates.

# Each call returns a Plotly figure; .show() renders it outside a notebook.
optuna.visualization.plot_param_importances(study).show()
optuna.visualization.plot_optimization_history(study).show()

Hyperband and Successive Halving

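Successive halving evaluates many candidates on a small budget and promotes only the best performers to larger budgets. Hyperband runs several such brackets with different starting budgets. Scikit-learn ships successive halving as the experimental HalvingRandomSearchCV: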

# This import has no direct use, but it is required to enable the experimental class.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV

halving = HalvingRandomSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions,
    n_candidates=200,
    factor=3,
    cv=5,
    scoring="f1",
    random_state=42,
)
halving.fit(X_train, y_train)

With factor=3, each round keeps roughly the top third of candidates and triples the resource budget per candidate. By default the resource is the number of training samples. Starting with 200 candidates, round 1 uses the minimum budget on all 200, round 2 uses 3× on 67, round 3 uses 9× on 23, and so on until only a few candidates remain and the best one is selected.
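
The elimination schedule can be sketched in plain Python. This is an illustration of the idea, not scikit-learn's code: it assumes the survivor count is rounded up each round and that there is enough data to fund every round. The function name halving_schedule is made up for this example.

```python
import math

def halving_schedule(n_candidates: int, factor: int):
    """Return a list of (candidates, resource_multiplier) pairs, one per round."""
    schedule = []
    multiplier = 1
    while True:
        schedule.append((n_candidates, multiplier))
        if n_candidates <= factor:
            break  # final round: the best of these candidates wins
        n_candidates = math.ceil(n_candidates / factor)  # keep about 1/factor
        multiplier *= factor                             # give survivors more budget
    return schedule

for round_no, (cands, mult) in enumerate(halving_schedule(200, 3), start=1):
    print(f"round {round_no}: {cands} candidates at {mult}x the minimum budget")
```

For 200 candidates and factor=3 this gives five rounds: 200, 67, 23, 8, and 3 candidates. In practice the search can stop earlier if the budget reaches the full dataset size first.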

Nested Cross-Validation for Unbiased Estimates

If you tune and evaluate on the same CV folds, your reported score is optimistic. Nested CV adds an outer evaluation loop:

from sklearn.model_selection import cross_val_score, StratifiedKFold

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

tuned_model = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=inner_cv,
    scoring="f1",
    n_jobs=-1,
)

nested_scores = cross_val_score(tuned_model, X, y, cv=outer_cv, scoring="f1")
print(f"Nested CV F1: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")

Early Stopping

For iterative models, early stopping halts training when validation performance stops improving, acting as implicit hyperparameter tuning for the number of iterations:

import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=5,
    early_stopping_rounds=50,
    eval_metric="logloss",
)
model.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False,
)
print(f"Stopped at {model.best_iteration} iterations")

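Set n_estimators to a generous upper bound and let early stopping choose the actual count. This removes n_estimators from the search space and saves significant compute. The validation set used for stopping must be separate from the final test set, or the reported score will be optimistic.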
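
The stopping rule itself is simple. The following is a simplified sketch of the patience logic, not XGBoost's implementation. The function name and the validation losses are made up for illustration.

```python
def best_iteration_with_patience(val_losses, patience):
    """Return the index of the best iteration, stopping once `patience`
    consecutive iterations fail to improve on the best validation loss."""
    best_loss = float("inf")
    best_iter = -1
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_iter = loss, i
        elif i - best_iter >= patience:
            break  # no improvement for `patience` iterations in a row
    return best_iter

# Made-up validation losses for illustration only.
losses = [0.70, 0.62, 0.58, 0.57, 0.59, 0.60, 0.61, 0.56]
print(best_iteration_with_patience(losses, patience=3))  # 3
print(best_iteration_with_patience(losses, patience=5))  # 7
```

The two calls show the tradeoff in choosing patience: with patience=3 training stops before the late improvement at index 7 is ever seen, while patience=5 waits long enough to find it.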

Multi-Objective Optimization

Sometimes you want to optimize for both accuracy and inference speed, or precision and recall. Optuna supports multi-objective optimization:

import time

def multi_objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 10, 500),
        "max_depth": trial.suggest_int("max_depth", 2, 10),
    }
    model = GradientBoostingClassifier(**params, random_state=42)

    # Objective 1: mean cross-validated F1 (maximize).
    score = cross_val_score(model, X_train, y_train, cv=3, scoring="f1").mean()

    # Objective 2: wall-clock time of one full fit (minimize).
    # To target inference speed instead, time model.predict on held-out data.
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_time = time.perf_counter() - start

    return score, train_time

study = optuna.create_study(directions=["maximize", "minimize"])
study.optimize(multi_objective, n_trials=50)

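The Pareto front is the set of trials that no other trial beats on both objectives at once. Each point on it represents a different tradeoff between the two objectives, and picking one is a judgment call. Optuna exposes these trials as study.best_trials.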
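
To make the definition concrete, here is a small sketch that filters a list of (score, time) pairs down to its Pareto front. It is a plain-Python illustration of the concept, not Optuna's implementation, and the trial values are made up.

```python
def pareto_front(points):
    """Return the (score, time) pairs not dominated by any other pair.
    Score is maximized and time is minimized."""
    def dominates(a, b):
        # a is at least as good as b on both objectives and strictly better on one
        return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])
    return [p for p in points if not any(dominates(q, p) for q in points)]

# Made-up (F1, seconds) pairs for illustration only.
trial_results = [(0.80, 2.0), (0.85, 5.0), (0.83, 6.0), (0.90, 12.0), (0.78, 1.0)]
front = pareto_front(trial_results)
print(front)
```

Here (0.83, 6.0) drops out because (0.85, 5.0) is both more accurate and faster. The other four pairs each win on at least one objective.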

Production Workflow

A typical tuning pipeline for a production model:

  1. Baseline: Train with defaults, record the score.
  2. Broad search: 100 random trials to identify promising regions.
  3. Focused search: 50 Bayesian trials in the narrowed space.
  4. Nested CV: Get an unbiased estimate of the tuned model.
  5. Final train: Retrain on all data with the best hyperparameters.
  6. Log everything: Store parameters, scores, and artifacts in MLflow or a similar tracking tool.

Tradeoffs Summary

Method            | Trials Needed | Finds Optimum?               | Parallelizable  | Complexity
Grid search       | k^n           | In the grid, yes             | Fully           | Low
Random search     | 50-200        | Approximately                | Fully           | Low
Bayesian (Optuna) | 30-100        | Usually close                | With DB backend | Medium
Hyperband         | 50-200        | If early perf predicts final | Partially       | Medium

Common Pitfalls

  1. Tuning on the test set: Never. Use a separate validation set or CV.
  2. Too narrow a search range: If the best-performing values cluster at a boundary of the range, widen the range and search again.
  3. Ignoring interactions: learning_rate and n_estimators are coupled — low learning rates need more estimators.
  4. Not logging experiments: Without logs, you cannot reproduce or compare results.
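
The boundary check in pitfall 2 is easy to automate. The following is a minimal sketch, assuming numeric parameters with known (low, high) ranges. The function name, the 5 percent tolerance, and the example values are hypothetical.

```python
def params_at_boundary(best_params, search_ranges, tolerance=0.05):
    """Return names of parameters whose best value lies within `tolerance`
    (as a fraction of the range width) of either bound."""
    flagged = []
    for name, (low, high) in search_ranges.items():
        value = best_params[name]
        margin = tolerance * (high - low)
        if value <= low + margin or value >= high - margin:
            flagged.append(name)
    return flagged

# Hypothetical ranges and best values for illustration only.
ranges = {"max_depth": (2, 15), "learning_rate": (0.001, 0.3), "subsample": (0.5, 1.0)}
best = {"max_depth": 15, "learning_rate": 0.08, "subsample": 0.8}
print(params_at_boundary(best, ranges))  # ['max_depth']
```

A flagged parameter is a prompt to look, not an automatic instruction to widen: some bounds are hard limits (subsample cannot exceed 1.0), in which case a value at the edge is simply the answer.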

One thing to remember: Hyperparameter tuning is an investment with diminishing returns. The first 20 percent of your budget usually delivers most of the improvement, so set a budget up front, lock in your best settings, and move on.
