Scikit-Learn Grid Search — Deep Dive

Advanced hyperparameter optimization in scikit-learn — nested CV, pipeline tuning, custom scorers, and scaling strategies for production ML.

Technical foundation

Hyperparameter optimization is an outer-loop optimization problem: finding the configuration that minimizes expected generalization error. GridSearchCV discretizes the hyperparameter space and estimates the performance of each grid point by cross-validation.

The search operates on the assumption that the objective function (CV score as a function of hyperparameters) is smooth enough that a grid at reasonable resolution captures the optimum. This holds for many models but fails when the landscape has sharp, narrow peaks — a known limitation that motivates Bayesian alternatives.
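To make the discretization concrete, the sketch below enumerates a small grid with scikit-learn's ParameterGrid, the helper GridSearchCV uses to expand param_grid. The grid values and the fold count are illustrative assumptions, not recommendations.

```python
from sklearn.model_selection import ParameterGrid

# Illustrative grid: two values for one parameter, three for another.
grid = ParameterGrid({
    'max_depth': [3, 5],
    'learning_rate': [0.01, 0.05, 0.1],
})

n_candidates = len(grid)          # Cartesian product: 2 * 3 = 6 points
n_folds = 5                       # assumed fold count
n_fits = n_candidates * n_folds   # every point is fit once per fold

for params in grid:
    print(params)
print(f"{n_candidates} candidates, {n_fits} fits with {n_folds}-fold CV")
```

Each added parameter multiplies the candidate count, which is why the cost of an exhaustive grid grows so quickly.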

Full GridSearchCV implementation

import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, f1_score

X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=12,
    n_classes=3, weights=[0.6, 0.25, 0.15], random_state=42
)

param_grid = {
    'n_estimators': [100, 200, 400],
    'max_depth': [3, 5, 8],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.8, 1.0],
}

# Total combinations: 3 × 3 × 3 × 2 = 54
# With 5-fold CV: 270 model fits

scorer = make_scorer(f1_score, average='weighted')

grid_search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    scoring=scorer,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1,
    verbose=1,
    return_train_score=True,  # enables overfitting detection
    error_score='raise',      # fail fast on errors
)

grid_search.fit(X, y)

print(f"Best score: {grid_search.best_score_:.4f}")
print(f"Best params: {grid_search.best_params_}")

Analyzing the search landscape

import pandas as pd

results = pd.DataFrame(grid_search.cv_results_)

# Identify which parameters matter most
for param in ['param_n_estimators', 'param_max_depth', 'param_learning_rate', 'param_subsample']:
    grouped = results.groupby(param)['mean_test_score'].agg(['mean', 'std'])
    print(f"\n{param}:")
    print(grouped.round(4))

# Detect overfitting: large gap between train and test scores
results['overfit_gap'] = results['mean_train_score'] - results['mean_test_score']
overfit_risk = results.nlargest(5, 'overfit_gap')[
    ['params', 'mean_train_score', 'mean_test_score', 'overfit_gap']
]
print(f"\nHighest overfitting risk:\n{overfit_risk}")

This analysis reveals parameter sensitivity. If all n_estimators values produce similar scores, don’t waste compute searching that dimension further.
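One way to summarize that sensitivity in a single number per parameter is the spread of its marginal mean scores. The helper below is a hypothetical sketch (marginal_spread is not a scikit-learn function), shown on a hand-made table that stands in for cv_results_.

```python
import pandas as pd

def marginal_spread(results, param_cols, score_col='mean_test_score'):
    """Range (max - min) of the per-value mean score for each parameter."""
    spread = {}
    for col in param_cols:
        means = results.groupby(col)[score_col].mean()
        spread[col] = means.max() - means.min()
    return pd.Series(spread).sort_values(ascending=False)

# Toy stand-in for pd.DataFrame(grid_search.cv_results_); values are made up.
toy = pd.DataFrame({
    'param_max_depth':    [3, 3, 8, 8],
    'param_n_estimators': [100, 400, 100, 400],
    'mean_test_score':    [0.80, 0.81, 0.90, 0.91],
})

spread = marginal_spread(toy, ['param_max_depth', 'param_n_estimators'])
print(spread)
```

A small spread suggests a dimension that can be fixed. Marginal means can hide interactions (for example between learning_rate and n_estimators), so it is worth checking a pivot of the two before dropping either.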

Pipeline hyperparameter tuning

Grid search integrates seamlessly with pipelines using the step__param naming convention:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('svm', SVC()),
])

param_grid = {
    'pca__n_components': [5, 10, 15, 20],
    'svm__C': [0.1, 1, 10, 100],
    'svm__kernel': ['rbf', 'poly'],
    'svm__gamma': ['scale', 'auto'],
}

grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X, y)

This searches across preprocessing and model hyperparameters simultaneously, correctly refitting the scaler and PCA at each fold — preventing data leakage.
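The valid step__param names do not have to be guessed. A minimal sketch, reusing the same three-step pipeline, lists them with get_params and checks one with set_params:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('svm', SVC()),
])

# Every key containing '__' is a nested parameter addressable from param_grid.
tunable = sorted(name for name in pipe.get_params() if '__' in name)
print(tunable)

# set_params accepts the same names; a misspelled name raises ValueError.
pipe.set_params(pca__n_components=10, svm__C=1.0)
```

Running this once before writing a grid catches typos early, before any fitting starts.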

Multiple parameter grids

Search different parameter spaces for different model configurations:

param_grid = [
    # RBF kernel: tune C and gamma
    {
        'svm__kernel': ['rbf'],
        'svm__C': [0.1, 1, 10],
        'svm__gamma': [0.001, 0.01, 0.1],
    },
    # Polynomial kernel: tune C and degree
    {
        'svm__kernel': ['poly'],
        'svm__C': [0.1, 1, 10],
        'svm__degree': [2, 3, 4],
    },
]

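Passing a list of dicts avoids testing meaningless combinations (e.g., degree with an RBF kernel, which ignores it). Note that gamma is not such a case: scikit-learn's polynomial kernel does use gamma, so it can be added to the second dict if needed.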
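The saving is easy to quantify with ParameterGrid. The sketch below compares the list-of-dicts grid with a hypothetical single-dict version that crosses every parameter with every kernel:

```python
from sklearn.model_selection import ParameterGrid

split_grid = [
    {'svm__kernel': ['rbf'], 'svm__C': [0.1, 1, 10],
     'svm__gamma': [0.001, 0.01, 0.1]},
    {'svm__kernel': ['poly'], 'svm__C': [0.1, 1, 10],
     'svm__degree': [2, 3, 4]},
]

# Hypothetical single-dict grid: degree is also varied for the RBF kernel,
# where it has no effect, so those fits are redundant.
flat_grid = {
    'svm__kernel': ['rbf', 'poly'],
    'svm__C': [0.1, 1, 10],
    'svm__gamma': [0.001, 0.01, 0.1],
    'svm__degree': [2, 3, 4],
}

n_split = len(ParameterGrid(split_grid))  # 9 + 9
n_flat = len(ParameterGrid(flat_grid))    # 2 * 3 * 3 * 3
print(n_split, n_flat)
```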

Nested cross-validation

Standard grid search uses the same CV splits for both selecting hyperparameters and estimating performance. This produces optimistically biased estimates. Nested CV adds an outer loop:

from sklearn.model_selection import cross_val_score

# Inner CV: hyperparameter selection
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid={'n_estimators': [100, 200], 'max_depth': [3, 5, 8]},
    cv=inner_cv, scoring='f1_weighted', n_jobs=-1
)

# Outer CV: unbiased performance estimation
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv, scoring='f1_weighted')

print(f"Nested CV F1: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")

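The outer score is a nearly unbiased estimate of how well the entire tuning + training procedure generalizes. Use it for model comparison. Each outer fold may select different hyperparameters, so for deployment, refit the grid search on all available data and use the resulting best_params_.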
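The two roles can be kept apart in code: estimate with nested CV, then run the same search once more on all data to get the model to ship. This is a minimal sketch on a small synthetic dataset with a LogisticRegression stand-in (both are assumptions chosen to keep it fast).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

X_small, y_small = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 0.1, 1, 10]},
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    scoring='f1_weighted',
)

# Step 1: nested CV scores the whole tune-then-train procedure.
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
nested_scores = cross_val_score(search, X_small, y_small,
                                cv=outer_cv, scoring='f1_weighted')

# Step 2: for deployment, run the same search on all available data.
search.fit(X_small, y_small)
final_model = search.best_estimator_  # refit on all of X_small (refit=True)

print(f"Nested estimate: {nested_scores.mean():.4f}")
print(f"Deployed params: {search.best_params_}")
```

The number to report is the nested estimate, not search.best_score_ from step 2.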

Custom scoring functions

When business logic doesn’t map to standard metrics:

def profit_score(y_true, y_pred):
    """Score based on business value: TP=$100, FP=-$30, FN=-$80.

    Assumes binary labels with 1 as the positive class.
    """
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    return tp * 100 - fp * 30 - fn * 80

# greater_is_better=True is the default: higher profit is better
profit_scorer = make_scorer(profit_score)

# estimator and param_grid stand for any binary classifier and its grid
grid_search = GridSearchCV(
    estimator, param_grid, scoring=profit_scorer, cv=5
)
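A runnable version of the same idea is sketched below, with a hand check of the profit arithmetic. The binary dataset, the LogisticRegression estimator, and the C grid are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV

def profit_score(y_true, y_pred):
    """Total profit for binary labels: TP=$100, FP=-$30, FN=-$80."""
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    return tp * 100 - fp * 30 - fn * 80

# Hand check: 2 true positives, 1 false positive, 1 false negative.
y_true_demo = np.array([1, 1, 0, 0, 1])
y_pred_demo = np.array([1, 0, 1, 0, 1])
demo_profit = profit_score(y_true_demo, y_pred_demo)  # 200 - 30 - 80

# Assumed binary problem with a 20% positive class.
X_bin, y_bin = make_classification(n_samples=400, n_features=8,
                                   weights=[0.8, 0.2], random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000, class_weight='balanced'),
    param_grid={'C': [0.1, 1, 10]},
    scoring=make_scorer(profit_score),
    cv=3,
)
search.fit(X_bin, y_bin)
print(demo_profit, search.best_params_)
```

Because the score is a total over each validation fold, it scales with fold size. Dividing by len(y_true) inside profit_score gives a per-sample profit that stays comparable across different cv settings.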

RandomizedSearchCV for large spaces

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform, loguniform

param_distributions = {
    'n_estimators': randint(50, 1000),
    'max_depth': randint(2, 30),
    'learning_rate': loguniform(1e-3, 1e-1),  # log-uniform for learning rates
    'subsample': uniform(0.5, 0.5),
    'min_samples_leaf': randint(1, 50),
    'max_features': uniform(0.3, 0.7),
}

random_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=100,  # fixed budget: 100 random combinations
    cv=5, scoring='f1_weighted', n_jobs=-1, random_state=42
)

random_search.fit(X, y)

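Key insight: loguniform is essential for parameters like learning rate where the meaningful range spans orders of magnitude. Sampling uniformly over [0.001, 0.1] puts about 91% of samples in the [0.01, 0.1] range and leaves the lowest decade barely explored. Note that scipy's uniform takes (loc, scale), so that interval is written uniform(0.001, 0.099).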
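The difference is easy to verify empirically. This sketch draws from both distributions over the same interval and measures the share of samples that land in the lowest decade, [0.001, 0.01); the sample size and seed are arbitrary.

```python
from scipy.stats import loguniform, uniform

n = 100_000

# scipy's uniform takes (loc, scale), so this covers [0.001, 0.1].
uni = uniform(loc=0.001, scale=0.099).rvs(size=n, random_state=0)
log = loguniform(1e-3, 1e-1).rvs(size=n, random_state=0)

frac_uni_low = (uni < 0.01).mean()  # share of draws below 0.01
frac_log_low = (log < 0.01).mean()

print(f"uniform:    {frac_uni_low:.3f} of samples below 0.01")
print(f"loguniform: {frac_log_low:.3f} of samples below 0.01")
```

The uniform share should come out near 0.09 and the log-uniform share near 0.5, since each decade gets equal probability under a log-uniform distribution.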

HalvingGridSearchCV: successive halving

Scikit-learn 0.24+ offers a faster alternative that uses increasing subsets of data. It is still experimental, so it must be enabled with an explicit import:

from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (import enables the experimental API)
from sklearn.model_selection import HalvingGridSearchCV

halving_search = HalvingGridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid={
        'n_estimators': [50, 100, 200, 400],
        'max_depth': [3, 5, 8, 12],
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
    },
    cv=5, scoring='f1_weighted', n_jobs=-1,
    factor=3,       # eliminate 2/3 of candidates each round
    min_resources=100,  # minimum samples in first round
    random_state=42
)

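This evaluates all candidates on a small data subset first, eliminates poor performers, and gives more data to survivors. For the 64 candidates above on 5,000 samples, instead of running all 64 on full data, it runs roughly 64 on 100 samples → 22 on 300 → 8 on 900 → 3 on 2700, because the number of candidates kept each round is rounded up. The final round does not necessarily use the full dataset.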
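The schedule can be approximated with a few lines of arithmetic. The function below is a simplified model, not scikit-learn code: it assumes the default settings (no aggressive elimination, resource = number of samples) and that the number of survivors per round is the ceiling of candidates / factor.

```python
from math import ceil, floor, log

def halving_schedule(n_candidates, n_samples, factor=3, min_resources=100):
    """Approximate (candidates, samples) per round of successive halving."""
    # Rounds needed to shrink the candidate pool, and rounds the data allows.
    n_required = 1 + floor(log(n_candidates, factor))
    n_possible = 1 + floor(log(n_samples // min_resources, factor))

    schedule = []
    for itr in range(min(n_required, n_possible)):
        resources = min(min_resources * factor ** itr, n_samples)
        schedule.append((n_candidates, resources))
        n_candidates = ceil(n_candidates / factor)  # survivors, rounded up
    return schedule

schedule = halving_schedule(n_candidates=64, n_samples=5000)
for n_cand, n_res in schedule:
    print(f"{n_cand:>3} candidates on {n_res} samples")
```

After fitting, the real values are available on the search object as n_candidates_ and n_resources_, which is the authoritative check for a given scikit-learn version.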

Scaling strategies

For production-scale tuning:

Coarse-to-fine: Run a coarse grid first, identify the promising region, then refine with a narrower grid around the best parameters
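Early stopping: For iterative models (boosting, neural networks), stop training once the validation score stops improving, so poor configurations do not consume the full iteration budget. In scikit-learn's gradient boosting this is n_iter_no_change with validation_fraction; early_stopping_rounds is the XGBoost/LightGBM equivalent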
Feature subsampling: Tune on a random feature subset to reduce dimensionality, then validate the best params on full features
Warm starting: Some estimators support warm_start=True, allowing you to incrementally add trees/iterations instead of retraining from scratch
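The coarse-to-fine step can be sketched as a small helper that builds a log-spaced grid around the coarse winner. refine_log_grid is a hypothetical name, and the choice of spanning half a coarse step on each side is an assumption.

```python
import numpy as np

def refine_log_grid(best_value, coarse_step=10.0, n_points=5):
    """Log-spaced grid spanning half a coarse step either side of best_value."""
    half = np.sqrt(coarse_step)  # half a coarse step in log space
    return np.geomspace(best_value / half, best_value * half, n_points)

coarse = [0.001, 0.01, 0.1]   # coarse grid, one value per decade
best_coarse = 0.01            # assumed winner of the coarse search
fine = refine_log_grid(best_coarse)

print(np.round(fine, 5))
```

The fine grid stays strictly between the neighbouring coarse points, so the second search only spends compute in the region the first one flagged.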

Common pitfalls

Data leakage through preprocessing: If you scale or encode data before grid search, those statistics leak from validation folds into training. Always include preprocessing inside the pipeline.

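Too many folds on imbalanced data: With 10-fold CV on a small dataset with a 2% positive class, unstratified folds may contain zero positive examples. Use StratifiedKFold (already the default when cv is an integer and the estimator is a classifier) and keep the fold count low enough that each fold holds several positives (3-5 folds for imbalanced data).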
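Counting positives per validation fold makes the problem visible. This sketch uses a made-up label vector (500 samples, 2% positive, sorted so the contrast is stark) and compares plain KFold with StratifiedKFold.

```python
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold

# 490 negatives followed by 10 positives; features are irrelevant here.
y_imb = np.array([0] * 490 + [1] * 10)
X_imb = np.zeros((500, 1))

def positives_per_fold(splitter):
    """Number of positive examples in each validation fold."""
    return [int(y_imb[test_idx].sum())
            for _, test_idx in splitter.split(X_imb, y_imb)]

plain = positives_per_fold(KFold(n_splits=10))
stratified = positives_per_fold(StratifiedKFold(n_splits=10))

print("KFold:          ", plain)
print("StratifiedKFold:", stratified)
```

Even with stratification, one positive per fold makes metrics like recall very noisy, which is the argument for fewer folds.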

Ignoring variance: A parameter combination with mean score 0.85 ± 0.08 is worse than one scoring 0.83 ± 0.01 in most production scenarios. Check std_test_score in results.
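One way to act on this is to select by a pessimistic score such as mean minus one standard deviation. GridSearchCV accepts a callable for refit that receives cv_results_ and returns the index of the chosen candidate; the selection rule below and its name are illustrative assumptions.

```python
import numpy as np

def best_lower_bound_index(cv_results, k=1.0):
    """Index of the candidate with the highest mean - k * std test score."""
    mean = np.asarray(cv_results['mean_test_score'])
    std = np.asarray(cv_results['std_test_score'])
    return int(np.argmax(mean - k * std))

# Toy stand-in for cv_results_ using the two candidates from the text.
toy_results = {
    'mean_test_score': [0.85, 0.83],
    'std_test_score':  [0.08, 0.01],
}
chosen = best_lower_bound_index(toy_results)
print(chosen)  # index of the 0.83 ± 0.01 candidate
```

It would be passed as GridSearchCV(..., refit=best_lower_bound_index). With a callable refit, best_index_ and best_params_ follow this rule, and best_score_ is not set.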

Grid too fine early: Searching learning_rate in [0.01, 0.02, 0.03, ..., 0.10] before knowing the right order of magnitude wastes compute. Start with [0.001, 0.01, 0.1].

Tradeoffs

Method	Compute Cost	Coverage	Best For
GridSearchCV	High (exponential)	Complete	Small grids, few parameters
RandomizedSearchCV	Controllable	Probabilistic	Large spaces, continuous params
HalvingGridSearchCV	Medium	Adaptive	Many candidates, large datasets
Bayesian (Optuna/skopt)	Low-medium	Guided	Complex spaces, expensive models

One thing to remember: The compute budget for hyperparameter search should be proportional to the performance gap it can close. If coarse search already reaches 0.94 F1, spending 10x compute to squeeze out 0.945 is rarely worth it outside competition settings.
