Scikit-Learn Grid Search — Deep Dive
Technical foundation
Hyperparameter optimization is a meta-learning problem: finding the configuration that minimizes expected generalization error. GridSearchCV discretizes the hyperparameter space and evaluates each point via cross-validated performance estimation.
The search operates on the assumption that the objective function (CV score as a function of hyperparameters) is smooth enough that a grid at reasonable resolution captures the optimum. This holds for many models but fails when the landscape has sharp, narrow peaks — a known limitation that motivates Bayesian alternatives.
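The narrow-peak failure mode can be sketched with a made-up one-dimensional objective. The function `toy_cv_score` and its peak positions are illustrative assumptions, not measurements from a real model. A coarse grid lands on the broad bump and never sees the sharp peak, while a much finer grid finds it.

```python
import numpy as np

def toy_cv_score(x):
    """Hypothetical CV-score landscape: a broad bump plus a sharp, narrow peak."""
    broad = 0.5 * np.exp(-((x - 0.65) / 0.2) ** 2)
    narrow = 1.0 * np.exp(-((x - 0.37) / 0.01) ** 2)
    return broad + narrow

coarse_grid = np.linspace(0.0, 1.0, 6)     # 0.0, 0.2, ..., 1.0
fine_grid = np.linspace(0.0, 1.0, 1001)    # step of 0.001

coarse_scores = toy_cv_score(coarse_grid)
fine_scores = toy_cv_score(fine_grid)

coarse_best_x = coarse_grid[np.argmax(coarse_scores)]
fine_best_x = fine_grid[np.argmax(fine_scores)]

print(f"coarse grid optimum: x={coarse_best_x:.2f}, score={coarse_scores.max():.3f}")
print(f"fine grid optimum:   x={fine_best_x:.2f}, score={fine_scores.max():.3f}")
```

The coarse grid settles on the broad bump because none of its six points falls inside the narrow peak. Adaptive methods try to avoid this by spending evaluations where earlier results look promising.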
Full GridSearchCV implementation
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, f1_score
X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=12,
    n_classes=3, weights=[0.6, 0.25, 0.15], random_state=42
)
param_grid = {
    'n_estimators': [100, 200, 400],
    'max_depth': [3, 5, 8],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.8, 1.0],
}
# Total combinations: 3 × 3 × 3 × 2 = 54
# With 5-fold CV: 270 model fits
scorer = make_scorer(f1_score, average='weighted')
grid_search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_grid=param_grid,
    scoring=scorer,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    n_jobs=-1,
    verbose=1,
    return_train_score=True,  # enables overfitting detection
    error_score='raise',      # fail fast on errors
)
grid_search.fit(X, y)
print(f"Best score: {grid_search.best_score_:.4f}")
print(f"Best params: {grid_search.best_params_}")
Analyzing the search landscape
import pandas as pd
results = pd.DataFrame(grid_search.cv_results_)
# Identify which parameters matter most
for param in ['param_n_estimators', 'param_max_depth', 'param_learning_rate', 'param_subsample']:
    grouped = results.groupby(param)['mean_test_score'].agg(['mean', 'std'])
    print(f"\n{param}:")
    print(grouped.round(4))
# Detect overfitting: large gap between train and test scores
results['overfit_gap'] = results['mean_train_score'] - results['mean_test_score']
overfit_risk = results.nlargest(5, 'overfit_gap')[
    ['params', 'mean_train_score', 'mean_test_score', 'overfit_gap']
]
print(f"\nHighest overfitting risk:\n{overfit_risk}")
This analysis reveals parameter sensitivity. If all n_estimators values produce similar scores, don’t waste compute searching that dimension further.
Pipeline hyperparameter tuning
Grid search integrates seamlessly with pipelines using the step__param naming convention:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('svm', SVC()),
])
param_grid = {
    'pca__n_components': [5, 10, 15, 20],
    'svm__C': [0.1, 1, 10, 100],
    'svm__kernel': ['rbf', 'poly'],
    'svm__gamma': ['scale', 'auto'],
}
grid_search = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', n_jobs=-1)
grid_search.fit(X, y)
This searches across preprocessing and model hyperparameters simultaneously, correctly refitting the scaler and PCA at each fold — preventing data leakage.
Multiple parameter grids
Search different parameter spaces for different model configurations:
param_grid = [
    # RBF kernel: tune C and gamma
    {
        'svm__kernel': ['rbf'],
        'svm__C': [0.1, 1, 10],
        'svm__gamma': [0.001, 0.01, 0.1],
    },
    # Polynomial kernel: tune C and degree
    {
        'svm__kernel': ['poly'],
        'svm__C': [0.1, 1, 10],
        'svm__degree': [2, 3, 4],
    },
]
Passing a list of dicts avoids testing meaningless combinations (e.g., gamma with a polynomial kernel).
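To see how many candidates the split grid saves, the sketch below counts both variants with `ParameterGrid`, the helper scikit-learn uses to enumerate grid points. The `full_grid` dict is a hypothetical single-dict alternative written only for comparison.

```python
from sklearn.model_selection import ParameterGrid

# The list-of-dicts grid from above: one sub-grid per kernel.
split_grid = [
    {
        'svm__kernel': ['rbf'],
        'svm__C': [0.1, 1, 10],
        'svm__gamma': [0.001, 0.01, 0.1],
    },
    {
        'svm__kernel': ['poly'],
        'svm__C': [0.1, 1, 10],
        'svm__degree': [2, 3, 4],
    },
]

# Hypothetical single-dict version: crosses every parameter with every kernel.
full_grid = {
    'svm__kernel': ['rbf', 'poly'],
    'svm__C': [0.1, 1, 10],
    'svm__gamma': [0.001, 0.01, 0.1],
    'svm__degree': [2, 3, 4],
}

n_split = len(ParameterGrid(split_grid))  # 9 + 9 candidates
n_full = len(ParameterGrid(full_grid))    # 2 * 3 * 3 * 3 candidates

print(f"split grid: {n_split} candidates, full cross product: {n_full} candidates")
```

With 5-fold CV the split grid needs 90 fits instead of 270.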
Nested cross-validation
Standard grid search uses the same CV splits for both selecting hyperparameters and estimating performance. This produces optimistically biased estimates. Nested CV adds an outer loop:
from sklearn.model_selection import cross_val_score
# Inner CV: hyperparameter selection
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid={'n_estimators': [100, 200], 'max_depth': [3, 5, 8]},
    cv=inner_cv, scoring='f1_weighted', n_jobs=-1
)
# Outer CV: unbiased performance estimation
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv, scoring='f1_weighted')
print(f"Nested CV F1: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
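The outer score is an approximately unbiased estimate of how well the entire tuning + training procedure generalizes. Use it for model comparison. Each outer fold may select different hyperparameters, so for final deployment rerun the grid search on the full dataset and use its best_params_.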
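The following is a small sketch of how one might inspect what each outer fold selected and then fit the deployment model. It assumes a deliberately tiny dataset and a `LogisticRegression` grid so that it runs quickly. The names `X_demo`, `y_demo`, and `search` are illustrative and not part of the example above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate

# Small synthetic problem so the sketch runs in seconds (an assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={'C': [0.01, 0.1, 1, 10]},
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),  # inner loop
    scoring='accuracy',
)

# Outer loop: return_estimator=True keeps the fitted search from each fold.
outer = cross_validate(
    search, X_demo, y_demo,
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=0),
    scoring='accuracy',
    return_estimator=True,
)

per_fold_params = [fitted.best_params_ for fitted in outer['estimator']]
print("Outer scores:", outer['test_score'].round(3))
print("Params chosen per outer fold:", per_fold_params)

# For deployment, rerun the search on all data and keep its refit model.
final_search = search.fit(X_demo, y_demo)
print("Deployment params:", final_search.best_params_)
```

If the per-fold choices disagree wildly, that is a hint the search is unstable and the grid or the data size deserves a second look.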
Custom scoring functions
When business logic doesn’t map to standard metrics:
def profit_score(y_true, y_pred):
    """Score based on business value: TP=$100, FP=-$30, FN=-$80."""
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    return tp * 100 - fp * 30 - fn * 80
profit_scorer = make_scorer(profit_score)
# `estimator` and `param_grid` stand in for your own binary classifier and grid
grid_search = GridSearchCV(
    estimator, param_grid, scoring=profit_scorer, cv=5
)
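As a quick sanity check of the arithmetic, the sketch below evaluates the same profit function on a hand-made set of labels, then through `make_scorer` with a `DummyClassifier` that always predicts the positive class. The label arrays and the dummy model are assumptions chosen so the result can be verified by hand.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import make_scorer

def profit_score(y_true, y_pred):
    """Score based on business value: TP=$100, FP=-$30, FN=-$80."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    fp = ((y_pred == 1) & (y_true == 0)).sum()
    fn = ((y_pred == 0) & (y_true == 1)).sum()
    return tp * 100 - fp * 30 - fn * 80

y_true = np.array([1, 1, 1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0])

# 3 true positives, 1 false positive, 1 false negative
manual_profit = profit_score(y_true, y_pred)  # 3*100 - 1*30 - 1*80

# The scorer wraps the function into the (estimator, X, y) call GridSearchCV expects.
profit_scorer = make_scorer(profit_score)
X_dummy = np.zeros((len(y_true), 1))
always_positive = DummyClassifier(strategy='constant', constant=1).fit(X_dummy, y_true)

# Predicting 1 everywhere: 4 true positives, 4 false positives, 0 false negatives
scorer_profit = profit_scorer(always_positive, X_dummy, y_true)  # 4*100 - 4*30

print(manual_profit, scorer_profit)
```

Because the function compares labels against 0 and 1, it only makes sense for binary targets. The score is also a sum rather than a mean, so it scales with fold size.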
RandomizedSearchCV for large spaces
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform, loguniform
param_distributions = {
    'n_estimators': randint(50, 1000),
    'max_depth': randint(2, 30),
    'learning_rate': loguniform(1e-3, 1e-1),  # log-uniform for learning rates
    'subsample': uniform(0.5, 0.5),           # uniform(loc, scale): [0.5, 1.0]
    'min_samples_leaf': randint(1, 50),
    'max_features': uniform(0.3, 0.7),        # [0.3, 1.0]
}
random_search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=100,  # fixed budget: 100 random combinations
    cv=5, scoring='f1_weighted', n_jobs=-1, random_state=42
)
random_search.fit(X, y)
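Key insight: loguniform is essential for parameters like learning rate where the meaningful range spans orders of magnitude. Sampling uniformly over [0.001, 0.1] puts about 91% of samples in the [0.01, 0.1] range and leaves only about 9% for the entire lower decade. Note that scipy's uniform takes (loc, scale), so that interval is written uniform(0.001, 0.099).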
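The difference between the two distributions is easy to check empirically. This sketch draws a large sample from each over the interval [0.001, 0.1] and measures how much of it falls in the lower decade, below 0.01. The sample size and seed are arbitrary choices.

```python
from scipy.stats import loguniform, uniform

n_draws = 100_000

# Both distributions cover [0.001, 0.1]; scipy's uniform takes (loc, scale).
log_samples = loguniform(1e-3, 1e-1).rvs(size=n_draws, random_state=0)
lin_samples = uniform(loc=0.001, scale=0.099).rvs(size=n_draws, random_state=0)

frac_low_log = (log_samples < 0.01).mean()  # share in the lower decade
frac_low_lin = (lin_samples < 0.01).mean()

print(f"loguniform: {frac_low_log:.1%} of draws below 0.01")
print(f"uniform:    {frac_low_lin:.1%} of draws below 0.01")
```

The log-uniform sampler splits its draws roughly evenly between the two decades, while the linear sampler spends about one draw in eleven on the lower one.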
HalvingGridSearchCV: successive halving
Scikit-learn 0.24+ offers an experimental, faster alternative that evaluates candidates on increasing subsets of data:
from sklearn.experimental import enable_halving_search_cv  # noqa: F401 (required to enable the experimental API)
from sklearn.model_selection import HalvingGridSearchCV
halving_search = HalvingGridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid={
        'n_estimators': [50, 100, 200, 400],
        'max_depth': [3, 5, 8, 12],
        'learning_rate': [0.01, 0.05, 0.1, 0.2],
    },
    cv=5, scoring='f1_weighted', n_jobs=-1,
    factor=3,           # eliminate 2/3 of candidates each round
    min_resources=100,  # minimum samples in first round
    random_state=42
)
This evaluates all candidates on a small data subset first, eliminates poor performers, and gives more data to survivors. For the 64 candidates above on 5,000 samples, instead of running all 64 on full data, it runs roughly 64 on 100 samples → 22 on 300 → 8 on 900 → 3 on 2,700 (the number of survivors is rounded up at each round).
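The elimination schedule can be approximated with a few lines of plain Python. This is a simplified sketch, not scikit-learn's implementation: it assumes survivors are computed with ceiling division and that the process stops once the next round would need more samples than are available. The helper name `halving_schedule` is hypothetical. Under ceiling rounding the counts come out slightly higher than with flooring.

```python
from math import ceil

def halving_schedule(n_candidates, factor, min_resources, max_resources):
    """Return a list of (candidates, samples) pairs, one per halving round."""
    schedule = []
    resources = min_resources
    while resources <= max_resources:
        schedule.append((n_candidates, resources))
        if n_candidates <= 1:
            break  # nothing left to eliminate
        n_candidates = ceil(n_candidates / factor)  # keep the top 1/factor
        resources *= factor                         # give survivors more data
    return schedule

# 4 x 4 x 4 = 64 candidates, factor=3, 100 starting samples, 5000 total
schedule = halving_schedule(64, factor=3, min_resources=100, max_resources=5000)
for n_cand, n_samples in schedule:
    print(f"{n_cand:>3} candidates on {n_samples} samples")
```

After fitting a real `HalvingGridSearchCV`, the `n_candidates_` and `n_resources_` attributes report the schedule that was actually used, which is the authoritative source.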
Scaling strategies
For production-scale tuning:
- Coarse-to-fine: Run a coarse grid first, identify the promising region, then refine with a narrower grid around the best parameters
- Early stopping: For iterative models (boosting, neural networks), use early stopping (for example early_stopping_rounds in XGBoost, or n_iter_no_change in scikit-learn's gradient boosting) to cut training short once the validation score stops improving
- Feature subsampling: Tune on a random feature subset to reduce dimensionality, then validate the best params on full features
- Warm starting: Some estimators support warm_start=True, allowing you to incrementally add trees/iterations instead of retraining from scratch
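The coarse-to-fine strategy from the list above can be sketched as a small helper that builds a log-spaced grid around the coarse winner. The helper name `refine_log_grid`, the half-decade window, and the assumed coarse result of 0.01 are all illustrative choices.

```python
import numpy as np

def refine_log_grid(best_value, decades=0.5, num=5):
    """Log-spaced grid spanning `decades` orders of magnitude on each side of best_value."""
    low = best_value / 10 ** decades
    high = best_value * 10 ** decades
    return np.geomspace(low, high, num)

coarse_grid = [0.001, 0.01, 0.1]
coarse_best = 0.01  # assumed winner of the coarse search

fine_grid = refine_log_grid(coarse_best)
print(np.round(fine_grid, 5))
```

The refined values stay strictly between the coarse neighbours (0.001 and 0.1), so the second search explores only the region the first one pointed at. If the coarse winner sits on the edge of its grid, it is usually safer to extend the coarse grid before refining.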
Common pitfalls
Data leakage through preprocessing: If you scale or encode data before grid search, those statistics leak from validation folds into training. Always include preprocessing inside the pipeline.
Too many folds on imbalanced data: With 10-fold CV on a dataset with 2% positive class, some folds may have zero positive examples. Use StratifiedKFold and keep folds reasonable (3-5 for imbalanced data).
Ignoring variance: A parameter combination with mean score 0.85 ± 0.08 is worse than one scoring 0.83 ± 0.01 in most production scenarios. Check std_test_score in results.
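One way to act on this is to select by a pessimistic score such as mean minus one standard deviation. The sketch below applies that rule to a toy results dict holding the two candidates from the example. The function name `pick_stable_candidate` and the one-sigma penalty are assumptions, not a scikit-learn default.

```python
import numpy as np

def pick_stable_candidate(cv_results, n_std=1.0):
    """Return the index of the candidate with the best (mean - n_std * std) test score."""
    mean = np.asarray(cv_results['mean_test_score'])
    std = np.asarray(cv_results['std_test_score'])
    return int(np.argmax(mean - n_std * std))

# Toy stand-in for grid_search.cv_results_ with the two candidates discussed above.
toy_results = {
    'mean_test_score': [0.85, 0.83],
    'std_test_score': [0.08, 0.01],
}

best_index = pick_stable_candidate(toy_results)
print(f"Selected candidate index: {best_index}")
```

GridSearchCV accepts a callable for `refit` that receives `cv_results_` and returns the chosen index, so a function like this can be passed as `refit=pick_stable_candidate`. In that mode `best_score_` is not set.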
Grid too fine early: Searching learning_rate in [0.01, 0.02, 0.03, ..., 0.10] before knowing the right order of magnitude wastes compute. Start with [0.001, 0.01, 0.1].
Tradeoffs
| Method | Compute Cost | Coverage | Best For |
|---|---|---|---|
| GridSearchCV | High (exponential) | Complete | Small grids, few parameters |
| RandomizedSearchCV | Controllable | Probabilistic | Large spaces, continuous params |
| HalvingGridSearchCV | Medium | Adaptive | Many candidates, large datasets |
| Bayesian (Optuna/skopt) | Low-medium | Guided | Complex spaces, expensive models |
One thing to remember: The compute budget for hyperparameter search should be proportional to the performance gap it can close. If coarse search already reaches 0.94 F1, spending 10x compute to squeeze out 0.945 is rarely worth it outside competition settings.