Cross-Validation in Python — Deep Dive
The Statistical Foundation
Cross-validation estimates the expected prediction error: the average loss a trained model incurs on new data drawn from the same distribution as the training sample. Formally, we want to approximate:
E[L(Y, f̂(X))]
where L is a loss function, X is the input, Y is the true target, and f̂ is the model trained on a sample. A single hold-out set gives one noisy estimate. By averaging over k disjoint test sets, we reduce variance at the cost of increased computation, since the model is fit k times.
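The bias-variance tradeoff in CV depends on k. Large k (e.g., leave-one-out) has low bias because each model trains on nearly all the data, but the estimate is usually described as having high variance: the training sets overlap almost completely, so the per-fold errors are strongly correlated. Small k (e.g., 2-fold) has higher bias because each model trains on a smaller share of the data (half of it for 2-fold) and therefore tends to look worse than a model fit on everything. For most practical work, k = 5 or k = 10 is a reasonable default.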
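To make the averaging concrete, here is a minimal sketch of k-fold CV written as an explicit loop. The synthetic dataset, the logistic regression model, and the 0-1 loss are assumptions chosen for illustration; any model and loss could take their place.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import KFold

# Synthetic data, used only for illustration
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_losses = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])  # model trained without this fold
    y_pred = model.predict(X[test_idx])
    fold_losses.append(zero_one_loss(y[test_idx], y_pred))  # mean loss on this fold

# The CV estimate is the average over the k disjoint test sets
cv_estimate = float(np.mean(fold_losses))
print(f"Per-fold 0-1 loss: {np.round(fold_losses, 3)}")
print(f"CV estimate of expected error: {cv_estimate:.3f}")
```

Every row lands in exactly one test fold, so each observation contributes to the estimate once.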
Scikit-Learn Implementations
Basic K-Fold
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
Stratified K-Fold
Essential for imbalanced targets:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring="f1")
Each fold preserves the original class ratio. For a dataset with 5 percent positive class, each fold will contain approximately 5 percent positives.
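The class-ratio claim is easy to check directly. The sketch below assumes a hypothetical target with 50 positives in 1,000 rows and prints the positive rate of each test fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced target: 50 positives out of 1000 rows (5 percent)
y = np.array([1] * 50 + [0] * 950)
X = np.zeros((len(y), 1))  # feature values do not affect how the splitter stratifies

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
test_rates = [float(y[test_idx].mean()) for _, test_idx in skf.split(X, y)]
print([f"{rate:.3f}" for rate in test_rates])
```

With a plain KFold on the same data, the positive rate would vary from fold to fold, and on very rare classes a fold could end up with no positives at all.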
Grouped K-Fold
When samples from the same group (e.g., same patient, same user) must not appear in both train and test:
from sklearn.model_selection import GroupKFold
gkf = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=gkf, groups=group_ids, scoring="roc_auc")
This prevents data leakage when observations within a group are correlated. Medical studies and user-behavior datasets almost always need grouped splitting.
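A quick sanity check, sketched under the assumption of 20 hypothetical patients with 10 visits each, is to confirm that no group ID ever appears on both sides of a split.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
# Hypothetical setup: 20 patients with 10 visits each
group_ids = np.repeat(np.arange(20), 10)
X = rng.normal(size=(len(group_ids), 3))
y = rng.integers(0, 2, size=len(group_ids))

gkf = GroupKFold(n_splits=5)
overlaps = []
for train_idx, test_idx in gkf.split(X, y, groups=group_ids):
    shared = set(group_ids[train_idx]) & set(group_ids[test_idx])
    overlaps.append(len(shared))  # number of groups present on both sides
print(f"Groups shared between train and test, per fold: {overlaps}")
```

Running the same check on your own splitter before trusting its scores is cheap insurance against group leakage.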
Time-Series Split
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    print(f"Train: {train_idx[0]}-{train_idx[-1]}, Test: {test_idx[0]}-{test_idx[-1]}")
Training windows grow incrementally. If you want a fixed-size rolling window instead, implement a custom splitter or use max_train_size:
tscv = TimeSeriesSplit(n_splits=5, max_train_size=1000)
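The difference between the two window types shows up in the training-set sizes. This sketch uses 60 illustrative rows and a cap of 20, both arbitrary choices.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(60).reshape(-1, 1)  # 60 time-ordered rows, illustrative only

expanding = TimeSeriesSplit(n_splits=5)
rolling = TimeSeriesSplit(n_splits=5, max_train_size=20)

expanding_sizes = [len(train) for train, _ in expanding.split(X)]
rolling_sizes = [len(train) for train, _ in rolling.split(X)]
print("Expanding window train sizes:", expanding_sizes)
print("Rolling window train sizes:  ", rolling_sizes)
```

Note that max_train_size is only an upper bound: early folds can still be smaller than the cap when not enough history is available yet.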
Nested Cross-Validation
When you use the same CV run both to select hyperparameters and to evaluate the model, the resulting score is optimistic, because the folds used for scoring also guided the tuning. Nested CV solves this with two loops:
- Inner loop: tunes hyperparameters (e.g., GridSearchCV).
- Outer loop: evaluates the tuned model on held-out data.
from sklearn.model_selection import GridSearchCV, cross_val_score
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, 10]}
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=inner_cv,
    scoring="f1",
    n_jobs=-1,
)
nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv, scoring="f1")
print(f"Nested CV F1: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")
The outer score is approximately unbiased with respect to tuning, because each outer test fold was never used during hyperparameter selection.
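It can also be useful to see which hyperparameters each outer fold selected. One possible way, sketched here on a small synthetic dataset with a deliberately tiny grid (both assumptions made to keep the run short), is to call cross_validate with return_estimator=True and read best_params_ from each fitted search.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
search = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    {"max_depth": [2, 5]},  # tiny grid, for illustration only
    cv=inner_cv,
    scoring="f1",
)

# return_estimator=True keeps the GridSearchCV object fitted on each outer training set
results = cross_validate(search, X, y, cv=outer_cv, scoring="f1", return_estimator=True)
chosen = [est.best_params_ for est in results["estimator"]]
for score, params in zip(results["test_score"], chosen):
    print(f"outer F1 = {score:.3f}, selected {params}")
```

If the selected parameters differ widely between outer folds, that suggests the tuning itself is unstable, which the averaged score alone would not reveal.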
Leak-Proof Pipelines
A subtle but critical mistake: fitting preprocessing (scaling, encoding, imputation) on the full dataset before splitting. This leaks information from the test fold into training. The fix is to wrap all transformations inside a Pipeline:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])
scores = cross_val_score(pipe, X, y, cv=skf, scoring="f1")
Now imputation and scaling are fit only on the training folds and applied to the test fold. This small change can mean the difference between a trustworthy estimate and a dangerously optimistic one.
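To see what "fit only on the training folds" means in practice, the following sketch reproduces a single fold by hand and inspects the scaler's learned mean. The synthetic data and the logistic regression model are assumptions for illustration; the imputer is left out because this data has no missing values.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X = X * 10 + 3  # shift and stretch so the scaling statistics are non-trivial

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
train_idx, test_idx = next(skf.split(X, y))

pipe = Pipeline([("scale", StandardScaler()), ("model", LogisticRegression(max_iter=1000))])
pipe.fit(X[train_idx], y[train_idx])  # the same fit cross_val_score performs per fold

fold_mean = pipe.named_steps["scale"].mean_
uses_train_only = bool(np.allclose(fold_mean, X[train_idx].mean(axis=0)))
matches_full_data = bool(np.allclose(fold_mean, X.mean(axis=0)))
print(f"Scaler mean equals training-fold mean: {uses_train_only}")
print(f"Scaler mean equals full-data mean:     {matches_full_data}")
```

Had the scaler been fit on all of X before splitting, its statistics would have included the test rows, which is exactly the leak the pipeline avoids.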
Repeated Cross-Validation
A single run of 5-fold CV gives 5 scores. Their standard deviation may be high. Repeated CV runs multiple rounds with different random shuffles:
from sklearn.model_selection import RepeatedStratifiedKFold
rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(model, X, y, cv=rskf, scoring="accuracy")
print(f"Repeated CV: {scores.mean():.4f} ± {scores.std():.4f}")
This gives 50 scores total and a more stable mean, though it takes 10 times longer.
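One way to see the stabilizing effect is to group the 50 scores by repeat and compare spreads. This sketch assumes synthetic data and a logistic regression model; the reshape relies on the splitter yielding its folds one repeat at a time.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

n_splits, n_repeats = 5, 10
rskf = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=42)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=rskf, scoring="accuracy")

# Scores arrive repeat by repeat, so each row below is one complete 5-fold run
per_repeat_means = scores.reshape(n_repeats, n_splits).mean(axis=1)
print(f"Total scores: {scores.size}")
print(f"Spread of single-fold scores:  {scores.std():.4f}")
print(f"Spread of per-repeat CV means: {per_repeat_means.std():.4f}")
```

The spread of the per-repeat means shows how much a single 5-fold run depends on the shuffle. The 50 scores are not independent, since they come from the same rows, so their standard deviation should not be read as a standard error.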
Custom Splitters
Sometimes built-in splitters do not fit your needs. Scikit-learn accepts any iterable of (train_indices, test_indices):
import numpy as np
import pandas as pd

def purged_time_split(df, date_col, gap_days=7, n_splits=5):
    """Time-series split with a gap to prevent information leakage."""
    col = pd.to_datetime(df[date_col])
    dates = pd.DatetimeIndex(col.unique()).sort_values()
    fold_size = len(dates) // (n_splits + 1)
    for i in range(n_splits):
        train_end = dates[(i + 1) * fold_size]
        test_start = train_end + pd.Timedelta(days=gap_days)
        test_end = dates[min((i + 2) * fold_size, len(dates) - 1)]
        # Yield positional indices: scikit-learn indexes X by position, not by df.index labels
        train_idx = np.flatnonzero((col <= train_end).to_numpy())
        test_idx = np.flatnonzero(((col >= test_start) & (col <= test_end)).to_numpy())
        if len(test_idx) > 0:  # skip folds whose test window is emptied by the gap
            yield train_idx, test_idx
The “purge gap” removes samples near the boundary to prevent leakage from lagged features.
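When rows are evenly spaced in time, a simpler alternative to a hand-written splitter may be the gap parameter of TimeSeriesSplit, which drops a fixed number of rows (not calendar days) between each training and test window. The sketch below uses synthetic regression data and a Ridge model as illustrative assumptions, and also shows that a plain list of index pairs is accepted as cv.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))  # 120 time-ordered rows, illustrative only
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=120)

gap = 7  # counted in rows, not calendar days
tscv = TimeSeriesSplit(n_splits=5, gap=gap)
splits = list(tscv.split(X))  # a plain list of (train_idx, test_idx) pairs
rows_skipped = [int(test[0] - train[-1] - 1) for train, test in splits]
print("Rows dropped between train and test:", rows_skipped)

# Any iterable of index pairs can be passed as cv
scores = cross_val_score(Ridge(), X, y, cv=splits, scoring="r2")
print(f"Mean R^2: {scores.mean():.3f}")
```

For irregularly spaced timestamps, a date-based splitter is still the safer choice, because a fixed row gap would cover a varying amount of calendar time.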
Scoring Metrics
cross_val_score accepts a single scorer: either a scorer name such as "f1" or a callable built with sklearn.metrics.make_scorer. Use cross_validate for multiple metrics at once:
from sklearn.model_selection import cross_validate
results = cross_validate(
    pipe, X, y, cv=skf,
    scoring=["accuracy", "f1", "roc_auc"],
    return_train_score=True,
)
print(f"Test AUC: {results['test_roc_auc'].mean():.4f}")
print(f"Train AUC: {results['train_roc_auc'].mean():.4f}")
A large gap between train and test scores signals overfitting.
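Custom metrics can be mixed with built-in ones by passing a dict to scoring. In this sketch the dict keys ("auc", "f2") are arbitrary labels chosen for illustration, and the F2 scorer is just one example of a metric wrapped with make_scorer.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_validate

X, y = make_classification(n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0)

scoring = {
    "auc": "roc_auc",                        # built-in scorer, referenced by name
    "f2": make_scorer(fbeta_score, beta=2),  # custom scorer that weights recall more
}
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(
    LogisticRegression(max_iter=1000), X, y,
    cv=skf, scoring=scoring, return_train_score=True,
)
auc_gap = results["train_auc"].mean() - results["test_auc"].mean()
print(f"Test F2: {results['test_f2'].mean():.4f}")
print(f"Train-test AUC gap: {auc_gap:.4f}")
```

The result keys follow the pattern test_<name> and train_<name>, using whatever names appear in the dict.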
Computational Considerations
| Strategy | Folds | Cost | Use Case |
|---|---|---|---|
| 5-fold | 5 | Moderate | Default for most problems |
| 10-fold | 10 | Higher | When stability matters |
| LOO | n | Very high | Tiny datasets (<200 rows) |
| Repeated 5×10 | 50 | 10× the cost of 5-fold | Publication-grade estimates |
| Nested (5×3) | 5 outer, each running a 3-fold inner search | 5 × (3 × candidates + 1) fits; 140 for the 9-candidate grid above | When tuning + evaluating |
For large datasets (millions of rows), even 5-fold can be slow. Consider using a single hold-out set plus bootstrapped confidence intervals, or subsample for CV and use the full dataset for final training.
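The hold-out plus bootstrap idea can be sketched as follows: fit once, then resample the test-set predictions to get an interval. The dataset, model, 1,000 resamples, and the percentile method are all illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)  # fit only once
correct = (model.predict(X_test) == y_test).astype(float)

rng = np.random.default_rng(0)
n_boot = 1000
# Resample test rows with replacement; the model is never refit
boot_idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))
boot_acc = correct[boot_idx].mean(axis=1)
lower, upper = np.percentile(boot_acc, [2.5, 97.5])
print(f"Hold-out accuracy: {correct.mean():.4f} (95% CI {lower:.4f} to {upper:.4f})")
```

This interval reflects uncertainty from the finite test set only. Unlike CV, it says nothing about how much the score would change if the model were trained on a different sample.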
Real-World Gotchas
- Shuffling time-series data: Destroys temporal dependencies. Always use TimeSeriesSplit.
- Ignoring groups: If patient A appears in both train and test, the model memorizes patient-specific patterns instead of generalizable ones.
- Reporting best-fold score: Always report the mean across folds, not the best single fold.
- Comparing models on different splits: Use the same CV splitter (same random_state) when comparing models so the comparison is fair.
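For the last point, a fair comparison can be sketched by reusing one seeded splitter for both models and looking at the fold-by-fold differences. The two models and the synthetic data here are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# One splitter with a fixed seed, so both models see identical folds
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores_lr = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=skf)
scores_rf = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0), X, y, cv=skf
)

fold_diffs = scores_rf - scores_lr  # paired, fold-by-fold differences
print(f"Per-fold difference (RF - LR): {fold_diffs.round(3)}")
print(f"Mean difference: {fold_diffs.mean():.4f}")
```

Because both models are scored on the same test folds, differences caused by an unusually easy or hard fold cancel out instead of being mistaken for a model effect.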
One thing to remember: Cross-validation is not just a technique, it is a discipline. Done correctly, it is one of the most reliable ways to judge whether your model will work in the real world.