Cross-Validation in Python — Deep Dive

Implement k-fold, stratified, grouped, and time-series cross-validation in scikit-learn with leak-proof pipelines and nested CV for hyperparameter selection.

The Statistical Foundation

Cross-validation estimates the expected prediction error: the average loss a model incurs on data it has not seen, as opposed to its (usually lower) training error. Formally, we want to approximate:

E[L(Y, f̂(X))]

where L is a loss function, Y is the true target, and f̂ is the model trained on a sample. A single hold-out set gives one noisy estimate. By averaging over k disjoint test sets, we reduce variance at the cost of increased computation.
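To make the averaging concrete, here is a minimal sketch of the k-fold estimate written as an explicit loop. The synthetic dataset, the logistic-regression model, and the zero-one loss are assumptions chosen only for illustration; any estimator and loss could stand in for them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import zero_one_loss
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data, assumed here purely for illustration.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
fold_losses = []
for train_idx, test_idx in kf.split(X):
    # Fit on k-1 folds, measure the loss on the held-out fold.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    fold_losses.append(zero_one_loss(y[test_idx], model.predict(X[test_idx])))

# The CV estimate of the expected loss is the mean over the k test folds.
cv_error = float(np.mean(fold_losses))

# cross_val_score runs the same loop; accuracy is 1 minus the zero-one loss.
helper_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=kf, scoring="accuracy"
)
helper_error = float(1.0 - helper_scores.mean())

print(f"Manual CV error: {cv_error:.4f}, via cross_val_score: {helper_error:.4f}")
```

Because the splitter uses a fixed random_state, both computations see the same folds and should agree up to floating-point rounding.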

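The bias-variance tradeoff in CV depends on k. Large k (e.g., leave-one-out) has low bias because each model trains on nearly all of the data, but the estimate can have high variance: the training sets overlap almost completely, so the fold errors are strongly correlated. Small k (e.g., 2-fold) has higher, pessimistic bias because each model trains on a smaller share of the data (only half of it when k = 2). A common practical compromise is k = 5 or k = 10.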

Scikit-Learn Implementations

Basic K-Fold

from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier

kf = KFold(n_splits=5, shuffle=True, random_state=42)
model = RandomForestClassifier(n_estimators=100, random_state=42)

scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")
print(f"Mean accuracy: {scores.mean():.4f} ± {scores.std():.4f}")

Stratified K-Fold

Essential for imbalanced targets:

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=skf, scoring="f1")

Each fold preserves the original class ratio. For a dataset with 5 percent positive class, each fold will contain approximately 5 percent positives.
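The claim about preserved class ratios is easy to check. The sketch below assumes a synthetic dataset with roughly 5 percent positives (generated with make_classification, not taken from the article) and prints the positive rate of each test fold.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Assumed synthetic data: about 5 percent positives, no label noise.
X, y = make_classification(
    n_samples=2000, weights=[0.95, 0.05], flip_y=0, random_state=0
)
overall_rate = float(y.mean())

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Positive rate inside each held-out fold.
fold_rates = [float(y[test_idx].mean()) for _, test_idx in skf.split(X, y)]

print(f"Overall positive rate: {overall_rate:.3f}")
for i, rate in enumerate(fold_rates):
    print(f"Fold {i}: {rate:.3f}")
```

Each fold's rate should sit within a fraction of a percentage point of the overall rate, which plain KFold does not guarantee on small or heavily skewed datasets.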

Grouped K-Fold

Use GroupKFold when samples from the same group (e.g., same patient, same user) must not appear in both train and test:

from sklearn.model_selection import GroupKFold

gkf = GroupKFold(n_splits=5)
scores = cross_val_score(model, X, y, cv=gkf, groups=group_ids, scoring="roc_auc")

This prevents data leakage when observations within a group are correlated. Medical studies and user-behavior datasets almost always need grouped splitting.
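A quick sanity check for any grouped split is to confirm that no group lands on both sides. The sketch below uses made-up data with 20 hypothetical patient IDs; group_ids here is generated at random and is not the article's dataset.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
n_samples = 200
# Hypothetical setup: each row belongs to one of 20 patients.
group_ids = rng.integers(0, 20, size=n_samples)
X = rng.normal(size=(n_samples, 4))
y = rng.integers(0, 2, size=n_samples)

gkf = GroupKFold(n_splits=5)
overlaps = []
for train_idx, test_idx in gkf.split(X, y, groups=group_ids):
    # Groups present on both sides of the split; should be empty.
    shared = set(group_ids[train_idx]) & set(group_ids[test_idx])
    overlaps.append(len(shared))

print(f"Shared groups per fold: {overlaps}")
```

The same check works for any splitter that takes a groups argument, and it is cheap enough to keep as an assertion in a training script.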

Time-Series Split

from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    print(f"Train: {train_idx[0]}-{train_idx[-1]}, Test: {test_idx[0]}-{test_idx[-1]}")

By default the training window grows with each split. For a rolling window, cap it with max_train_size (early splits may still be shorter than the cap) or implement a custom splitter:

tscv = TimeSeriesSplit(n_splits=5, max_train_size=1000)
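To see the difference between the expanding and the capped window, the following sketch compares training-set sizes on a toy array of 60 ordered samples. The array and the cap of 20 are assumptions picked to keep the numbers readable.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy data: 60 samples already sorted by time.
X = np.arange(60).reshape(-1, 1)

expanding = TimeSeriesSplit(n_splits=5)
rolling = TimeSeriesSplit(n_splits=5, max_train_size=20)

expanding_sizes = [len(train_idx) for train_idx, _ in expanding.split(X)]
rolling_sizes = [len(train_idx) for train_idx, _ in rolling.split(X)]

print(f"Expanding window train sizes: {expanding_sizes}")
print(f"Capped window train sizes:    {rolling_sizes}")

# In both cases every training index precedes every test index.
ordered = all(
    train_idx.max() < test_idx.min() for train_idx, test_idx in rolling.split(X)
)
print(f"Train always before test: {ordered}")
```

Note that max_train_size is an upper bound: the first split has only 10 samples available, so it stays below the cap.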

Nested Cross-Validation

When the same CV loop both selects hyperparameters and evaluates the model, the reported score is optimistic: the winning configuration was chosen precisely because it scored best on those folds. Nested CV solves this with two loops:

Inner loop: tunes hyperparameters (e.g., GridSearchCV).
Outer loop: evaluates the tuned model on held-out data.

from sklearn.model_selection import GridSearchCV, cross_val_score

param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, 10]}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=inner_cv,
    scoring="f1",
    n_jobs=-1,
)

nested_scores = cross_val_score(grid_search, X, y, cv=outer_cv, scoring="f1")
print(f"Nested CV F1: {nested_scores.mean():.4f} ± {nested_scores.std():.4f}")

The outer score is approximately unbiased because each outer test fold is never used during hyperparameter selection.
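A side-by-side comparison shows what nesting buys. This sketch swaps in a decision tree on synthetic data (both are assumptions, chosen so it runs in seconds) and reports the non-nested best_score_ next to the nested estimate.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Assumed synthetic data and a small, cheap model for a fast run.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=5, random_state=0
)
param_grid = {"max_depth": [2, 4, 8, None], "min_samples_leaf": [1, 5, 20]}

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=inner_cv,
    scoring="accuracy",
)

# Non-nested: the score of the winning configuration on the folds that chose it.
search.fit(X, y)
non_nested = float(search.best_score_)

# Nested: the whole search is re-run inside each outer training fold.
nested = float(cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy").mean())

print(f"Non-nested best score: {non_nested:.4f}")
print(f"Nested CV score:       {nested:.4f}")
```

The non-nested number tends to be the higher of the two, though on a single dataset and seed the ordering is not guaranteed. The difference usually grows with the size of the search grid.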

Leak-Proof Pipelines

A subtle but critical mistake: fitting preprocessing (scaling, encoding, imputation) on the full dataset before splitting. This leaks information from the test fold into training. The fix is to wrap all transformations inside a Pipeline:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=100, random_state=42)),
])

scores = cross_val_score(pipe, X, y, cv=skf, scoring="f1")

Now imputation and scaling are fit only on the training folds and applied to the test fold. This small change can mean the difference between a trustworthy estimate and a dangerously optimistic one.
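The size of the leak depends on the preprocessing step. Scaling usually leaks little, so the sketch below uses supervised feature selection instead (a swapped-in step, not the imputation and scaling pipeline above), where the effect is dramatic. The data is pure noise by construction, so an honest estimate should hover around chance.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
# Pure noise: 100 rows, 2000 features, labels independent of the features.
X = rng.normal(size=(100, 2000))
y = rng.integers(0, 2, size=100)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: features are selected using labels from ALL rows, then cross-validated.
X_selected = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = float(
    cross_val_score(LogisticRegression(max_iter=1000), X_selected, y, cv=cv).mean()
)

# Leak-proof: selection is refit inside each training fold via the Pipeline.
pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=20)),
    ("model", LogisticRegression(max_iter=1000)),
])
honest = float(cross_val_score(pipe, X, y, cv=cv).mean())

print(f"Leaky accuracy:  {leaky:.3f}")
print(f"Honest accuracy: {honest:.3f}")
```

The leaky run reports accuracy well above chance on data that contains no signal at all, which is exactly the kind of optimistic estimate a Pipeline prevents.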

Repeated Cross-Validation

A single run of 5-fold CV gives 5 scores. Their standard deviation may be high. Repeated CV runs multiple rounds with different random shuffles:

from sklearn.model_selection import RepeatedStratifiedKFold

rskf = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
scores = cross_val_score(model, X, y, cv=rskf, scoring="accuracy")
print(f"Repeated CV: {scores.mean():.4f} ± {scores.std():.4f}")

This gives 50 scores total and a more stable mean, though it takes 10 times longer.
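One useful way to read the 50 scores is to group them by repeat. In this sketch (synthetic data and a logistic-regression model are assumed for speed), the scores are reshaped so each row is one complete 5-fold run, which shows how much the mean moves from shuffle to shuffle.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Assumed synthetic data for illustration.
X, y = make_classification(n_samples=300, random_state=0)

n_splits, n_repeats = 5, 10
rskf = RepeatedStratifiedKFold(
    n_splits=n_splits, n_repeats=n_repeats, random_state=42
)
scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=rskf, scoring="accuracy"
)

# Scores arrive repeat by repeat, so each row is one full 5-fold run.
per_repeat_means = scores.reshape(n_repeats, n_splits).mean(axis=1)

print(f"All {len(scores)} scores: {scores.mean():.4f} ± {scores.std():.4f}")
print(f"Spread of per-repeat means: {per_repeat_means.std():.4f}")
```

The spread of the per-repeat means indicates how much a single 5-fold result depends on the particular shuffle.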

Custom Splitters

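Sometimes built-in splitters do not fit your needs. Scikit-learn accepts any iterable of (train_indices, test_indices), where the indices are row positions rather than index labels:

import numpy as np
import pandas as pd

def purged_time_split(df, date_col, gap_days=7, n_splits=5):
    """Time-series split with a gap to prevent information leakage.

    Yields positional (train_idx, test_idx) arrays, as scikit-learn expects.
    """
    col = df[date_col]
    dates = pd.DatetimeIndex(col.sort_values().unique())
    fold_size = len(dates) // (n_splits + 1)

    for i in range(n_splits):
        train_end = dates[(i + 1) * fold_size]
        # Skip gap_days after the training window before testing starts.
        test_start = train_end + pd.Timedelta(days=gap_days)
        test_end = dates[min((i + 2) * fold_size, len(dates) - 1)]

        # np.flatnonzero returns row positions, which stay valid even when
        # df.index is not 0..n-1 (df.index would return labels instead).
        train_idx = np.flatnonzero((col <= train_end).to_numpy())
        test_idx = np.flatnonzero(((col >= test_start) & (col <= test_end)).to_numpy())

        if len(test_idx) > 0:
            yield train_idx, test_idx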

The “purge gap” removes samples near the boundary to prevent leakage from lagged features.
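When rows are evenly spaced in time, recent scikit-learn versions offer a built-in alternative: TimeSeriesSplit takes a gap parameter, counted in samples rather than days. The sketch below assumes one row per day, so a gap of 7 samples corresponds to 7 days; with irregular timestamps a date-based splitter is still needed.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Assumption: 100 rows, one per day, already sorted by time.
X = np.arange(100).reshape(-1, 1)

gap = 7  # number of samples dropped between train and test
tscv = TimeSeriesSplit(n_splits=5, gap=gap)

# Materialize the splits so they can be reused; a generator is consumed once.
splits = list(tscv.split(X))

# Distance between the last training row and the first test row.
boundary_gaps = [
    int(test_idx[0] - train_idx[-1]) for train_idx, test_idx in splits
]
print(f"Rows between last train and first test index: {boundary_gaps}")
```

A list of index pairs like splits can be passed directly as the cv argument of cross_val_score. Converting a generator-based splitter to a list first avoids the surprise of it being empty on second use.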

Scoring Metrics

cross_val_score accepts a single scorer: either a predefined scoring name such as "f1" or a callable built with sklearn.metrics.make_scorer. Use cross_validate for multiple metrics at once:

from sklearn.model_selection import cross_validate

results = cross_validate(
    pipe, X, y, cv=skf,
    scoring=["accuracy", "f1", "roc_auc"],
    return_train_score=True,
)
print(f"Test AUC: {results['test_roc_auc'].mean():.4f}")
print(f"Train AUC: {results['train_roc_auc'].mean():.4f}")

A large gap between train and test scores signals overfitting.
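Predefined names and custom scorers can be mixed by passing a dict. The sketch below assumes synthetic imbalanced data and adds an F2 scorer built with make_scorer (F2 is an illustrative choice, not a metric the article prescribes), then computes the train/test gap per metric.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, make_scorer
from sklearn.model_selection import StratifiedKFold, cross_validate

# Assumed synthetic data with a 20 percent positive class.
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=0)

scoring = {
    "f1": "f1",            # predefined scorer name
    "roc_auc": "roc_auc",  # predefined scorer name
    "f2": make_scorer(fbeta_score, beta=2),  # custom scorer, recall-weighted
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(
    LogisticRegression(max_iter=1000), X, y,
    cv=cv, scoring=scoring, return_train_score=True,
)

# Train-minus-test gap for each metric; large positive values hint at overfitting.
gaps = {
    name: float(results[f"train_{name}"].mean() - results[f"test_{name}"].mean())
    for name in scoring
}
for name, gap in gaps.items():
    print(f"{name}: train-test gap = {gap:+.4f}")
```

The keys of the scoring dict become the suffixes of the result keys (test_f2, train_f2, and so on).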

Computational Considerations

Strategy	Folds	Cost	Use Case
5-fold	5	Moderate	Default for most problems
10-fold	10	Higher	When stability matters
LOO	n	Very high	Tiny datasets (<200 rows)
Repeated 5×10	50	10× of 5-fold	Publication-grade estimates
Nested (5×3)	15 inner + 5 outer	~(3 × candidates + 1)× of 5-fold	When tuning + evaluating

For large datasets (millions of rows), even 5-fold can be slow. Consider using a single hold-out set plus bootstrapped confidence intervals, or subsample for CV and use the full dataset for final training.
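The hold-out plus bootstrap option can be sketched as follows. The model is trained once, and only the hold-out predictions are resampled, so the extra cost is negligible. The dataset, the model, and the 2000 bootstrap draws are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumed synthetic data standing in for a large dataset.
X, y = make_classification(n_samples=5000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Train once; no refitting happens during the bootstrap.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
correct = (model.predict(X_test) == y_test).astype(float)
point_estimate = float(correct.mean())

# Resample the hold-out rows with replacement and recompute accuracy each time.
rng = np.random.default_rng(0)
n_boot = 2000
boot_idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))
boot_acc = correct[boot_idx].mean(axis=1)

# Percentile interval covering the central 95 percent of bootstrap accuracies.
ci_low, ci_high = (float(v) for v in np.percentile(boot_acc, [2.5, 97.5]))
print(f"Accuracy: {point_estimate:.4f} (95% CI {ci_low:.4f} to {ci_high:.4f})")
```

This interval reflects uncertainty from the finite test set only. Unlike CV, it does not capture how much the model itself would change with a different training sample.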

Real-World Gotchas

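Shuffling time-series data: Lets the model train on observations from the future and test on the past, which inflates scores. Use TimeSeriesSplit or another splitter that preserves temporal order.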
Ignoring groups: If patient A appears in both train and test, the model memorizes patient-specific patterns instead of generalizable ones.
Reporting best-fold score: Always report the mean across folds, not the best single fold.
Comparing models on different splits: Use the same CV splitter (same random_state) when comparing models so the comparison is fair.
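The last point can be made concrete with a paired comparison. In this sketch, two models (chosen arbitrarily for illustration, on assumed synthetic data) are scored with one shared splitter, so their fold scores line up and can be subtracted fold by fold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Assumed synthetic data for illustration.
X, y = make_classification(n_samples=400, random_state=0)

# One splitter with a fixed seed, shared by both models.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# A fixed integer random_state yields identical folds on every call to split().
first_pass = [test_idx for _, test_idx in cv.split(X, y)]
second_pass = [test_idx for _, test_idx in cv.split(X, y)]
same_folds = all(np.array_equal(a, b) for a, b in zip(first_pass, second_pass))

scores_a = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
scores_b = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=0), X, y, cv=cv
)

# Paired differences: each entry compares the two models on the same test fold.
diff = scores_b - scores_a
print(f"Identical folds across calls: {same_folds}")
print(f"Mean difference (B - A): {diff.mean():+.4f} ± {diff.std():.4f}")
```

Looking at the per-fold differences rather than two separate means removes the variation caused by easy and hard folds, which often makes a small but consistent advantage visible.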

One thing to remember: Cross-validation is not just a technique — it is a discipline. Done correctly, it is the single most reliable way to know if your model will work in the real world.
