Time Series Cross-Validation in Python — Core Concepts

Why standard cross-validation fails

Standard k-fold cross-validation randomly assigns observations to folds. For time series, this creates two problems:

  1. Temporal leakage: observations from the future end up in the training set, so validation scores are optimistically biased.
  2. Broken autocorrelation: random splitting destroys the time-dependent structure the model needs to learn.

The result: a model that looks great in cross-validation but fails in production.
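The leakage problem can be made visible by counting folds where a training index comes later than a test index. The snippet below is a minimal sketch on a toy array of 12 ordered observations; the helper name leaky_folds is made up for this illustration.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

# 12 time-ordered observations; the row index doubles as the timestamp.
X = np.arange(12).reshape(-1, 1)

def leaky_folds(splitter, X):
    """Count folds whose training set contains an index later than the earliest test index."""
    return sum(int(train.max() > test.min()) for train, test in splitter.split(X))

kfold_leaks = leaky_folds(KFold(n_splits=4, shuffle=True, random_state=0), X)
tscv_leaks = leaky_folds(TimeSeriesSplit(n_splits=4), X)

print(f"KFold folds with future data in training: {kfold_leaks} of 4")
print(f"TimeSeriesSplit folds with future data in training: {tscv_leaks} of 4")
```

With shuffled k-fold, a fold is leak-free only if its test set happens to be the final block of the series, so at most one of the four folds can avoid the problem. TimeSeriesSplit never trains on indices that come after the test set.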

The three main strategies

1. Expanding window (growing window)

The training set grows with each fold. Every historical observation is always included.

Fold 1: [====TRAIN====][TEST]
Fold 2: [=====TRAIN=====][TEST]
Fold 3: [======TRAIN======][TEST]
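Fold 4: [=======TRAIN=======][TEST]

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Assumes X and y are time-ordered pandas objects (they are indexed with .iloc below)
# and model is any scikit-learn style regressor.
tscv = TimeSeriesSplit(n_splits=5)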

for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(f"Fold {fold}: MAE = {mean_absolute_error(y_test, predictions):.4f}")

Best for: when you want to use all available history and your model benefits from more data.

2. Sliding window (fixed-size window)

The training set stays the same size. The window slides forward.

Fold 1: [====TRAIN====][TEST]
Fold 2:    [====TRAIN====][TEST]
Fold 3:       [====TRAIN====][TEST]
Fold 4:          [====TRAIN====][TEST]

def sliding_window_cv(X, y, train_size, test_size, step=1):
    """Generate sliding window train/test splits."""
    splits = []
    for start in range(0, len(X) - train_size - test_size + 1, step):
        train_end = start + train_size
        test_end = train_end + test_size
        splits.append((
            list(range(start, train_end)),
            list(range(train_end, test_end)),
        ))
    return splits

Best for: when you suspect old data hurts more than it helps (concept drift, regime changes).
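Sliding-window splits can be passed straight to scikit-learn, because cross_val_score accepts any iterable of (train, test) index pairs as its cv argument. The following is a sketch under assumptions: sliding_window_splits is a hypothetical generator written for this example, and the data is a synthetic noisy trend.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def sliding_window_splits(n_samples, train_size, test_size, step):
    """Yield (train_idx, test_idx) arrays for a fixed-size sliding window."""
    for start in range(0, n_samples - train_size - test_size + 1, step):
        train_end = start + train_size
        yield np.arange(start, train_end), np.arange(train_end, train_end + test_size)

# Synthetic data: a noisy linear trend (illustrative only).
rng = np.random.default_rng(0)
X = np.arange(100, dtype=float).reshape(-1, 1)
y = 0.5 * X.ravel() + rng.normal(scale=1.0, size=100)

# step == test_size keeps the test sets from overlapping.
splits = list(sliding_window_splits(len(X), train_size=40, test_size=10, step=10))
scores = cross_val_score(
    LinearRegression(), X, y, cv=splits, scoring="neg_mean_absolute_error"
)
print(f"{len(splits)} folds, mean MAE = {-scores.mean():.3f}")
```

Setting step equal to test_size is a design choice, not a requirement. A smaller step produces more folds, but their test sets overlap and the fold scores become more strongly correlated.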

3. Expanding window with gap

Adds a buffer between training and test sets to prevent near-future leakage:

Fold 1: [====TRAIN====]--gap--[TEST]
Fold 2: [=====TRAIN=====]--gap--[TEST]

tscv = TimeSeriesSplit(n_splits=5, gap=7)  # 7-step gap

Best for: when your model uses features that depend on recent history (rolling means, lag features) and you want to ensure no information from the test period leaks into these features.
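To see what the gap does to the indices, the sketch below prints where each training set ends and each test set starts. The array length, gap and test size are arbitrary values chosen for readability.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=3, test_size=3, gap=2)

gaps = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    # Number of observations that belong to neither the training nor the test set.
    skipped = test_idx[0] - train_idx[-1] - 1
    gaps.append(skipped)
    print(
        f"Fold {fold}: train ends at {train_idx[-1]}, "
        f"test starts at {test_idx[0]}, skipped {skipped}"
    )
```

The skipped observations are taken off the end of each training set, so a larger gap leaves less training data in every fold.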

Controlling fold sizes

TimeSeriesSplit has useful parameters:

tscv = TimeSeriesSplit(
    n_splits=5,
    max_train_size=365,   # cap training window (sliding behavior)
    test_size=30,         # fixed test size per fold
    gap=0,                # gap between train and test
)

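Setting max_train_size caps the training window. Once the available history exceeds the cap, the expanding window behaves like a sliding window; earlier folds may still train on fewer observations.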
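A quick way to check this behavior is to compare training-set sizes with and without the cap. The numbers below (28 observations, a cap of 10) are arbitrary and chosen only to keep the output small.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(28).reshape(-1, 1)

expanding = TimeSeriesSplit(n_splits=4, test_size=5)
capped = TimeSeriesSplit(n_splits=4, test_size=5, max_train_size=10)

expanding_sizes = [len(train_idx) for train_idx, _ in expanding.split(X)]
capped_sizes = [len(train_idx) for train_idx, _ in capped.split(X)]

print("expanding:", expanding_sizes)
print("capped:   ", capped_sizes)
```

In this toy setup the first fold has fewer observations than the cap, so both splitters agree on it. From the second fold onward the capped splitter keeps only the 10 most recent observations.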

Evaluating forecast horizons

Different models perform differently at different horizons. Test across multiple horizons:

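import numpy as np

def multi_horizon_cv(series, model_fn, horizons=(1, 7, 30), n_splits=5):
    """Return, for each h in horizons, the MAE averaged over forecast steps 1..h.

    series is a time-ordered pandas Series. model_fn(train) must return a fitted
    model with a forecast(steps=...) method, as statsmodels results objects have.
    """
    results = {h: [] for h in horizons}
    n = len(series)
    test_size = max(horizons)
    if n <= test_size * n_splits:
        raise ValueError("series is too short for the requested horizons and n_splits")

    for i in range(n_splits):
        split_point = n - test_size * (n_splits - i)
        train = series.iloc[:split_point]
        model = model_fn(train)

        # Forecast once at the longest horizon and slice for the shorter ones.
        forecast = np.asarray(model.forecast(steps=test_size))
        actual = series.iloc[split_point:split_point + test_size].to_numpy()

        for h in horizons:
            results[h].append(np.mean(np.abs(forecast[:h] - actual[:h])))

    return {h: float(np.mean(errs)) for h, errs in results.items()}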
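The function above averages the error over every step up to h. If the quantity of interest is the error at exactly step h, a variant can index that single forecast step instead. The sketch below assumes the same forecast(steps=...) interface; NaiveLastValue and error_at_horizon are hypothetical names introduced for this example.

```python
import numpy as np
import pandas as pd

class NaiveLastValue:
    """Hypothetical stand-in model: always forecasts the last training value."""

    def __init__(self, train):
        self.last = float(train.iloc[-1])

    def forecast(self, steps):
        return np.full(steps, self.last)

def error_at_horizon(series, model_fn, horizons=(1, 7, 30), n_splits=5):
    """Mean absolute error at exactly step h, not averaged over steps 1..h."""
    max_h = max(horizons)
    n = len(series)
    errors = {h: [] for h in horizons}
    for i in range(n_splits):
        split_point = n - max_h * (n_splits - i)
        model = model_fn(series.iloc[:split_point])
        forecast = np.asarray(model.forecast(steps=max_h))
        actual = series.iloc[split_point:split_point + max_h].to_numpy()
        for h in horizons:
            # Step h is at position h - 1 in the forecast array.
            errors[h].append(abs(forecast[h - 1] - actual[h - 1]))
    return {h: float(np.mean(e)) for h, e in errors.items()}

# On a linear ramp the naive forecast is off by exactly h at step h.
series = pd.Series(np.arange(200, dtype=float))
result = error_at_horizon(series, NaiveLastValue)
print(result)
```

The linear ramp is a sanity check rather than a realistic series: because the naive forecast misses by exactly h at step h, the growth of error with horizon is easy to verify by eye.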

Metrics for time series evaluation

Choose metrics that match your use case:

Metric  Formula                                                 Best for
MAE     mean(|actual - predicted|)                              When all errors matter equally
RMSE    √mean((actual - predicted)²)                            When large errors are especially bad
MAPE    mean(|actual - predicted| / |actual|)                   Percentage terms; fails near zero
sMAPE   mean(2|actual - predicted| / (|actual| + |predicted|))  Symmetric alternative to MAPE
MASE    MAE / naive_MAE                                         Scale-independent; compares to naive baseline
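The first four metrics in the table translate directly into NumPy. This is a minimal sketch that assumes both inputs are NumPy arrays of equal length; the sample values are made up for illustration.

```python
import numpy as np

def mae(actual, predicted):
    return np.mean(np.abs(actual - predicted))

def rmse(actual, predicted):
    return np.sqrt(np.mean((actual - predicted) ** 2))

def mape(actual, predicted):
    # Undefined when any actual value is zero, unstable when values are near zero.
    return np.mean(np.abs(actual - predicted) / np.abs(actual))

def smape(actual, predicted):
    return np.mean(
        2 * np.abs(actual - predicted) / (np.abs(actual) + np.abs(predicted))
    )

actual = np.array([100.0, 200.0, 400.0])
predicted = np.array([110.0, 180.0, 400.0])

for name, fn in [("MAE", mae), ("RMSE", rmse), ("MAPE", mape), ("sMAPE", smape)]:
    print(f"{name}: {fn(actual, predicted):.4f}")
```

MAPE and sMAPE are returned as fractions here; multiply by 100 to report them as percentages.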

MASE (Mean Absolute Scaled Error) is a good default because it is scale-independent, so scores are comparable across series, and it measures your model against a naive “predict last value” baseline computed on the training data:

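import numpy as np

def mase(actual, predicted, train_series):
    """Mean Absolute Scaled Error: model MAE divided by the in-sample naive MAE."""
    # Convert to arrays so pandas inputs are compared by position, not by index label.
    actual, predicted = np.asarray(actual), np.asarray(predicted)

    # One-step naive forecast on the training data: predict the previous value.
    mae_naive = np.mean(np.abs(np.diff(np.asarray(train_series))))

    if mae_naive == 0:
        # Constant training series: the scaling factor is undefined.
        return np.inf

    mae_model = np.mean(np.abs(actual - predicted))
    return mae_model / mae_naive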

# MASE < 1 → better than naive; MASE > 1 → worse than naive

Common misconception

Many people think more folds always mean better evaluation. In time series, too many folds with small training sets produce unreliable estimates because the model never gets enough history to learn seasonal patterns. With default settings, the first fold of a 10-split TimeSeriesSplit trains on only about 1/11 of the data (roughly 9%), which is rarely enough to learn yearly seasonality from monthly data.
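The effect is easy to check by looking at the size of the first training set for different values of n_splits. The sketch below assumes 11 years of monthly observations, a figure chosen only so the arithmetic comes out in whole years.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_months = 132  # 11 years of monthly data (illustrative assumption)
X = np.arange(n_months).reshape(-1, 1)

first_train_sizes = {}
for n_splits in (3, 5, 10):
    # Only the first fold is needed, so take one item from the split generator.
    train_idx, _ = next(iter(TimeSeriesSplit(n_splits=n_splits).split(X)))
    first_train_sizes[n_splits] = len(train_idx)
    print(f"n_splits={n_splits}: first fold trains on {len(train_idx)} months")
```

With 10 splits the first model sees a single year of data, so it observes each calendar month only once and cannot separate a seasonal pattern from noise.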

The one thing to remember: Time series cross-validation is not just about preventing data leakage — the choice between expanding and sliding windows, the gap size, the number of folds, and the evaluation metric all directly affect whether your model selection process leads to genuinely better forecasts.
