Time Series Cross-Validation in Python — Core Concepts
Why standard cross-validation fails
Standard k-fold cross-validation randomly assigns observations to folds. For time series, this creates two problems:
- Temporal leakage — future data appears in training, inflating accuracy.
- Broken autocorrelation — random splitting destroys the time-dependent structure the model needs to learn.
The result: a model that looks great in cross-validation but fails in production.
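The leakage problem can be checked directly from the split indices. The sketch below assumes a hypothetical series of 100 points whose index order is its time order, and counts the folds in which at least one training index comes after the earliest test index. The helper name `leaky_folds` is illustrative, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import KFold, TimeSeriesSplit

n_samples = 100                  # hypothetical series length
indices = np.arange(n_samples)   # index order stands in for time order


def leaky_folds(splitter):
    """Count folds where some training index is later than the earliest test index."""
    return sum(
        int(train_idx.max() > test_idx.min())
        for train_idx, test_idx in splitter.split(indices)
    )


kfold_leaks = leaky_folds(KFold(n_splits=5, shuffle=True, random_state=0))
tscv_leaks = leaky_folds(TimeSeriesSplit(n_splits=5))

print(f"Shuffled KFold: {kfold_leaks} of 5 folds train on future data")
print(f"TimeSeriesSplit: {tscv_leaks} of 5 folds train on future data")
```

With shuffled folds, every training set contains observations that come after part of its test set. The time-ordered splitter never does.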
The three main strategies
1. Expanding window (anchored start)
The training set grows with each fold. It always starts at the first observation and ends just before the test window, so every earlier observation is included.
Fold 1: [====TRAIN====][TEST]
Fold 2: [=====TRAIN=====][TEST]
Fold 3: [======TRAIN======][TEST]
Fold 4: [=======TRAIN=======][TEST]
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import TimeSeriesSplit

# Assumes X and y are a time-ordered pandas DataFrame/Series and
# model is any estimator with fit/predict.
tscv = TimeSeriesSplit(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(tscv.split(X)):
    X_train, X_test = X.iloc[train_idx], X.iloc[test_idx]
    y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    print(f"Fold {fold}: MAE = {mean_absolute_error(y_test, predictions):.4f}")
Best for: when you want to use all available history and your model benefits from more data.
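When no per-fold logic is needed, the same expanding-window evaluation can be handed to scikit-learn's `cross_val_score` by passing the splitter as `cv`. The sketch below is self-contained and runs on synthetic data (a linear trend plus noise), which is an assumption made only so the example executes. Substitute your own `X`, `y`, and estimator.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
n = 200
t = np.arange(n)
y = 0.5 * t + rng.normal(scale=2.0, size=n)  # synthetic trend + noise
X = t.reshape(-1, 1)                         # time index as the only feature

scores = cross_val_score(
    LinearRegression(),
    X,
    y,
    cv=TimeSeriesSplit(n_splits=5),
    scoring="neg_mean_absolute_error",  # scikit-learn reports errors as negatives
)
mae_per_fold = -scores
print(mae_per_fold.round(3))
```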
2. Sliding window (fixed size)
The training set stays the same size. The window slides forward, dropping the oldest observations as newer ones enter.
Fold 1: [====TRAIN====][TEST]
Fold 2:       [====TRAIN====][TEST]
Fold 3:             [====TRAIN====][TEST]
Fold 4:                   [====TRAIN====][TEST]
def sliding_window_cv(X, y, train_size, test_size, step=1):
    """Generate sliding window train/test splits."""
    splits = []
    for start in range(0, len(X) - train_size - test_size + 1, step):
        train_end = start + train_size
        test_end = train_end + test_size
        splits.append((
            list(range(start, train_end)),
            list(range(train_end, test_end)),
        ))
    return splits
Best for: when you suspect old data hurts more than it helps (concept drift, regime changes).
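To see what the sliding splits look like, here is a small self-contained sketch on a toy series of 10 points. The generator name `sliding_splits` and the sizes are illustrative assumptions. It uses the same index arithmetic as the function above but yields folds lazily.

```python
def sliding_splits(n_samples, train_size, test_size, step):
    """Yield (train_indices, test_indices) for a fixed-size sliding window."""
    last_start = n_samples - train_size - test_size
    for start in range(0, last_start + 1, step):
        train_end = start + train_size
        yield (
            list(range(start, train_end)),
            list(range(train_end, train_end + test_size)),
        )


folds = list(sliding_splits(n_samples=10, train_size=4, test_size=2, step=2))
for train_idx, test_idx in folds:
    print(train_idx, test_idx)
```

Setting `step` equal to `test_size` keeps the test windows from overlapping. A step of 1 produces many heavily overlapping folds, which costs more compute and yields strongly correlated fold scores.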
3. Expanding window with gap
Adds a buffer between training and test sets to prevent near-future leakage:
Fold 1: [====TRAIN====]--gap--[TEST]
Fold 2: [=====TRAIN=====]--gap--[TEST]
tscv = TimeSeriesSplit(n_splits=5, gap=7) # 7-step gap
Best for: when your model uses features that depend on recent history (rolling means, lag features) and you want to ensure no information from the test period leaks into these features.
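The effect of `gap` is easiest to see in the indices themselves. The sketch below assumes 60 hypothetical daily observations and prints, for each fold, where training ends, where testing starts, and how many observations are skipped in between.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(60).reshape(-1, 1)  # 60 hypothetical daily observations
tscv = TimeSeriesSplit(n_splits=3, test_size=10, gap=7)

fold_info = []
for train_idx, test_idx in tscv.split(X):
    # Observations between the last training index and the first test index.
    skipped = int(test_idx[0] - train_idx[-1] - 1)
    fold_info.append((len(train_idx), skipped, int(test_idx[0])))
    print(
        f"train 0..{train_idx[-1]}, "
        f"test {test_idx[0]}..{test_idx[-1]}, skipped {skipped}"
    )
```

The gap is taken from the end of each training set. The test windows stay where they would be without a gap.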
Controlling fold sizes
TimeSeriesSplit has useful parameters:
tscv = TimeSeriesSplit(
    n_splits=5,
    max_train_size=365,  # cap training window (sliding behavior)
    test_size=30,        # fixed test size per fold
    gap=0,               # gap between train and test
)
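Setting max_train_size caps the training window. Folds expand until they reach that length, after which the oldest observations are dropped and the split behaves like a sliding window.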
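A quick way to confirm how `max_train_size` behaves is to print the training window of each fold. The sketch below assumes 100 observations and a cap of 65, values chosen only so that the first fold falls under the cap and the later ones hit it.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100).reshape(-1, 1)  # 100 hypothetical observations
tscv = TimeSeriesSplit(n_splits=4, max_train_size=65, test_size=10)

train_sizes = []
train_starts = []
for train_idx, test_idx in tscv.split(X):
    train_sizes.append(len(train_idx))
    train_starts.append(int(train_idx[0]))
    print(f"train {train_idx[0]}..{train_idx[-1]} ({len(train_idx)} rows)")
```

The first fold has only 60 rows of history, so it uses all of them. From the second fold on, the window is held at 65 rows and its start moves forward.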
Evaluating forecast horizons
Different models perform differently at different horizons. Test across multiple horizons:
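import numpy as np
import pandas as pd

def multi_horizon_cv(series, model_fn, horizons=(1, 7, 30), n_splits=5):
    """Evaluate a model over several forecast horizons.

    Assumes series is a pandas Series, model_fn(train) returns a fitted
    model, and model.forecast(steps=h) returns the next h predictions.
    For each horizon h, the score is the MAE over steps 1..h.
    """
    results = {h: [] for h in horizons}
    n = len(series)
    test_size = max(horizons)
    if n <= test_size * n_splits:
        raise ValueError("series is too short for the requested horizons and splits")
    for i in range(n_splits):
        # Forecast origins are spaced test_size apart, counting back from the end.
        split_point = n - test_size * (n_splits - i)
        train = series.iloc[:split_point]
        model = model_fn(train)
        for h in horizons:
            if split_point + h <= n:
                predicted = np.asarray(model.forecast(steps=h))
                actual = series.iloc[split_point:split_point + h].to_numpy()
                results[h].append(np.mean(np.abs(predicted - actual)))
    return {h: np.mean(errs) for h, errs in results.items()}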
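The function above scores each horizon h by averaging the error over steps 1 through h. If you instead want the error at exactly h steps ahead, a sketch along the following lines may help. It uses a naive last-value forecast as a stand-in for a real model, and the function name `naive_error_by_horizon` and the toy series are assumptions for illustration.

```python
import numpy as np


def naive_error_by_horizon(values, horizons, n_origins):
    """Mean absolute error at exactly h steps ahead for a last-value forecast."""
    values = np.asarray(values, dtype=float)
    max_h = max(horizons)
    # Use the last n_origins forecast origins that still leave max_h future points.
    first_origin = len(values) - max_h - n_origins + 1
    origins = range(first_origin, len(values) - max_h + 1)
    errors = {h: [] for h in horizons}
    for origin in origins:
        forecast = values[origin - 1]        # last value observed before the origin
        for h in horizons:
            actual = values[origin + h - 1]  # value h steps after that observation
            errors[h].append(abs(actual - forecast))
    return {h: float(np.mean(errs)) for h, errs in errors.items()}


series = np.arange(50, dtype=float)  # toy series that rises by 1 per step
result = naive_error_by_horizon(series, horizons=(1, 7, 30), n_origins=5)
print(result)
```

On this toy series the naive forecast is off by exactly h at horizon h, which makes the growth of error with horizon easy to verify by hand.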
Metrics for time series evaluation
Choose metrics that match your use case:
| Metric | Formula | Best for |
|---|---|---|
| MAE | mean(abs(actual - predicted)) | When all errors matter equally |
| RMSE | √mean((actual - predicted)²) | When large errors are especially bad |
| MAPE | mean(abs(actual - predicted) / abs(actual)) | Percentage terms; fails near zero |
| sMAPE | mean(2 · abs(actual - predicted) / (abs(actual) + abs(predicted))) | Symmetric alternative to MAPE |
| MASE | MAE / naive_MAE | Scale-independent; compares to naive baseline |
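The first four formulas in the table translate directly into NumPy. The sketch below is a minimal version that assumes equal-length float arrays with no missing values. The sample numbers are made up for illustration, and MAPE and sMAPE are returned as fractions, not percentages.

```python
import numpy as np


def mae(actual, predicted):
    return float(np.mean(np.abs(actual - predicted)))


def rmse(actual, predicted):
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))


def mape(actual, predicted):
    # Undefined if any actual value is zero.
    return float(np.mean(np.abs(actual - predicted) / np.abs(actual)))


def smape(actual, predicted):
    denominator = np.abs(actual) + np.abs(predicted)
    return float(np.mean(2 * np.abs(actual - predicted) / denominator))


actual = np.array([100.0, 200.0, 400.0])
predicted = np.array([110.0, 180.0, 400.0])

print(f"MAE   = {mae(actual, predicted):.4f}")
print(f"RMSE  = {rmse(actual, predicted):.4f}")
print(f"MAPE  = {mape(actual, predicted):.4f}")
print(f"sMAPE = {smape(actual, predicted):.4f}")
```

RMSE comes out larger than MAE here because squaring gives the 20-unit miss more weight than the 10-unit miss.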
MASE (Mean Absolute Scaled Error) is a good default because it is comparable across series with different scales. It divides your model's MAE by the in-sample MAE of a naive “predict last value” forecast computed on the training series:
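def mase(actual, predicted, train_series):
    """Mean Absolute Scaled Error: a scale-independent metric."""
    # Convert to arrays so pandas index alignment cannot distort the subtraction.
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    # In-sample one-step error of the naive "predict last value" forecast.
    naive_errors = np.abs(np.diff(np.asarray(train_series, dtype=float)))
    mae_naive = np.mean(naive_errors)
    if mae_naive == 0:
        return np.inf
    mae_model = np.mean(np.abs(actual - predicted))
    return mae_model / mae_naive

# MASE < 1 → better than naive; MASE > 1 → worse than naive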
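A worked example with small, made-up numbers shows how the ratio is formed. The arithmetic is written inline here so the block runs on its own.

```python
import numpy as np

train = np.array([1.0, 2.0, 4.0, 7.0])   # hypothetical training series
actual = np.array([8.0, 9.0])            # hypothetical test values
predicted = np.array([7.0, 11.0])        # hypothetical model forecasts

# Naive baseline: consecutive differences in training are 1, 2, 3.
mae_naive = np.mean(np.abs(np.diff(train)))      # (1 + 2 + 3) / 3 = 2.0
# Model errors on the test set are 1 and 2.
mae_model = np.mean(np.abs(actual - predicted))  # (1 + 2) / 2 = 1.5
mase_value = mae_model / mae_naive               # 1.5 / 2.0 = 0.75

print(f"MASE = {mase_value:.2f}")
```

A value of 0.75 means the model's average error is three quarters of what the naive forecast achieved in-sample.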
Common misconception
Many people think more folds always mean better evaluation. In time series, too many folds leave the early folds with small training sets, and their scores are unreliable because the model never sees enough history to learn seasonal patterns. With TimeSeriesSplit(n_splits=10), the first fold trains on only about 1/11 of the data (roughly 9%), which is rarely enough to learn yearly seasonality from monthly data.
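The training-set sizes are easy to inspect before committing to a fold count. The sketch below assumes 132 monthly observations (11 years, a hypothetical length) and compares the training sizes produced by 3 and 10 splits.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

n_months = 132  # hypothetical: 11 years of monthly data
X = np.arange(n_months).reshape(-1, 1)

train_sizes = {}
for n_splits in (3, 10):
    splitter = TimeSeriesSplit(n_splits=n_splits)
    train_sizes[n_splits] = [len(train_idx) for train_idx, _ in splitter.split(X)]
    print(f"n_splits={n_splits}: training sizes {train_sizes[n_splits]}")
```

With 10 splits the first fold trains on a single year, so the model sees each calendar month once and has nothing to compare it against. With 3 splits the first fold already covers almost three years.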
The one thing to remember: Time series cross-validation is not just about preventing data leakage — the choice between expanding and sliding windows, the gap size, the number of folds, and the evaluation metric all directly affect whether your model selection process leads to genuinely better forecasts.
See Also
- Python ARIMA Forecasting: How ARIMA models use patterns in past numbers to predict the future, explained like a bedtime story.
- Python Autocorrelation Analysis: How today's number is connected to yesterday's, and why that connection is the secret weapon of time series analysis.
- Python Exponential Smoothing: How exponential smoothing weighs recent events more heavily to predict what happens next, like trusting fresh memories more than old ones.
- Python Multivariate Time Series: Why tracking multiple things at once gives you better predictions than tracking each one alone.
- Python Prophet Forecasting: How Facebook's Prophet tool predicts the future by breaking data into easy-to-understand pieces.