Time Series Forecasting with Python — Deep Dive

Stationarity and why it matters

Most statistical forecasting methods assume stationarity — the statistical properties (mean, variance) do not change over time. Non-stationary series must be transformed before modeling.

Testing for stationarity

from statsmodels.tsa.stattools import adfuller, kpss

def stationarity_tests(series: pd.Series) -> dict:
    """Run ADF and KPSS tests; interpret jointly."""
    adf_stat, adf_p, *_ = adfuller(series.dropna())
    kpss_stat, kpss_p, *_ = kpss(series.dropna(), regression="c")
    
    return {
        "adf_statistic": adf_stat,
        "adf_p_value": adf_p,
        "adf_stationary": adf_p < 0.05,
        "kpss_statistic": kpss_stat,
        "kpss_p_value": kpss_p,
        "kpss_stationary": kpss_p > 0.05,
    }

When ADF says stationary but KPSS says non-stationary, the series is likely trend-stationary (stationary after removing a deterministic trend). When both say non-stationary, differencing is needed.

Fractional differencing for financial data

Standard integer differencing (d=1) removes too much memory from price series, destroying useful long-range autocorrelation. Fractional differencing preserves more signal:

import numpy as np
import pandas as pd

def frac_diff_weights(d: float, size: int, threshold: float = 1e-5) -> np.ndarray:
    """Compute weights for fractional differencing."""
    weights = [1.0]
    for k in range(1, size):
        w = -weights[-1] * (d - k + 1) / k
        if abs(w) < threshold:
            break
        weights.append(w)
    return np.array(weights[::-1])

def frac_diff(series: pd.Series, d: float = 0.4) -> pd.Series:
    """Apply fractional differencing of order d."""
    weights = frac_diff_weights(d, len(series))
    width = len(weights)
    result = pd.Series(index=series.index, dtype=float)
    
    for i in range(width - 1, len(series)):
        window = series.iloc[i - width + 1 : i + 1].values
        result.iloc[i] = np.dot(weights, window)
    
    return result

Choose d as the minimum value that makes the series stationary (ADF p-value < 0.05). Typically d is between 0.3 and 0.6 for financial returns.

GARCH for volatility forecasting

ARIMA forecasts the level of a series; GARCH forecasts its volatility. Together they model both the expected return and its uncertainty:

from arch import arch_model

def fit_garch(returns: pd.Series) -> dict:
    """Fit GARCH(1,1) and forecast next-day volatility."""
    model = arch_model(returns * 100, vol="Garch", p=1, q=1, dist="t")
    result = model.fit(disp="off")
    
    forecast = result.forecast(horizon=1)
    next_day_vol = np.sqrt(forecast.variance.iloc[-1, 0]) / 100
    
    return {
        "omega": result.params["omega"],
        "alpha": result.params["alpha[1]"],
        "beta": result.params["beta[1]"],
        "persistence": result.params["alpha[1]"] + result.params["beta[1]"],
        "next_day_volatility": next_day_vol,
        "aic": result.aic,
    }

Persistence (alpha + beta) near 1.0 indicates volatility shocks decay slowly — a hallmark of financial markets. EGARCH and GJR-GARCH variants capture the leverage effect (negative returns increase volatility more than positive returns of equal magnitude).

Vector Autoregression for multivariate series

When multiple time series influence each other (GDP, interest rates, inflation), VAR models capture the interdependencies:

from statsmodels.tsa.api import VAR

def fit_var_model(data: pd.DataFrame, maxlags: int = 12) -> dict:
    """Fit VAR model with automatic lag selection."""
    model = VAR(data)
    
    # Select optimal lag order
    lag_order = model.select_order(maxlags=maxlags)
    optimal_lag = lag_order.selected_orders["aic"]
    
    result = model.fit(optimal_lag)
    
    # Granger causality tests
    causality = {}
    for col in data.columns:
        test = result.test_causality(col, causing=data.columns.drop(col).tolist())
        causality[col] = test.pvalue
    
    return {
        "optimal_lag": optimal_lag,
        "aic": result.aic,
        "granger_causality_p_values": causality,
        "model": result,
    }

Granger causality tests reveal whether one series helps predict another beyond what the series predicts itself. This is particularly useful for identifying leading indicators.

Deep learning architectures

LSTM with attention

import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, input_size: int, hidden_size: int = 64, num_layers: int = 2, horizon: int = 1):
        super().__init__()
        self.lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True, dropout=0.2)
        self.attention = nn.Linear(hidden_size, 1)
        self.fc = nn.Linear(hidden_size, horizon)
    
    def forward(self, x):
        # x: (batch, seq_len, features)
        lstm_out, _ = self.lstm(x)  # (batch, seq_len, hidden)
        
        # Attention weights
        attn_weights = torch.softmax(self.attention(lstm_out), dim=1)
        context = (lstm_out * attn_weights).sum(dim=1)
        
        return self.fc(context)

Temporal Fusion Transformer

The Temporal Fusion Transformer (TFT) by Google handles multiple input types — static metadata (stock sector), known future inputs (day of week), and observed past values — through gated residual networks and multi-head attention:

from pytorch_forecasting import TemporalFusionTransformer, TimeSeriesDataSet

# TimeSeriesDataSet handles windowing, normalization, and batching
training = TimeSeriesDataSet(
    data,
    time_idx="time_idx",
    target="value",
    group_ids=["series_id"],
    max_encoder_length=60,
    max_prediction_length=20,
    time_varying_known_reals=["day_of_week", "month"],
    time_varying_unknown_reals=["value", "volume"],
)

model = TemporalFusionTransformer.from_dataset(
    training,
    hidden_size=32,
    attention_head_size=4,
    dropout=0.1,
    learning_rate=0.001,
)

TFT provides interpretable attention weights showing which past time steps and features matter most for each prediction — a rare combination of accuracy and explainability.

Ensemble forecasting

No single model dominates across all series. Combining forecasts reduces variance:

import numpy as np
import pandas as pd

def ensemble_forecast(
    forecasts: dict[str, np.ndarray],
    weights: dict[str, float] | None = None,
) -> np.ndarray:
    """Weighted ensemble of multiple forecast models."""
    if weights is None:
        weights = {k: 1.0 / len(forecasts) for k in forecasts}
    
    total_weight = sum(weights.values())
    result = np.zeros_like(list(forecasts.values())[0])
    
    for name, forecast in forecasts.items():
        result += forecast * weights[name] / total_weight
    
    return result

def optimize_ensemble_weights(
    forecasts: dict[str, np.ndarray],
    actuals: np.ndarray,
) -> dict[str, float]:
    """Find weights that minimize MAE on validation data."""
    from scipy.optimize import minimize
    
    names = list(forecasts.keys())
    forecast_matrix = np.column_stack([forecasts[n] for n in names])
    
    def objective(w):
        combined = forecast_matrix @ w
        return np.mean(np.abs(combined - actuals))
    
    n = len(names)
    constraints = [{"type": "eq", "fun": lambda w: np.sum(w) - 1}]
    bounds = [(0, 1)] * n
    
    result = minimize(objective, x0=np.ones(n) / n, method="SLSQP",
                     bounds=bounds, constraints=constraints)
    
    return dict(zip(names, result.x))

The M4 and M5 forecasting competitions consistently show ensembles outperforming individual models.

Forecast uncertainty quantification

Point forecasts without confidence intervals are dangerous. Two approaches:

Quantile regression

import lightgbm as lgb

def quantile_forecast(X_train, y_train, X_test, quantiles=(0.1, 0.5, 0.9)):
    """Produce prediction intervals via quantile regression."""
    predictions = {}
    
    for q in quantiles:
        model = lgb.LGBMRegressor(objective="quantile", alpha=q, n_estimators=200)
        model.fit(X_train, y_train)
        predictions[f"q{int(q*100)}"] = model.predict(X_test)
    
    return pd.DataFrame(predictions)

Conformal prediction

Conformal prediction provides distribution-free prediction intervals with guaranteed coverage:

def conformal_interval(
    cal_predictions: np.ndarray,
    cal_actuals: np.ndarray,
    test_predictions: np.ndarray,
    alpha: float = 0.1,
) -> tuple[np.ndarray, np.ndarray]:
    """
    Conformal prediction interval with (1-alpha) coverage guarantee.
    """
    residuals = np.abs(cal_actuals - cal_predictions)
    quantile = np.quantile(residuals, 1 - alpha)
    
    lower = test_predictions - quantile
    upper = test_predictions + quantile
    
    return lower, upper

Production deployment patterns

Automated retraining pipeline

from datetime import datetime, timedelta

class ForecastPipeline:
    def __init__(self, model_factory, retrain_interval_days: int = 7):
        self.model_factory = model_factory
        self.retrain_interval = timedelta(days=retrain_interval_days)
        self.last_trained = None
        self.model = None
    
    def predict(self, data: pd.DataFrame, horizon: int) -> dict:
        if self.model is None or datetime.now() - self.last_trained > self.retrain_interval:
            self.model = self.model_factory()
            self.model.fit(data)
            self.last_trained = datetime.now()
        
        forecast = self.model.predict(horizon)
        
        return {
            "forecast": forecast,
            "model_age_hours": (datetime.now() - self.last_trained).total_seconds() / 3600,
            "training_samples": len(data),
        }

Monitoring forecast quality

Track forecast error over time. A sudden increase in error signals regime change and should trigger model retraining or investigation:

  • Rolling MAE: compute MAE over the most recent N predictions.
  • Cumulative sum of errors (CUSUM): detects persistent directional bias.
  • Alert thresholds: if rolling MAE exceeds 2× the historical baseline, flag for review.

The one thing to remember: Production time series forecasting combines domain-appropriate models (statistical for interpretability, deep learning for complex patterns), ensemble methods for robustness, and rigorous uncertainty quantification — with continuous monitoring to detect when the world changes faster than the model adapts.

pythonfinancetime-seriesforecasting

See Also