Time Series Forecasting — Deep Dive

State Space Models: The Structural Approach

State Space Models (SSMs) explicitly model latent state evolution. For a linear Gaussian SSM:

State equation: $x_t = F x_{t-1} + G w_t, \quad w_t \sim \mathcal{N}(0, Q)$

Observation equation: $y_t = H x_t + v_t, \quad v_t \sim \mathcal{N}(0, R)$

The Kalman Filter gives the exact posterior $p(x_t | y_{1:t})$ recursively:

Predict: $x_{t|t-1} = F x_{t-1|t-1}$, $P_{t|t-1} = F P_{t-1|t-1} F^T + G Q G^T$ (the last term is simply $Q$ when $G = I$)

Update:

$K_t = P_{t|t-1} H^T (H P_{t|t-1} H^T + R)^{-1}$ (Kalman gain)

$x_{t|t} = x_{t|t-1} + K_t (y_t - H x_{t|t-1})$

$P_{t|t} = (I - K_t H) P_{t|t-1}$
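The predict and update equations translate almost line for line into NumPy. The following is a minimal sketch rather than a production filter: the function name `kalman_step` and its argument order are illustrative choices, and `G` defaults to the identity, in which case the process noise term is just $Q$.

```python
import numpy as np

def kalman_step(x, P, y, F, H, Q, R, G=None):
    """One predict/update cycle of the linear Gaussian Kalman filter.

    x, P : previous filtered mean (n,) and covariance (n, n)
    y    : current observation (m,)
    Returns the new filtered mean and covariance.
    """
    if G is None:
        G = np.eye(F.shape[0])  # assumption: noise enters every state directly

    # Predict: push the previous posterior through the state equation.
    x_pred = F @ x
    P_pred = F @ P @ F.T + G @ Q @ G.T

    # Update: compute the Kalman gain K = P_pred H^T S^{-1}.
    S = H @ P_pred @ H.T + R                  # innovation covariance
    K = np.linalg.solve(S, H @ P_pred).T      # avoids an explicit inverse
    x_new = x_pred + K @ (y - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

Running the filter over a series is then a loop that feeds each returned `(x_new, P_new)` pair back in with the next observation.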

Structural time series models (Harvey, 1989): Define latent components (trend $\mu_t$, slope $\nu_t$, seasonal $\gamma_t$) as SSM state:

$$y_t = \mu_t + \gamma_t + \epsilon_t$$

$$\mu_t = \mu_{t-1} + \nu_{t-1} + \zeta_t$$

$$\nu_t = \nu_{t-1} + \xi_t$$

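The variance parameters (the entries of $Q$ and $R$) are estimated by maximum likelihood, using the likelihood that the Kalman filter yields as a by-product of its one-step prediction errors. Prophet's trend-plus-seasonality decomposition follows the same additive idea, although Prophet fits its components by regression-style curve fitting rather than with a Kalman filter.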
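To make the likelihood computation concrete, here is a sketch of the trend and slope equations written as a two-dimensional state $(\mu_t, \nu_t)$, with the Gaussian log-likelihood accumulated from the filter's one-step prediction errors. The seasonal component is left out for brevity, the function name is illustrative, and the large initial covariance is an assumed stand-in for a proper diffuse initialization.

```python
import numpy as np

def local_linear_trend_loglik(y, var_eps, var_zeta, var_xi, x0=None, P0=None):
    """Log-likelihood of a local linear trend model (no seasonal term)."""
    F = np.array([[1.0, 1.0],      # mu_t = mu_{t-1} + nu_{t-1} + zeta_t
                  [0.0, 1.0]])     # nu_t = nu_{t-1} + xi_t
    H = np.array([[1.0, 0.0]])     # y_t = mu_t + eps_t
    Q = np.diag([var_zeta, var_xi])
    x = np.zeros(2) if x0 is None else np.asarray(x0, dtype=float)
    # Assumption: a large prior covariance approximates a diffuse start.
    P = np.eye(2) * 1e6 if P0 is None else np.asarray(P0, dtype=float)

    loglik = 0.0
    for obs in y:
        # Predict
        x = F @ x
        P = F @ P @ F.T + Q
        # One-step prediction error and its variance
        e = obs - (H @ x)[0]
        S = (H @ P @ H.T)[0, 0] + var_eps
        loglik += -0.5 * (np.log(2 * np.pi * S) + e**2 / S)
        # Update
        K = (P @ H.T)[:, 0] / S
        x = x + K * e
        P = P - np.outer(K, H @ P)
    return loglik
```

Maximum likelihood estimation then amounts to handing the negative of this function to a numerical optimizer over the three variances, typically on a log scale so they stay positive.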

DeepSSM / KVAE: Neural extensions of SSMs for non-linear, non-Gaussian settings. The linear maps $F$ and $H$ are replaced or parameterized by neural networks. Because the exact posterior is then generally intractable, variational inference is used to approximate it. Used for complex physical systems (robot state estimation, climate modeling).

Conformal Prediction for Time Series

Standard prediction intervals rely on distributional assumptions (e.g., Gaussian errors). Conformal prediction provides distribution-free coverage guarantees.

Split conformal prediction for time series:

  1. Fit model on training data $y_{1:T}$
  2. On calibration set $y_{T+1:T+K}$, compute residuals $r_t = |y_t - \hat{y}_t|$
  3. Set $q_\alpha$ to the empirical $(1-\alpha)(1 + 1/K)$ quantile of the calibration residuals
  4. Prediction interval for $y_{T+K+h}$: $[\hat{y}_{T+K+h} - q_\alpha, \hat{y}_{T+K+h} + q_\alpha]$

Marginal coverage guarantee: $P(y_{T+K+h} \in \text{PI}) \geq 1 - \alpha$ under exchangeability.
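The four steps can be sketched in a few lines. The function names are illustrative, and the model is assumed to have been fitted elsewhere, so only its calibration-set predictions are passed in. The quantile is taken as the $\lceil (K+1)(1-\alpha) \rceil$-th smallest residual, which is one standard way of evaluating the $(1-\alpha)(1+1/K)$ empirical quantile.

```python
import math
import numpy as np

def split_conformal_halfwidth(cal_actual, cal_pred, alpha=0.1):
    """Half-width q_alpha from absolute residuals on the calibration set."""
    residuals = np.sort(np.abs(np.asarray(cal_actual, dtype=float)
                               - np.asarray(cal_pred, dtype=float)))
    K = len(residuals)
    k = math.ceil((K + 1) * (1 - alpha))   # rank of the conformal quantile
    if k > K:
        # Too few calibration points for this alpha: only the
        # trivial (infinite) interval keeps the guarantee.
        return math.inf
    return float(residuals[k - 1])

def conformal_interval(point_forecast, halfwidth):
    """Symmetric interval around a point forecast."""
    return point_forecast - halfwidth, point_forecast + halfwidth
```

With nine calibration residuals and $\alpha = 0.5$, for instance, the rank is $\lceil 10 \times 0.5 \rceil = 5$, so the half-width is the fifth smallest residual.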

The exchangeability problem: Time series are not exchangeable. Observations are serially dependent, and the data distribution can shift over time, so the future need not resemble the calibration period. Standard conformal prediction's guarantees rely on exchangeability, so they do not carry over automatically.

Adaptive Conformal Inference (Gibbs & Candès, 2021): Adaptively adjust $\alpha_t$ at each step based on whether recent predictions fell within the interval:

$$\alpha_{t+1} = \alpha_t + \gamma(\alpha_{nominal} - \mathbf{1}[y_t \notin \hat{C}_t])$$

This keeps the long-run empirical coverage close to $1 - \alpha_{nominal}$ even under distribution shift, at the cost of interval widths that vary over time.
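A small simulation shows the update rule in action. This is a sketch under stated assumptions: the conformity score is the absolute residual, the quantile is taken over all past scores, and the function name, `warmup` length, and default `gamma` are illustrative choices rather than values from the paper.

```python
import numpy as np

def run_aci(y, y_hat, alpha_nominal=0.1, gamma=0.01, warmup=20):
    """Adaptive conformal inference with absolute-residual scores.

    Returns a boolean array marking which intervals covered y_t,
    plus the final value of alpha_t.
    """
    scores = np.abs(np.asarray(y, dtype=float) - np.asarray(y_hat, dtype=float))
    alpha_t = alpha_nominal
    covered = []
    for t in range(warmup, len(scores)):
        if alpha_t <= 0:
            q = np.inf            # alpha_t drifted below 0: infinite interval
        elif alpha_t >= 1:
            q = -np.inf           # alpha_t drifted above 1: empty interval
        else:
            q = np.quantile(scores[:t], 1 - alpha_t)
        miss = scores[t] > q
        covered.append(not miss)
        # The update from the text: a miss lowers alpha_t (wider intervals),
        # a hit raises it slightly (narrower intervals).
        alpha_t = alpha_t + gamma * (alpha_nominal - float(miss))
    return np.array(covered), alpha_t
```

When the noise scale of a series jumps partway through, the fraction of covered points over a long run still stays close to $1 - \alpha_{nominal}$, because every miss pushes subsequent intervals wider.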

Global vs. Local Models

Local models: Fit one model per time series. ARIMA, Prophet, and classical exponential smoothing are local. Each series has its own parameters.

Advantages: parameters are directly interpretable for that series; handles series-specific characteristics.

Disadvantages: can't leverage patterns across series; insufficient data for series with short history.

Global models: Fit one model across all time series simultaneously. N-BEATS, TFT, DeepAR are global.

Advantages: leverages cross-series patterns (seasonal patterns are shared across retail products); works for short-history series (borrows strength from other series).

Disadvantages: may capture series-specific characteristics poorly if the global model's inductive biases don't fit; harder to debug.

When global beats local: When you have many (100+) series with similar underlying structure (same retailer, same region, same type of event), global models typically outperform local ones. In the M5 competition (2020; 42,840 hierarchical series built from 3,049 Walmart products across 10 stores), the top-ranked solutions relied on global models, most prominently LightGBM trained across many series at once, with neural network models also appearing among the leading entries.
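The difference between the two strategies comes down to how the training set is assembled. The sketch below uses a plain autoregressive least-squares model as a stand-in for both cases (it is not ARIMA, N-BEATS, or LightGBM), and the helper names are illustrative: the local variant fits one coefficient vector per series, while the global variant pools the lagged rows of every series into a single fit.

```python
import numpy as np

def lag_matrix(series, p):
    """Rows of the p most recent lags (lag 1 first) and matching targets."""
    s = np.asarray(series, dtype=float)
    X = np.column_stack([s[p - k - 1 : len(s) - k - 1] for k in range(p)])
    return X, s[p:]

def fit_local_ar(series_list, p):
    """Local: one set of AR coefficients per series."""
    coefs = []
    for s in series_list:
        X, y = lag_matrix(s, p)
        coefs.append(np.linalg.lstsq(X, y, rcond=None)[0])
    return coefs

def fit_global_ar(series_list, p):
    """Global: a single set of AR coefficients shared by all series."""
    parts = [lag_matrix(s, p) for s in series_list]
    X = np.vstack([X_i for X_i, _ in parts])
    y = np.concatenate([y_i for _, y_i in parts])
    return np.linalg.lstsq(X, y, rcond=None)[0]
```

A series with only a handful of observations contributes just a few rows to the pooled fit, yet still receives forecasts from coefficients estimated on all the data. In practice the series would usually be scaled to a comparable range before pooling.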

Zero-Shot and Foundation Forecasting Models

The success of LLMs at zero-shot generalization inspired analogous work in time series.

TimeGPT (Garza & Mergenthaler-Canseco, 2023): A large transformer pretrained on 100 billion time series data points from diverse domains (finance, weather, energy, retail). At inference: given a new time series, TimeGPT produces forecasts without any fine-tuning — pure zero-shot.

The paper reports that TimeGPT outperforms ARIMA, Prophet, and N-BEATS on more than 70% of its benchmark datasets. Limitations: the model is closed-source and available only through an API, so the reported results are difficult to reproduce independently.

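Lag-Llama (Rasul et al., 2023): Open-source foundation model for univariate probabilistic time series forecasting. It is a decoder-only transformer built on the Llama architecture. Rather than discretizing values into a vocabulary the way text is tokenized, it builds each input token from lagged values of the series together with date-time features, and it outputs the parameters of a predictive distribution.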

Pretrained on a corpus of real-world time series (weather, energy, finance), Lag-Llama achieves competitive zero-shot forecasting on out-of-domain datasets — demonstrating that the foundation model paradigm transfers to time series.

MOIRAI (Salesforce, 2024): Large-scale foundation model introduced in the paper "Unified Training of Universal Time Series Forecasting Transformers" and pretrained on the LOTSA archive of roughly 27 billion time series observations. Its patch-based architecture supports arbitrary prediction and context lengths. Competitive with specialized models across 9 benchmark datasets.

LLMs for time series (GPT-4 as forecaster): Directly prompting an LLM with time series data serialized as text gives surprisingly competitive performance on some benchmarks. This likely reflects generic pattern-matching ability rather than genuine understanding of temporal dynamics.
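"Serialized as text" can be as simple as printing the values at a fixed precision. The sketch below is one hypothetical scheme: the function names and the prompt wording are invented for illustration, no particular LLM API is assumed, and published approaches differ in how they scale and space the digits.

```python
def serialize_series(values, decimals=2):
    """Render a numeric series as a comma-separated string."""
    return ", ".join(f"{v:.{decimals}f}" for v in values)

def parse_series(text):
    """Parse a comma-separated string of numbers back into floats."""
    return [float(tok) for tok in text.split(",") if tok.strip()]

def build_prompt(history, horizon, decimals=2):
    """Hypothetical prompt asking a model to continue the sequence."""
    return (
        f"Continue this sequence with {horizon} more values, "
        f"comma-separated, no commentary:\n"
        f"{serialize_series(history, decimals)},"
    )
```

The model's text completion would then be passed through `parse_series` to recover numeric forecasts.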

Distributional Forecasting

Rather than point forecasts $\hat{y}_t$, distributional forecasting produces the full conditional distribution $p(y_t | y_{<t})$.

Normalizing Flows: Model $p(y_t | x_t)$ as a transformed Gaussian. The transformation is learned with invertible neural networks (RealNVP, GLOW). Enables exact likelihood evaluation and sampling.
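The mechanics behind "exact likelihood evaluation and sampling" are the change-of-variables formula. The toy flow below uses a single hand-picked invertible map, $y = \sinh(b + e^{a} z)$ with $z \sim \mathcal{N}(0, 1)$, instead of a learned RealNVP or GLOW stack. The parameters `a` and `b` are fixed numbers here, whereas in a conditional forecasting model they would be assumed to come from a network that reads $x_t$.

```python
import numpy as np

def flow_log_prob(y, a=0.0, b=0.0):
    """Exact log-density of y under the toy flow y = sinh(b + exp(a) * z)."""
    y = np.asarray(y, dtype=float)
    # Inverse map: data -> base variable.
    z = (np.arcsinh(y) - b) / np.exp(a)
    log_base = -0.5 * (z**2 + np.log(2 * np.pi))   # standard normal log-pdf
    # log |dz/dy| = -a - 0.5 * log(1 + y^2)
    log_det = -a - 0.5 * np.log1p(y**2)
    return log_base + log_det

def flow_sample(n, a=0.0, b=0.0, seed=0):
    """Sample by pushing Gaussian noise through the forward map."""
    z = np.random.default_rng(seed).standard_normal(n)
    return np.sinh(b + np.exp(a) * z)
```

The resulting density has heavier tails than a Gaussian, yet both its log-likelihood and its samples remain available in closed form, which is the property the learned flows scale up.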

DeepAR (Flunkert et al., Amazon, 2017): LSTM-based sequence model that outputs parameters of a distribution (e.g., Negative Binomial for count data). Trained by maximum likelihood. Widely used for demand forecasting at Amazon — can handle thousands of correlated time series simultaneously.
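The likelihood part of this recipe can be sketched without the network. The code below assumes the mean/shape parameterization of the negative binomial, in which the variance is $\mu + \alpha \mu^2$; `mu` and `alpha` stand in for the per-time-step outputs a trained model would produce, and the function names are illustrative.

```python
import math

def neg_binomial_log_lik(z, mu, alpha):
    """Log-probability of count z under NB with mean mu and shape alpha."""
    inv_alpha = 1.0 / alpha
    return (
        math.lgamma(z + inv_alpha) - math.lgamma(z + 1.0) - math.lgamma(inv_alpha)
        + inv_alpha * math.log(1.0 / (1.0 + alpha * mu))
        + z * math.log(alpha * mu / (1.0 + alpha * mu))
    )

def nll_loss(counts, mus, alphas):
    """Average negative log-likelihood over a sequence of predictions."""
    total = sum(neg_binomial_log_lik(z, m, a)
                for z, m, a in zip(counts, mus, alphas))
    return -total / len(counts)
```

Training by maximum likelihood means minimizing `nll_loss` with respect to the network weights that generate `mus` and `alphas`. Forecast samples are then drawn from the fitted distribution one step at a time.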

Energy Score for distributional evaluation: Generalization of CRPS (Continuous Ranked Probability Score) to multivariate distributions:

$$ES(P, y) = \mathbb{E}_{X \sim P}[\|X - y\|] - \frac{1}{2}\mathbb{E}_{X, X' \sim P}[\|X - X'\|]$$

The energy score is a strictly proper scoring rule: its expected value is minimized only when $P$ equals the true distribution.
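In practice $P$ is usually available only through samples, so both expectations are replaced by sample averages. The sketch below uses the simple plug-in estimator that averages over all sample pairs, including each sample paired with itself, which is slightly biased for small sample sizes. The function name is illustrative.

```python
import numpy as np

def energy_score(samples, y):
    """Sample-based energy score; lower is better.

    samples : (m, d) array of draws from the forecast distribution,
              or (m,) for a univariate forecast
    y       : observed outcome, shape (d,) or a scalar
    """
    samples = np.asarray(samples, dtype=float)
    if samples.ndim == 1:
        samples = samples[:, None]          # treat as m draws of dimension 1
    y = np.atleast_1d(np.asarray(y, dtype=float))

    # E||X - y||: mean distance from each draw to the outcome.
    term1 = np.linalg.norm(samples - y, axis=1).mean()
    # E||X - X'||: mean distance over all pairs of draws.
    diffs = samples[:, None, :] - samples[None, :, :]
    term2 = np.linalg.norm(diffs, axis=-1).mean()
    return term1 - 0.5 * term2
```

For a univariate forecast this reduces to the CRPS of the empirical distribution: two draws at 0 and 2 with an outcome of 1 give $1 - \frac{1}{2} \cdot 1 = 0.5$.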

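Demand forecasting with intermittent data: Many retail items sell 0 units on most days, with occasional large orders. Common likelihoods such as the Gaussian or a plain Poisson fit this pattern poorly because they cannot represent a large mass at zero alongside occasional spikes. Solutions: Croston's method for intermittent demand, and zero-inflated distributions (a mixture of a point mass at zero and a count distribution).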
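Croston's method smooths the size of non-zero demands and the gap between them separately, then forecasts their ratio. The sketch below is a basic version: the initialization (first non-zero demand and the number of periods up to it) is one common convention among several, and the default smoothing constant is an assumption.

```python
def croston(demand, alpha=0.1):
    """Croston's forecast of average demand per period."""
    size = None        # smoothed non-zero demand size
    interval = None    # smoothed number of periods between demands
    periods_since = 1  # periods elapsed since the last non-zero demand
    for d in demand:
        if d > 0:
            if size is None:
                # Initialize from the first observed demand.
                size, interval = float(d), float(periods_since)
            else:
                # Exponential smoothing, applied only when demand occurs.
                size += alpha * (d - size)
                interval += alpha * (periods_since - interval)
            periods_since = 1
        else:
            periods_since += 1
    if size is None:
        return 0.0     # no demand ever observed
    return size / interval
```

A series that sells 6 units every third period yields a size of 6 and an interval of 3, so the forecast is 2 units per period.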

One thing to remember: The gap between time series forecasting and foundation model capabilities is closing rapidly — models like Lag-Llama and MOIRAI suggest that the “pretrain on everything, fine-tune on specific” paradigm will increasingly dominate time series, just as it has in NLP and vision.
