Time Series Forecasting — Core Concepts

ARIMA, Prophet, transformer-based forecasters (Temporal Fusion Transformer, PatchTST), evaluation with MAPE and MAE, and how N-BEATS outperformed established statistical methods on the M4 benchmark.

The Forecasting Landscape

Time series forecasting spans from classical statistical models (ARIMA, exponential smoothing) to modern deep learning approaches. The “right” approach depends heavily on:

Number of series (one or millions?)
Forecast horizon (hours or years?)
Data availability (years of history or weeks?)
Exogenous variables (do external factors affect the series?)
Update frequency (retrain daily or monthly?)

No single approach wins everywhere. Classical methods still dominate in some settings; deep learning leads in others.

Classical Statistical Methods

ARIMA

ARIMA (AutoRegressive Integrated Moving Average) combines three components:

AR(p): Autoregressive part — current value is a linear function of past $p$ values
I(d): Differencing — applied $d$ times to remove trend
MA(q): Moving average part — current value is a linear function of past $q$ errors

Notation: ARIMA(p, d, q)

A simple ARIMA(1,1,1) model: $$\Delta y_t = \phi_1 \Delta y_{t-1} + \epsilon_t + \theta_1 \epsilon_{t-1}$$

Where $\Delta y_t = y_t - y_{t-1}$ (first difference), $\phi_1$ is the AR coefficient, and $\theta_1$ is the MA coefficient.
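To make the equation concrete, here is a minimal sketch of a one-step-ahead point forecast for ARIMA(1,1,1) without a constant term. The function name and the coefficient values are illustrative assumptions, not fitted estimates; in practice the coefficients and the last residual come from a fitted model.

```python
def arima_111_one_step(y_last, y_prev, eps_last, phi1, theta1):
    """One-step-ahead point forecast for ARIMA(1,1,1) with no constant term.

    The future shock eps_{t+1} has expected value 0, so it drops out.
    """
    delta_last = y_last - y_prev                         # most recent first difference
    delta_next = phi1 * delta_last + theta1 * eps_last   # forecast of the next difference
    return y_last + delta_next                           # undo the differencing


# Illustrative values (assumptions, not fitted coefficients).
forecast = arima_111_one_step(
    y_last=105.0, y_prev=100.0, eps_last=2.0, phi1=0.5, theta1=0.25
)
print(forecast)  # 108.0
```

The AR term contributes 0.5 × 5 = 2.5, the MA term 0.25 × 2 = 0.5, and adding the forecast difference of 3.0 back to the last observation gives 108.0.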

SARIMA adds seasonal components: SARIMA(p,d,q)(P,D,Q)[m] where m is the seasonal period (12 for monthly data with annual seasonality).

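The ARMA part of the model assumes a stationary series (constant mean and variance over time). Differencing, the I(d) step, removes trend, and a log transformation applied beforehand can stabilize the variance. Auto-ARIMA (the auto_arima function in the pmdarima package) automates order selection: by default it chooses d with a unit-root test and then searches over p and q by minimizing an information criterion such as AIC or BIC.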
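Differencing is simple enough to sketch directly. The helper below is a hypothetical illustration in plain Python (not the pmdarima API): a lag of 1 removes a linear trend, and a lag equal to the seasonal period m removes a repeating seasonal pattern, which is what the D term in SARIMA does.

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]


# A pure linear trend: first differences are constant.
trend = [3 * t + 10 for t in range(6)]      # 10, 13, 16, 19, 22, 25
print(difference(trend))                    # [3, 3, 3, 3, 3]

# The same trend plus a repeating pattern with period 4.
pattern = [0, 5, -2, 1]
seasonal = [3 * t + 10 + pattern[t % 4] for t in range(12)]
print(difference(seasonal, lag=4))          # [12, 12, 12, 12, 12, 12, 12, 12]
```

In the second case the seasonal pattern cancels exactly and only the trend's contribution over four steps (3 × 4 = 12) remains.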

Prophet

Facebook Prophet (Taylor & Letham, 2017) is an additive model designed for business time series with strong seasonality and missing data:

$$y(t) = g(t) + s(t) + h(t) + \epsilon_t$$

$g(t)$: Trend (piecewise linear or logistic growth)
$s(t)$: Seasonality (Fourier series for multiple periods)
$h(t)$: Holiday effects (user-specified dates with estimated impact)
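The additive structure can be sketched in a few lines. This is not Prophet's implementation or API: the function names, the single changepoint, the Fourier coefficients, and the holiday effect are all assumed values chosen for illustration, whereas Prophet estimates them from data.

```python
import math


def trend(t, k=0.5, m=10.0, changepoint=50, delta=-0.3):
    """Piecewise linear g(t): slope k up to the changepoint, k + delta after it."""
    if t <= changepoint:
        return m + k * t
    # Continue from the value at the changepoint so the line stays continuous.
    return m + k * changepoint + (k + delta) * (t - changepoint)


def seasonality(t, period=7, coeffs=((1.0, 0.5),)):
    """Fourier-series s(t): sum of a_n*cos + b_n*sin terms for n = 1..N."""
    total = 0.0
    for n, (a, b) in enumerate(coeffs, start=1):
        angle = 2 * math.pi * n * t / period
        total += a * math.cos(angle) + b * math.sin(angle)
    return total


def holiday(t, holiday_days=frozenset({25}), effect=4.0):
    """h(t): a fixed additive bump on user-specified days."""
    return effect if t in holiday_days else 0.0


def additive_model(t):
    """Noise-free y(t) = g(t) + s(t) + h(t)."""
    return trend(t) + seasonality(t) + holiday(t)


for day in (0, 25, 60):
    print(day, round(additive_model(day), 3))
```

Because the components are simply added, each one can be plotted and inspected on its own, which is the source of the model's interpretability.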

Prophet is designed for analysts rather than ML engineers — interpretable components, handles missing data and outliers gracefully, and automatically detects trend changepoints.

Prophet has been widely used at Facebook (now Meta) for capacity planning. It suits business metrics with strong weekly or yearly seasonality and series whose trends are human-interpretable.

Deep Learning Methods

N-BEATS

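N-BEATS (Neural Basis Expansion Analysis for Interpretable Time Series Forecasting; Oreshkin et al., 2020) did not compete in the M4 forecasting competition itself. Evaluated afterwards on the M4 benchmark (100,000 time series), it reported lower error than both the statistical benchmark methods and the competition winner, a hybrid of exponential smoothing and a recurrent neural network, while using a pure deep learning architecture.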

Architecture: stack of fully connected blocks with backward and forward residual links. Each block takes past values and produces:

Backcast: what it “explains” from the input
Forecast: its prediction

Doubly residual: each block's backcast is subtracted from its input (removing what has already been explained) before the remainder is passed to the next block, and the blocks' forecasts are summed to give the final prediction.
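The data flow through the stack can be sketched without any neural network. In the sketch below the two blocks are hand-written toy functions (assumptions for illustration only); in N-BEATS each block is a learned fully connected network, but the residual bookkeeping is the same.

```python
HORIZON = 2


def run_stack(x, blocks):
    """Apply doubly residual stacking to the input window x.

    Each block maps the current residual to a (backcast, forecast) pair.
    """
    residual = list(x)
    total_forecast = [0.0] * HORIZON
    for block in blocks:
        backcast, forecast = block(residual)
        # Backward link: remove what this block has explained.
        residual = [r - b for r, b in zip(residual, backcast)]
        # Forward link: partial forecasts are summed.
        total_forecast = [t + f for t, f in zip(total_forecast, forecast)]
    return residual, total_forecast


def level_block(window):
    """Toy block: explains the mean level of its input."""
    mean = sum(window) / len(window)
    return [mean] * len(window), [mean] * HORIZON


def half_block(window):
    """Toy block: explains half of what is left, forecasts half the last value."""
    return [0.5 * v for v in window], [0.5 * window[-1]] * HORIZON


residual, forecast = run_stack([1.0, 2.0, 3.0, 6.0], [level_block, half_block])
print(residual)  # [-1.0, -0.5, 0.0, 1.5]
print(forecast)  # [4.5, 4.5]
```

The first block removes the mean (3.0), so the second block only sees deviations from it; the final forecast is the sum of the two partial forecasts, 3.0 + 1.5.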

Interpretable variant: specific blocks for trend and seasonality (polynomial and Fourier basis expansions), enabling decomposition like classical methods but with deep learning fitting.

Temporal Fusion Transformer (TFT)

Bryan Lim et al. (University of Oxford and Google Cloud AI, 2021) designed TFT specifically for multi-horizon forecasting with:

Multiple input types: Static covariates (features that don’t change), known future inputs (holidays, promotions), observed past inputs (other time series)
Variable selection networks: Learned softmax weights over the input variables (computed by gated residual networks), giving a feature importance for each input type
Sequence-to-sequence with temporal attention: LSTM encoder-decoder + multi-head attention to focus on relevant historical timesteps

TFT produces quantile forecasts: rather than a single point prediction, it outputs, for example, the 10th, 50th, and 90th percentiles of the future distribution, and is trained with a quantile loss. This is critical for supply chain and capacity planning, where you need to know the uncertainty, not just the expected value.

PatchTST

Nie et al. (2023) applied ViT-style “patching” to time series:

Divide the time series into patches of length $P$ (non-overlapping when the stride equals $P$; the paper also uses overlapping patches with a smaller stride)
Apply a linear projection to each patch (like token embedding in ViT)
Process with standard transformer + positional encoding

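PatchTST attends over patches rather than individual time steps. With non-overlapping patches the number of tokens drops from $L$ to $L/P$, so the attention complexity is $O((L/P)^2)$ rather than $O(L^2)$, a reduction by a factor of $P^2$. For $L=336, P=16$: 21 patches and $21^2 = 441$ pairwise attention scores instead of $336^2 = 112{,}896$, which is 256x fewer. Nie et al. report that patching improves both long-horizon forecasting accuracy and efficiency.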
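The token and attention counts can be checked with a short sketch. The make_patches helper is a hypothetical illustration of the patching step only (no projection, no transformer, no padding), and the stride argument is included to show the overlapping case.

```python
def make_patches(series, patch_len, stride=None):
    """Split a series into patches; stride defaults to patch_len (no overlap)."""
    stride = patch_len if stride is None else stride
    last_start = len(series) - patch_len
    return [series[i:i + patch_len] for i in range(0, last_start + 1, stride)]


series = list(range(336))                   # look-back window of L = 336 steps
patches = make_patches(series, patch_len=16)

n_tokens = len(patches)                     # tokens seen by the transformer
pairs_patched = n_tokens ** 2               # pairwise attention scores with patching
pairs_pointwise = len(series) ** 2          # pairwise scores with one token per step

print(n_tokens, pairs_patched, pairs_pointwise, pairs_pointwise // pairs_patched)
# 21 441 112896 256

# Overlapping patches (stride < patch_len) produce more tokens from the same window.
print(len(make_patches(series, patch_len=16, stride=8)))  # 41
```

The saving factor equals the square of the patch length (16² = 256) whenever the patches do not overlap.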

Evaluation Metrics

MAE (Mean Absolute Error): $\text{MAE} = \frac{1}{n}\sum_i |y_i - \hat{y}_i|$. Scale-dependent. Easy to interpret in original units.

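MAPE (Mean Absolute Percentage Error): $\text{MAPE} = \frac{100}{n}\sum_i |y_i - \hat{y}_i| / |y_i|$. Scale-independent, but undefined when $y_i = 0$ and unstable when $y_i$ is close to zero. It is also asymmetric: for non-negative data an under-forecast can cost at most 100%, while an over-forecast is unbounded, so MAPE penalizes over-forecasts more heavily and favors models that predict too low.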

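SMAPE (Symmetric MAPE): $\text{SMAPE} = \frac{200}{n}\sum_i |y_i - \hat{y}_i| / (|y_i| + |\hat{y}_i|)$. Bounded between 0 and 200%, but not truly symmetric: for the same absolute error it penalizes under-forecasts more than over-forecasts, and it is undefined when $y_i$ and $\hat{y}_i$ are both zero.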

MASE (Mean Absolute Scaled Error): MAE divided by MAE of the naïve in-sample one-step forecast. MASE < 1 means the model beats the naïve forecast. Scale-independent, handles zero values. Recommended by Hyndman & Koehler (2006) and used in M competitions.
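The four point-forecast metrics above can be written directly from their definitions. This is a minimal plain-Python sketch; the function names and the sample numbers are illustrative assumptions, and the MASE scale uses the in-sample one-step naive forecast (season=1).

```python
def mae(y, yhat):
    return sum(abs(a - f) for a, f in zip(y, yhat)) / len(y)


def mape(y, yhat):
    # Raises ZeroDivisionError if any actual value is 0.
    return 100 / len(y) * sum(abs(a - f) / abs(a) for a, f in zip(y, yhat))


def smape(y, yhat):
    return 200 / len(y) * sum(
        abs(a - f) / (abs(a) + abs(f)) for a, f in zip(y, yhat)
    )


def mase(y, yhat, train, season=1):
    """MAE on the test set divided by the in-sample MAE of the naive forecast."""
    naive_errors = [
        abs(train[t] - train[t - season]) for t in range(season, len(train))
    ]
    scale = sum(naive_errors) / len(naive_errors)
    return mae(y, yhat) / scale


train = [10.0, 12.0, 11.0, 13.0, 12.0]     # history used to fit the model
actual = [14.0, 10.0]                      # held-out values
predicted = [12.0, 11.0]                   # model forecasts

print(mae(actual, predicted))                  # 1.5
print(round(mape(actual, predicted), 2))       # 12.14
print(round(smape(actual, predicted), 2))      # 12.45
print(mase(actual, predicted, train))          # 1.0
```

Here the naive forecast's in-sample MAE is also 1.5, so a MASE of 1.0 says the model is exactly as accurate as the naive baseline. The percentage metrics also behave differently by direction: mape([100.0], [0.0]) is 100.0, while mape([100.0], [300.0]) is 200.0.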

Quantile loss / Pinball loss: For probabilistic forecasts: $$L_q(y, \hat{y}) = \max(q(y - \hat{y}), (q-1)(y - \hat{y}))$$

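Where $q$ is the quantile level (for example 0.9). Under-predictions are weighted by $q$ and over-predictions by $1 - q$, so the expected loss is minimized when $\hat{y}$ is the true $q$-quantile. Averaged over many forecasts, it rewards a 90th percentile forecast that the actuals fall below about 90% of the time, while also favoring forecasts that are not unnecessarily far from the actuals.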
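A minimal sketch of the pinball loss, together with a separate empirical coverage check, is shown below. The function names and sample values are assumptions for illustration; coverage is not part of the loss itself but is a common companion diagnostic.

```python
def pinball_loss(y, yhat, q):
    """Quantile loss for one observation at quantile level q."""
    diff = y - yhat
    return max(q * diff, (q - 1) * diff)


def mean_pinball_loss(actuals, forecasts, q):
    return sum(pinball_loss(a, f, q) for a, f in zip(actuals, forecasts)) / len(actuals)


def coverage(actuals, forecasts):
    """Fraction of actuals at or below the quantile forecast."""
    return sum(a <= f for a, f in zip(actuals, forecasts)) / len(actuals)


# At q = 0.9, missing low by 2 costs about 1.8; missing high by 2 costs about 0.2.
print(round(pinball_loss(10.0, 8.0, 0.9), 3))
print(round(pinball_loss(10.0, 12.0, 0.9), 3))

actuals = [10.0, 12.0, 9.0, 11.0, 15.0]
q90_forecasts = [13.0] * 5                  # a constant 90th percentile forecast
print(round(mean_pinball_loss(actuals, q90_forecasts, 0.9), 3))
print(coverage(actuals, q90_forecasts))
```

The asymmetric weights are what push a 0.9-quantile forecast upward: being too low is nine times as expensive as being too high by the same amount.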

One thing to remember: Time series forecasting is one domain where classical statistical methods still compete closely with deep learning — the choice between ARIMA, Prophet, N-BEATS, and transformers depends on dataset size, number of series, horizon length, and interpretability requirements more than any general superiority of one approach.
