Demand Forecasting in Python — Deep Dive

Build production forecast pipelines with SARIMA, LightGBM, hierarchical reconciliation, and cold-start strategies in Python.

Production demand forecasting is a pipeline problem. The modeling step gets the most attention, but data preparation, feature engineering, hierarchical reconciliation, and monitoring are where most projects succeed or fail.

Pipeline architecture

A robust forecast pipeline runs on a schedule (daily or weekly) and follows these stages:

Extract — pull sales transactions, promotions calendar, weather data, and inventory snapshots.
Clean — handle returns, out-of-stock periods (demand was present but unmeasurable), and data gaps.
Feature engineering — create lagged features, rolling statistics, calendar indicators, and external regressors.
Model training — fit per-SKU or grouped models.
Reconciliation — ensure forecasts at different hierarchy levels are consistent.
Output — write forecasts to a database for consumption by inventory and planning systems.
Monitor — track accuracy metrics and alert when error spikes.

Handling out-of-stock censoring

When a product is out of stock, observed sales are zero — but true demand was not. Naive models trained on raw data learn to predict lower demand after stockouts, creating a vicious cycle (less stock → lower forecast → even less stock).

import pandas as pd
import numpy as np

def uncensor_demand(sales: pd.Series, inventory: pd.Series, threshold: int = 0) -> pd.Series:
    """Replace zero-sales days where inventory was also zero with NaN for interpolation."""
    censored = (sales == 0) & (inventory <= threshold)
    uncensored = sales.copy()
    uncensored[censored] = np.nan
    return uncensored.interpolate(method="linear").clip(lower=0)

This simple interpolation is a starting point. More sophisticated approaches use Tobit regression or expectation-maximization to estimate the censored demand distribution.

SARIMA with automatic order selection

import pmdarima as pm

def fit_sarima(series: pd.Series, seasonal_period: int = 7) -> pm.ARIMA:
    model = pm.auto_arima(
        series,
        seasonal=True,
        m=seasonal_period,
        stepwise=True,
        suppress_warnings=True,
        error_action="ignore",
        max_order=8,
        trace=False,
    )
    return model

# Forecast next 14 days
model = fit_sarima(weekly_sales, seasonal_period=52)
forecast, conf_int = model.predict(n_periods=14, return_conf_int=True)

SARIMA works well for individual high-volume SKUs with clear seasonality. For thousands of SKUs, fitting individual SARIMA models becomes slow. Global models (below) scale better.

LightGBM global model

A single gradient boosting model trained across all SKUs simultaneously. Each row is a (SKU, date) pair with features:

import lightgbm as lgb

def create_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    df["dow"] = df["date"].dt.dayofweek
    df["month"] = df["date"].dt.month
    df["week_of_year"] = df["date"].dt.isocalendar().week.astype(int)
    df["is_weekend"] = df["dow"].isin([5, 6]).astype(int)

    # Lagged features per SKU
    for lag in [7, 14, 28]:
        df[f"lag_{lag}"] = df.groupby("sku")["sales"].shift(lag)

    # Rolling statistics
    for window in [7, 28]:
        df[f"rolling_mean_{window}"] = (
            df.groupby("sku")["sales"]
            .transform(lambda x: x.shift(1).rolling(window).mean())
        )
        df[f"rolling_std_{window}"] = (
            df.groupby("sku")["sales"]
            .transform(lambda x: x.shift(1).rolling(window).std())
        )

    return df

def train_global_model(train_df: pd.DataFrame, features: list[str]) -> lgb.Booster:
    dtrain = lgb.Dataset(train_df[features], label=train_df["sales"])
    params = {
        "objective": "tweedie",
        "tweedie_variance_power": 1.5,
        "learning_rate": 0.05,
        "num_leaves": 127,
        "min_data_in_leaf": 50,
        "feature_fraction": 0.8,
        "bagging_fraction": 0.8,
        "bagging_freq": 1,
        "verbose": -1,
    }
    return lgb.train(params, dtrain, num_boost_round=500)

The Tweedie objective handles the zero-inflated, right-skewed distribution typical of daily sales data better than squared error.

Hierarchical forecast reconciliation

Forecasts exist at multiple levels: total company, region, store, category, SKU. Independent forecasts at each level rarely add up consistently. Reconciliation forces coherence.

The MinT (Minimum Trace) approach minimizes the trace of the forecast error covariance matrix:

# Using the scikit-hts or hierarchicalforecast library
from hierarchicalforecast.core import HierarchicalReconciliation
from hierarchicalforecast.methods import MinTrace

reconciler = HierarchicalReconciliation(reconcilers=[MinTrace(method="ols")])
reconciled = reconciler.reconcile(base_forecasts, S_matrix, tags)

Bottom-up aggregation (forecast each SKU, sum up) is simpler but ignores information in aggregate trends. Top-down disaggregation (forecast total, split by proportions) loses local patterns. MinT blends both optimally.

Cold-start forecasting

New products have no history. Strategies:

Analog matching — find similar existing products (same category, price range, brand) and use their launch trajectory as a template.
Attribute-based model — train a model where features include product attributes (category, price, season of launch) rather than historical sales. The global LightGBM model handles this naturally if attributes are included as features.
Bayesian priors — start with the category average as a prior and update as sales data arrives. After 2-4 weeks, the model converges to product-specific estimates.

def cold_start_forecast(new_sku_attrs: dict, category_models: dict, days_ahead: int) -> np.ndarray:
    category = new_sku_attrs["category"]
    base_forecast = category_models[category].predict(days_ahead)
    # Scale by price ratio (cheaper products typically sell more units)
    price_ratio = category_models[category].avg_price / new_sku_attrs["price"]
    return base_forecast * np.clip(price_ratio, 0.3, 3.0)

Forecast monitoring and alerting

Track these metrics on a rolling basis:

Forecast bias — systematic over or under prediction. A consistently positive bias means you are ordering too much.
Weighted MAPE by revenue tier — catches degradation on high-impact SKUs first.
Prediction interval coverage — if your 95% intervals contain actual values less than 90% of the time, the model underestimates uncertainty.

def forecast_monitor(actuals: pd.Series, forecasts: pd.Series, upper: pd.Series, lower: pd.Series) -> dict:
    errors = actuals - forecasts
    return {
        "bias": errors.mean(),
        "mae": errors.abs().mean(),
        "mape": (errors.abs() / actuals.replace(0, np.nan)).mean(),
        "coverage_95": ((actuals >= lower) & (actuals <= upper)).mean(),
    }

Set alerts when bias exceeds ±10% of mean demand or when MAPE increases by more than 5 percentage points week-over-week.

Pitfalls from production systems

Leaking future information — including promotional flags that were decided after the forecast date. Always use feature values known at prediction time.
Over-fitting on promotions — a SKU with two historical promotions does not have enough signal to model promotional uplift reliably. Pool promotion effects across similar SKUs.
Ignoring substitution — when product A is out of stock, customers buy product B. Training on raw data attributes B’s spike to B’s demand rather than A’s overflow.
Monthly aggregation hiding weekly patterns — a product that sells exclusively on weekends looks smooth at monthly granularity. Forecast at the lowest actionable granularity.

The one thing to remember: Production demand forecasting chains data cleaning (especially out-of-stock censoring), global ML models that share signal across SKUs, hierarchical reconciliation for consistency, and continuous monitoring to catch drift before it damages inventory decisions.

pythondemand-forecastingsupply-chaintime-series