Stationarity Testing in Python — Deep Dive

Unit root econometrics in Python — ADF regression mechanics, Phillips-Perron corrections, structural breaks, and automated differencing pipelines.

The unit root problem formally

Consider the AR(1) process: yₜ = ρyₜ₋₁ + εₜ

If |ρ| < 1, the process is stationary. If ρ = 1, the process is a random walk (unit root). The Dickey-Fuller test directly estimates ρ by rewriting the equation:

Δyₜ = (ρ − 1)yₜ₋₁ + εₜ = γyₜ₋₁ + εₜ

The null hypothesis is γ = 0 (unit root). Under the null, the test statistic does not follow a standard t-distribution — it follows the Dickey-Fuller distribution, which has heavier left tails.

The Augmented version (ADF)

The ADF test adds lagged differences to control for serial correlation:

Δyₜ = α + βt + γyₜ₋₁ + Σᵢ δᵢΔyₜ₋ᵢ + εₜ

Where α is a constant, βt is a time trend, and the lagged differences absorb autocorrelation so that εₜ is approximately white noise.

from statsmodels.tsa.stattools import adfuller

# With constant only (most common)
result_c = adfuller(series.dropna(), regression="c", autolag="AIC")

# With constant and trend
result_ct = adfuller(series.dropna(), regression="ct", autolag="AIC")

# No constant, no trend (rare)
result_n = adfuller(series.dropna(), regression="n", autolag="AIC")

The choice of regression specification matters:

"c" — appropriate when the series fluctuates around a non-zero mean.
"ct" — appropriate when you suspect a deterministic trend.
"n" — appropriate only when the series has zero mean under the alternative.

Using the wrong specification reduces test power. When in doubt, use "c".

Lag selection

The autolag="AIC" option selects the number of augmenting lags by minimizing AIC. Alternatives include "BIC" (tends to select fewer lags) and "t-stat" (removes lags until the last one is significant). BIC is preferred for large samples; AIC for small ones.

Phillips-Perron test

The PP test is an alternative to ADF that uses a non-parametric correction for serial correlation instead of adding lagged differences:

from arch.unitroot import PhillipsPerron

pp = PhillipsPerron(series.dropna(), lags=None, trend="c")
print(f"PP Statistic: {pp.stat:.4f}")
print(f"p-value: {pp.pvalue:.4f}")

PP is more robust to heteroscedasticity and serial correlation of unknown form, but it can have worse finite-sample performance than ADF. In practice, ADF and PP usually agree.

KPSS test mechanics

KPSS decomposes the series as: yₜ = ξt + rₜ + εₜ

Where rₜ = rₜ₋₁ + uₜ is a random walk component. Under the null of stationarity, the variance of uₜ is zero (the random walk component vanishes). The test statistic is based on the cumulative sum of residuals from a regression:

from statsmodels.tsa.stattools import kpss

# Test for level stationarity
stat_c, p_c, lags_c, crit_c = kpss(series.dropna(), regression="c", nlags="auto")

# Test for trend stationarity
stat_ct, p_ct, lags_ct, crit_ct = kpss(series.dropna(), regression="ct", nlags="auto")

The nlags="auto" uses the Schwert criterion for bandwidth selection. The truncated p-values (KPSS only reports p between 0.01 and 0.10) are a known limitation.

Advanced: Zivot-Andrews test for structural breaks

Standard tests assume no structural breaks. If the series has a one-time level shift or trend change, ADF may fail to reject the unit root null even when the series is stationary around a break:

from arch.unitroot import ZivotAndrews

za = ZivotAndrews(series.dropna(), method="both", lags=None)
print(f"ZA Statistic: {za.stat:.4f}")
print(f"p-value: {za.pvalue:.4f}")
print(f"Breakpoint: {za.breakpoint}")  # index of detected break

The method parameter specifies what kind of break:

"const" — break in intercept only
"trend" — break in trend only
"both" — break in both intercept and trend

Variance stationarity and ARCH effects

The tests above focus on mean stationarity. A series can have a constant mean but time-varying variance (ARCH effects), which is common in financial returns:

from arch.unitroot import DFGLS
from statsmodels.stats.diagnostic import het_arch

# Test for mean stationarity with better power
dfgls = DFGLS(series.dropna(), trend="c", lags=None)
print(f"DF-GLS Statistic: {dfgls.stat:.4f}")

# Test for ARCH effects (variance non-stationarity)
resid = series.diff().dropna()
arch_stat, arch_p, _, _ = het_arch(resid, nlags=5)
print(f"ARCH test p-value: {arch_p:.4f}")
# p < 0.05 → variance is time-varying

DF-GLS is a more powerful version of ADF that detrends the data first using GLS, producing better finite-sample performance.

Comprehensive stationarity testing function

import pandas as pd
import numpy as np
from statsmodels.tsa.stattools import adfuller, kpss

def comprehensive_stationarity_test(series, name="series"):
    """Run ADF + KPSS with joint interpretation."""
    clean = series.dropna()
    
    if len(clean) < 20:
        return {"error": "Insufficient data (need ≥ 20 observations)"}
    
    # ADF test
    adf_stat, adf_p, adf_lags, adf_nobs, adf_crit, _ = adfuller(
        clean, regression="c", autolag="AIC"
    )
    
    # KPSS test
    kpss_stat, kpss_p, kpss_lags, kpss_crit = kpss(
        clean, regression="c", nlags="auto"
    )
    
    adf_stationary = adf_p < 0.05
    kpss_stationary = kpss_p > 0.05
    
    # Joint interpretation
    if adf_stationary and kpss_stationary:
        conclusion = "STATIONARY"
        action = "No transformation needed"
    elif not adf_stationary and not kpss_stationary:
        conclusion = "NON-STATIONARY"
        action = "Apply differencing (d=1)"
    elif adf_stationary and not kpss_stationary:
        conclusion = "TREND-STATIONARY"
        action = "Remove deterministic trend or use regression='ct' in KPSS"
    else:
        conclusion = "INCONCLUSIVE"
        action = "Try larger sample, different lag selection, or PP test"
    
    return {
        "series": name,
        "n_obs": len(clean),
        "adf_statistic": round(adf_stat, 4),
        "adf_p_value": round(adf_p, 4),
        "adf_stationary": adf_stationary,
        "kpss_statistic": round(kpss_stat, 4),
        "kpss_p_value": round(kpss_p, 4),
        "kpss_stationary": kpss_stationary,
        "conclusion": conclusion,
        "recommended_action": action,
    }

Automated differencing pipeline

def auto_difference(series, max_d=2, max_D=1, seasonal_period=None):
    """Automatically determine and apply differencing."""
    result = {"original_length": len(series)}
    current = series.copy()
    
    # Seasonal differencing first (if applicable)
    D = 0
    if seasonal_period and len(current) > 2 * seasonal_period:
        for D_candidate in range(1, max_D + 1):
            test_series = current.diff(seasonal_period).dropna()
            kpss_stat, kpss_p, *_ = kpss(test_series, regression="c", nlags="auto")
            if kpss_p > 0.05:
                current = test_series
                D = D_candidate
                break
        else:
            current = current.diff(seasonal_period).dropna()
            D = max_D
    
    # Non-seasonal differencing
    d = 0
    for d_candidate in range(max_d + 1):
        test_series = current.copy()
        for _ in range(d_candidate):
            test_series = test_series.diff().dropna()
        
        adf_p = adfuller(test_series, autolag="AIC")[1]
        if adf_p < 0.05:
            for _ in range(d_candidate):
                current = current.diff().dropna()
            d = d_candidate
            break
    
    result.update({
        "d": d,
        "D": D,
        "seasonal_period": seasonal_period,
        "final_length": len(current),
        "series": current,
    })
    
    return result

Pitfalls and edge cases

Sample size sensitivity — ADF has low power with short series (< 50 observations). The test may fail to reject the null even for stationary series. Consider using DF-GLS for better power.
Near-unit root processes — A series with ρ = 0.99 is technically stationary but behaves like a random walk over finite samples. Tests may give contradictory results.
Structural breaks — A stationary series with a level shift can appear non-stationary to ADF. Always plot the data and consider the Zivot-Andrews test.
Multiple testing — Running many stationarity tests increases the chance of false positives. If testing hundreds of series, apply Bonferroni or BH corrections.
Seasonal unit roots — Standard ADF does not detect seasonal unit roots. Use the OCSB or Canova-Hansen test from pmdarima:

from pmdarima.arima import nsdiffs

D = nsdiffs(series, m=12, test="ocsb")

The one thing to remember: Stationarity testing is a statistical inference problem with real consequences for model selection — using the wrong test specification, ignoring structural breaks, or over-differencing can all lead to models that look good on paper but produce useless forecasts.

pythontime-seriesstationaritystatistics