Autocorrelation Analysis in Python — Core Concepts

Master ACF and PACF plots in Python, interpret common patterns, and use autocorrelation to guide model selection.

ACF vs PACF — two different views

ACF (Autocorrelation Function)

The ACF measures the total correlation between a series and its lagged version. At lag k, it answers: “How correlated is yₜ with yₜ₋ₖ?”

This includes both direct and indirect effects. If lag 1 is strong and lag 2 is also strong, the lag 2 correlation might just be because yₜ is correlated with yₜ₋₁, which is correlated with yₜ₋₂.

from statsmodels.graphics.tsaplots import plot_acf
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(12, 4))
plot_acf(series.dropna(), lags=50, ax=ax)
plt.tight_layout()
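To make the definition concrete, here is a minimal NumPy sketch of the sample ACF, applied to a simulated AR(1) series. The helper name `sample_acf`, the coefficient 0.8, and the seed are illustrative assumptions, not part of statsmodels.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation at lags 0..nlags (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    # Lag-k autocovariance divided by the lag-0 autocovariance
    return np.array([np.dot(x[k:], x[:len(x) - k]) / denom
                     for k in range(nlags + 1)])

# Simulated AR(1) series: y_t = 0.8 * y_{t-1} + noise (illustrative data only)
rng = np.random.default_rng(0)
n = 2000
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.8 * y[t - 1] + rng.standard_normal()

r = sample_acf(y, nlags=3)
print(np.round(r, 2))
```

In an AR(1) process the only direct dependence is at lag 1, so the lag-2 value should land near the square of the lag-1 value: that correlation is entirely indirect.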

PACF (Partial Autocorrelation Function)

The PACF strips out intermediate effects. At lag k, it answers: “How correlated is yₜ with yₜ₋ₖ after removing the effect of all lags in between?”

This isolates the direct relationship at each lag.

from statsmodels.graphics.tsaplots import plot_pacf

fig, ax = plt.subplots(figsize=(12, 4))
plot_pacf(series.dropna(), lags=50, ax=ax, method="ywm")

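Here method="ywm" selects the Yule-Walker estimator without the sample-size adjustment. The adjusted variant ("yw"), which older statsmodels releases used by default in plot_pacf, can return partial autocorrelations outside [-1, 1] for some series; "ywm" avoids that.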
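One way to see what "removing the effect of all lags in between" means: the partial autocorrelation at lag k is the coefficient on yₜ₋ₖ when yₜ is regressed on lags 1 through k together. The sketch below uses ordinary least squares with NumPy. The helper name `pacf_by_regression` and the simulated AR(1) data are assumptions for illustration, and the estimates will differ slightly from the Yule-Walker values that `method="ywm"` produces.

```python
import numpy as np

def pacf_by_regression(x, nlags):
    """PACF at lag k = last coefficient when regressing y_t on y_{t-1}..y_{t-k}."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    out = [1.0]  # lag 0 is 1 by convention
    for k in range(1, nlags + 1):
        # Columns hold lag 1, lag 2, ..., lag k of the series
        X = np.column_stack([x[k - j:len(x) - j] for j in range(1, k + 1)])
        target = x[k:]
        coef, *_ = np.linalg.lstsq(X, target, rcond=None)
        out.append(coef[-1])
    return np.array(out)

# Simulated AR(1) series with coefficient 0.8 (illustrative data only)
rng = np.random.default_rng(0)
n = 2000
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.8 * y[t - 1] + rng.standard_normal()

p = pacf_by_regression(y, nlags=3)
print(np.round(p, 2))
```

For this series the lag-1 value should sit near 0.8 while lags 2 and 3 fall close to zero, even though the ACF at those lags is large.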

Reading ACF/PACF plots

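The shaded blue band marks the 95% confidence interval under the null hypothesis that the true autocorrelation at that lag is zero. A bar that extends beyond the band is statistically significant at the 5% level. With dozens of lags plotted, roughly one in twenty will cross the band by chance alone, so treat an isolated spike at an otherwise unremarkable lag with caution.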
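A rough feel for the band comes from the white-noise approximation ±1.96/√n, the same bound used later in this article's helper function. The sketch below computes it for simulated white noise and counts how many of 50 lags cross it by chance. The helper `sample_acf`, the sample size, and the seed are illustrative assumptions. The band statsmodels draws may be computed differently, for example widening with lag.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation at lags 0..nlags (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[k:], x[:len(x) - k]) / denom
                     for k in range(nlags + 1)])

# Pure white noise: no true autocorrelation at any lag
rng = np.random.default_rng(1)
n = 500
noise = rng.standard_normal(n)

bound = 1.96 / np.sqrt(n)  # approximate 95% bound under the white-noise null
r = sample_acf(noise, nlags=50)
crossings = [k for k in range(1, 51) if abs(r[k]) > bound]
print(f"bound = {bound:.3f}, lags outside the band: {crossings}")
```

Any lags listed here are false positives by construction, which is why a lone spike deserves less weight than a spike at a lag you had reason to expect.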

Common patterns and what they mean

Gradual decay in ACF + sharp cutoff in PACF after lag p → AR(p) process. The series is best modeled with p autoregressive terms.

Sharp cutoff in ACF after lag q + gradual decay in PACF → MA(q) process. The series is best modeled with q moving average terms.

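Both decay gradually → ARMA(p, q) process. Need both AR and MA terms. The orders cannot be read directly from the plots in this case, so fit a few small candidates and compare them.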

Significant spikes at seasonal lags (7, 14, 21… or 12, 24, 36…) → Seasonal pattern present. Consider SARIMA or seasonal decomposition.

All lags significant and slowly decaying → Series is likely non-stationary. Difference first, then re-examine.
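The first two patterns can be reproduced on simulated data. The sketch below generates an AR(1) and an MA(1) series from the same noise and compares their sample ACFs. The coefficients (0.8 and 0.6), the seed, and the helper `sample_acf` are illustrative assumptions.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation at lags 0..nlags (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[k:], x[:len(x) - k]) / denom
                     for k in range(nlags + 1)])

rng = np.random.default_rng(2)
n = 3000
e = rng.standard_normal(n)

# AR(1): y_t = 0.8 * y_{t-1} + e_t  -> ACF decays gradually
ar = np.zeros(n)
for t in range(1, n):
    ar[t] = 0.8 * ar[t - 1] + e[t]

# MA(1): y_t = e_t + 0.6 * e_{t-1}  -> ACF cuts off after lag 1
ma = e.copy()
ma[1:] += 0.6 * e[:-1]

ar_acf = sample_acf(ar, 4)
ma_acf = sample_acf(ma, 4)
print("AR(1) ACF:", np.round(ar_acf[1:], 2))
print("MA(1) ACF:", np.round(ma_acf[1:], 2))
```

The theoretical lag-1 autocorrelation of this MA(1) process is 0.6 / (1 + 0.6²) ≈ 0.44, and it is zero at every higher lag, whereas the AR(1) values shrink by a factor of about 0.8 per lag.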

Computing autocorrelation numerically

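import numpy as np
from statsmodels.tsa.stattools import acf, pacf

# Compute ACF values with 95% confidence intervals (alpha=0.05)
acf_values, acf_confint = acf(series.dropna(), nlags=40, alpha=0.05)

# Compute PACF values with 95% confidence intervals
pacf_values, pacf_confint = pacf(series.dropna(), nlags=40, alpha=0.05)

# Find significant lags. The intervals are centered on the estimates,
# so a lag is significant when |acf| exceeds the interval half-width.
ci_half_width = (acf_confint[:, 1] - acf_confint[:, 0]) / 2
significant_lags = np.where(np.abs(acf_values) > ci_half_width)[0]
# Lag 0 is always exactly 1, so drop it from the result
significant_lags = significant_lags[significant_lags > 0]
print(f"Significant ACF lags: {significant_lags}")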

Using autocorrelation for model selection

A practical decision tree:

1. Plot ACF of the raw series. If it decays very slowly, the series is non-stationary. Difference it.
2. After differencing, plot ACF and PACF. Use the patterns above to identify candidate ARIMA orders.
3. Check seasonal lags. Spikes at multiples of the seasonal period indicate seasonal terms are needed.
4. Fit candidates and compare. Use AIC to choose among the models your ACF/PACF analysis suggested.

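# Quick model identification helper (relies on np, acf, and pacf imported earlier)
def identify_arima_order(series, max_lag=30):
    """Suggest rough ARIMA orders from ACF/PACF analysis.

    This is a crude heuristic for generating candidates, not a substitute
    for fitting models and comparing them.
    """
    clean = series.dropna()
    acf_vals = acf(clean, nlags=max_lag)
    pacf_vals = pacf(clean, nlags=max_lag)

    # Approximate 95% significance bound under a white-noise null
    bound = 1.96 / np.sqrt(len(clean))

    # Skip lag 0, which is always 1
    sig_acf = [i for i in range(1, max_lag + 1) if abs(acf_vals[i]) > bound]
    sig_pacf = [i for i in range(1, max_lag + 1) if abs(pacf_vals[i]) > bound]

    return {
        "significant_acf_lags": sig_acf[:5],
        "significant_pacf_lags": sig_pacf[:5],
        # Largest of the first three significant lags; a seasonal spike
        # among them can inflate the suggestion, so sanity-check the result
        "suggested_p": max(sig_pacf[:3]) if sig_pacf else 0,
        "suggested_q": max(sig_acf[:3]) if sig_acf else 0,
    }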
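The final step of the procedure, fitting candidates and comparing them by AIC, can be sketched without a modeling library if the candidates are restricted to pure AR models fitted by least squares. That restriction is a simplification made here for illustration. A full ARIMA comparison would normally read the AIC from a fitted model in a library such as statsmodels. The helper `ar_aic` and the simulated AR(2) data are assumptions.

```python
import numpy as np

def ar_aic(x, p, max_p):
    """Gaussian AIC (up to a constant) of an AR(p) model fitted by least squares.

    Every candidate is scored on the same observations (t >= max_p)
    so that the AIC values are comparable.
    """
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    target = x[max_p:]
    if p == 0:
        resid = target
    else:
        X = np.column_stack([x[max_p - j:len(x) - j] for j in range(1, p + 1)])
        coef, *_ = np.linalg.lstsq(X, target, rcond=None)
        resid = target - X @ coef
    n_eff = len(target)
    # p AR coefficients plus one variance parameter
    return n_eff * np.log(resid @ resid / n_eff) + 2 * (p + 1)

# Simulated AR(2) series (illustrative data only)
rng = np.random.default_rng(3)
n = 2000
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.5 * y[t - 1] + 0.3 * y[t - 2] + rng.standard_normal()

max_p = 4
aic_by_p = {p: ar_aic(y, p, max_p) for p in range(max_p + 1)}
best_p = min(aic_by_p, key=aic_by_p.get)
print({p: round(v, 1) for p, v in aic_by_p.items()}, "best p:", best_p)
```

The AIC should drop sharply from p = 0 to p = 2 and then flatten. AIC can occasionally prefer a slightly larger order than the true one, which is one reason to keep the candidate set small and guided by the ACF/PACF plots.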

Cross-correlation for two series

When analyzing the relationship between two time series (e.g., advertising spend and sales), use cross-correlation:

from statsmodels.tsa.stattools import ccf

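# ccf(x, y)[k] is the correlation of x_{t+k} with y_t, for k = 0, 1, 2, ...
# Passing sales first asks: is ad_spend at time t related to sales k periods later?
cross_corr = ccf(sales, ad_spend, adjusted=False)

# A peak at lag k > 0: ad_spend leads sales by k periods
# Only non-negative lags are returned; swap the arguments to test
# whether sales leads ad_spend instead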

This reveals lead-lag relationships — does increased advertising precede higher sales, and by how many periods?
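To make the lead-lag idea explicit, the sketch below computes the correlation between x at time t and y at time t + k directly with NumPy and looks for the lag with the strongest value. The data are synthetic: the "sales" series is constructed to follow "ad_spend" with a 3-period delay. The helper name `lagged_corr`, the delay, and the noise level are all illustrative assumptions.

```python
import numpy as np

def lagged_corr(x, y, k):
    """Correlation of x_t with y_{t+k} for k >= 0 (hypothetical helper)."""
    n = len(x)
    return np.corrcoef(x[:n - k], y[k:])[0, 1]

rng = np.random.default_rng(5)
n = 500
ad_spend = rng.standard_normal(n)
noise = 0.5 * rng.standard_normal(n)

# Synthetic sales respond to ad_spend three periods later
sales = noise.copy()
sales[3:] += ad_spend[:-3]

corrs = [lagged_corr(ad_spend, sales, k) for k in range(0, 11)]
best_lag = int(np.argmax(np.abs(corrs)))
print("lag with strongest correlation:", best_lag)
```

Because the helper spells out which series is shifted, the direction of the relationship is unambiguous: a peak at lag k means the first argument leads the second by k periods.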

Common misconception

Many people assume that high autocorrelation at lag 1 means the series is “predictable.” It does not. A random walk has lag-1 autocorrelation close to 1 in levels, yet its changes are completely unpredictable: the best forecast of the next value is simply the current value. What matters is autocorrelation in the stationary (differenced) series. That is where predictable patterns live.
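A short simulation shows the gap between the two views. The sketch below builds a random walk from independent steps and compares lag-1 autocorrelation in levels and in first differences. The helper `lag1_autocorr`, the series length, and the seed are illustrative assumptions.

```python
import numpy as np

def lag1_autocorr(x):
    """Sample autocorrelation at lag 1 (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.dot(x[1:], x[:-1]) / np.dot(x, x)

rng = np.random.default_rng(4)
steps = rng.standard_normal(2000)  # independent, unpredictable increments
walk = np.cumsum(steps)            # random walk in levels

r_levels = lag1_autocorr(walk)
r_diff = lag1_autocorr(np.diff(walk))
print(f"levels: {r_levels:.3f}, differences: {r_diff:.3f}")
```

The levels look highly autocorrelated only because each value carries the entire accumulated history. Once differenced, nothing remains to forecast.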

The one thing to remember: ACF and PACF plots are the diagnostic X-ray of time series analysis — they reveal the memory structure, seasonal patterns, and stationarity issues in your data, and reading them correctly is the skill that separates guess-and-check modeling from informed model selection.
