Statsmodels for Regression — Deep Dive

Master Statsmodels OLS, GLMs, diagnostics, heteroscedasticity corrections, and advanced regression techniques for rigorous statistical analysis.

OLS regression in practice

Formula API vs. array API

Statsmodels provides two interfaces. The formula API (using R-style formulas) is more readable; the array API gives more control.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Generate sample data
np.random.seed(42)
n = 200
data = pd.DataFrame({
    "temperature": np.random.uniform(15, 40, n),
    "weekend": np.random.binomial(1, 0.3, n),
    "advertising": np.random.uniform(0, 500, n),
})
data["sales"] = (
    50 + 3.5 * data["temperature"] + 20 * data["weekend"]
    + 0.1 * data["advertising"] + np.random.normal(0, 15, n)
)

# Formula API — readable, automatic intercept
model_formula = smf.ols("sales ~ temperature + weekend + advertising", data=data)
results_formula = model_formula.fit()
print(results_formula.summary())

# Array API — explicit, more control
X = data[["temperature", "weekend", "advertising"]]
X = sm.add_constant(X)  # Must add intercept manually
y = data["sales"]

model_array = sm.OLS(y, X)
results_array = model_array.fit()

Extracting results programmatically

# Coefficients
print(results_formula.params)

# P-values
print(results_formula.pvalues)

# Confidence intervals
print(results_formula.conf_int(alpha=0.05))

# R-squared and adjusted R-squared
print(f"R²: {results_formula.rsquared:.4f}")
print(f"Adj R²: {results_formula.rsquared_adj:.4f}")

# F-statistic
print(f"F-statistic: {results_formula.fvalue:.2f}")
print(f"F p-value: {results_formula.f_pvalue:.4e}")

# AIC and BIC for model comparison
print(f"AIC: {results_formula.aic:.1f}")
print(f"BIC: {results_formula.bic:.1f}")

Interaction terms and polynomial regression

import statsmodels.formula.api as smf

# Interaction: does the effect of temperature depend on weekday/weekend?
model_interact = smf.ols(
    "sales ~ temperature * weekend + advertising", data=data
).fit()

print(model_interact.summary())
# The temperature:weekend coefficient tells you how much the temperature
# effect changes on weekends vs. weekdays

# Polynomial: capturing nonlinear relationships
data["temp_sq"] = data["temperature"] ** 2

model_poly = smf.ols(
    "sales ~ temperature + temp_sq + weekend + advertising", data=data
).fit()

# Or using the formula shorthand (I() for arithmetic in formulas)
model_poly2 = smf.ols(
    "sales ~ temperature + I(temperature**2) + weekend + advertising", data=data
).fit()

Diagnostic checks

Residual analysis

import matplotlib.pyplot as plt
import statsmodels.api as sm

results = results_formula

# 1. Residuals vs. fitted (check for patterns)
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

axes[0, 0].scatter(results.fittedvalues, results.resid, alpha=0.5, s=20)
axes[0, 0].axhline(y=0, color="red", linestyle="--")
axes[0, 0].set_xlabel("Fitted Values")
axes[0, 0].set_ylabel("Residuals")
axes[0, 0].set_title("Residuals vs Fitted")

# 2. Q-Q plot (check normality of residuals)
sm.qqplot(results.resid, line="45", ax=axes[0, 1])
axes[0, 1].set_title("Q-Q Plot")

# 3. Scale-location plot (check homoscedasticity)
standardized_resid = results.get_influence().resid_studentized_internal
axes[1, 0].scatter(results.fittedvalues, np.sqrt(np.abs(standardized_resid)), alpha=0.5, s=20)
axes[1, 0].set_xlabel("Fitted Values")
axes[1, 0].set_ylabel("√|Standardized Residuals|")
axes[1, 0].set_title("Scale-Location")

# 4. Residuals vs. leverage (identify influential points)
sm.graphics.influence_plot(results, ax=axes[1, 1], criterion="cooks")
axes[1, 1].set_title("Residuals vs Leverage")

plt.tight_layout()
plt.savefig("regression_diagnostics.png", dpi=150)

Formal diagnostic tests

from statsmodels.stats.diagnostic import (
    het_breuschpagan,
    het_white,
    acorr_breusch_godfrey,
)
from statsmodels.stats.stattools import durbin_watson

# Breusch-Pagan test for heteroscedasticity
bp_stat, bp_p, _, _ = het_breuschpagan(results.resid, results.model.exog)
print(f"Breusch-Pagan p-value: {bp_p:.4f}")
# p < 0.05 → heteroscedasticity detected

# White's test (more general)
white_stat, white_p, _, _ = het_white(results.resid, results.model.exog)
print(f"White test p-value: {white_p:.4f}")

# Durbin-Watson for autocorrelation
dw = durbin_watson(results.resid)
print(f"Durbin-Watson: {dw:.3f}")
# ≈ 2.0 → no autocorrelation
# < 1.5 → positive autocorrelation
# > 2.5 → negative autocorrelation

Handling heteroscedasticity

When variance of residuals is not constant, standard errors are biased. Two approaches:

Robust standard errors (HC)

import statsmodels.formula.api as smf

model = smf.ols("sales ~ temperature + weekend + advertising", data=data)

# HC3 robust standard errors (recommended for small samples)
results_robust = model.fit(cov_type="HC3")
print(results_robust.summary())

# Compare standard vs. robust standard errors
results_normal = model.fit()
comparison = pd.DataFrame({
    "Normal SE": results_normal.bse,
    "Robust SE": results_robust.bse,
    "Normal p": results_normal.pvalues,
    "Robust p": results_robust.pvalues,
})
print(comparison)

Weighted Least Squares (WLS)

import statsmodels.api as sm

# If you know (or estimate) the variance function
# Example: variance proportional to temperature
weights = 1.0 / data["temperature"]

X = sm.add_constant(data[["temperature", "weekend", "advertising"]])
model_wls = sm.WLS(data["sales"], X, weights=weights)
results_wls = model_wls.fit()
print(results_wls.summary())

Generalized Linear Models

Logistic regression

import statsmodels.formula.api as smf
import numpy as np
import pandas as pd

# Binary outcome: purchase or not
np.random.seed(42)
n = 500
purchase_data = pd.DataFrame({
    "price": np.random.uniform(10, 100, n),
    "age": np.random.uniform(18, 70, n),
    "returning_customer": np.random.binomial(1, 0.4, n),
})

logit_score = -2 + 0.05 * purchase_data["age"] - 0.03 * purchase_data["price"] + 0.8 * purchase_data["returning_customer"]
purchase_data["purchased"] = (np.random.random(n) < 1 / (1 + np.exp(-logit_score))).astype(int)

logit_model = smf.logit(
    "purchased ~ price + age + returning_customer", data=purchase_data
).fit()

print(logit_model.summary())

# Odds ratios (exponentiate coefficients)
odds_ratios = np.exp(logit_model.params)
print("\nOdds Ratios:")
print(odds_ratios)
# OR > 1: increases odds of purchase
# OR < 1: decreases odds of purchase

Poisson regression

import statsmodels.formula.api as smf
import numpy as np
import pandas as pd

np.random.seed(42)
n = 300
count_data = pd.DataFrame({
    "traffic_volume": np.random.uniform(1000, 10000, n),
    "speed_limit": np.random.choice([30, 50, 70], n),
    "has_signal": np.random.binomial(1, 0.5, n),
})

rate = np.exp(-3 + 0.0003 * count_data["traffic_volume"]
              + 0.02 * count_data["speed_limit"]
              - 0.5 * count_data["has_signal"])
count_data["accidents"] = np.random.poisson(rate)

poisson_model = smf.poisson(
    "accidents ~ traffic_volume + speed_limit + has_signal", data=count_data
).fit()

print(poisson_model.summary())

# Incidence rate ratios
irr = np.exp(poisson_model.params)
print("\nIncidence Rate Ratios:")
print(irr)

Time series regression

import statsmodels.api as sm
import pandas as pd
import numpy as np

# ARIMA-style analysis
np.random.seed(42)
dates = pd.date_range("2020-01-01", periods=365, freq="D")
trend = np.arange(365) * 0.1
seasonal = 10 * np.sin(2 * np.pi * np.arange(365) / 365)
noise = np.random.normal(0, 3, 365)
ts = pd.Series(50 + trend + seasonal + noise, index=dates)

# Decompose
decomposition = sm.tsa.seasonal_decompose(ts, period=30)

# OLS with Newey-West standard errors (corrects for autocorrelation)
X = sm.add_constant(np.arange(len(ts)))
model = sm.OLS(ts.values, X)
results_nw = model.fit(cov_type="HAC", cov_kwds={"maxlags": 10})
print(results_nw.summary())

Model comparison and selection

import statsmodels.formula.api as smf

# Nested model comparison using likelihood ratio test
model_full = smf.ols("sales ~ temperature + weekend + advertising", data=data).fit()
model_reduced = smf.ols("sales ~ temperature", data=data).fit()

# Likelihood ratio test
from scipy import stats

lr_stat = -2 * (model_reduced.llf - model_full.llf)
df_diff = model_full.df_model - model_reduced.df_model
p_value = 1 - stats.chi2.cdf(lr_stat, df_diff)
print(f"LR statistic: {lr_stat:.2f}, p-value: {p_value:.4e}")

# AIC/BIC comparison
models = {
    "temperature only": smf.ols("sales ~ temperature", data=data).fit(),
    "temp + weekend": smf.ols("sales ~ temperature + weekend", data=data).fit(),
    "full": smf.ols("sales ~ temperature + weekend + advertising", data=data).fit(),
}

comparison = pd.DataFrame({
    name: {"AIC": m.aic, "BIC": m.bic, "R²": m.rsquared_adj}
    for name, m in models.items()
}).T
print(comparison)

Multicollinearity detection

from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

X = sm.add_constant(data[["temperature", "weekend", "advertising"]])

vif_data = pd.DataFrame({
    "Variable": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif_data)
# VIF > 5: moderate multicollinearity
# VIF > 10: severe — consider dropping or combining variables

Performance patterns

Dataset size	Notes
< 10k rows	All methods fast, use freely
10k - 100k	OLS still fast; GLMs may take seconds
100k - 1M	Use array API over formula API; formula parsing has overhead
> 1M	Consider regularized models or switch to scikit-learn for speed

Large dataset optimization

# For very large datasets, the formula API adds parsing overhead
# Use the array API directly
X = data[["temperature", "weekend", "advertising"]].values
X = np.column_stack([np.ones(len(X)), X])  # Add intercept
y = data["sales"].values

model = sm.OLS(y, X)
results = model.fit()

Common pitfalls

Forgetting to add a constant. The array API does not include an intercept by default. Always use sm.add_constant() unless you deliberately want to force the line through the origin.
Interpreting R-squared as model quality. A low R-squared does not mean the model is bad — it means the outcome has high natural variability. A high R-squared does not mean the model is correct — it might be overfitting or omitting a confounding variable.
Using p-values for prediction. P-values answer “is this effect real?” not “is this prediction accurate?” For prediction, use cross-validation and out-of-sample metrics.
Ignoring multicollinearity. When predictors are highly correlated, individual coefficients become unstable even if the model as a whole fits well. Check VIF before interpreting individual coefficients.

The one thing to remember: Statsmodels is Python’s answer to R’s regression capabilities — it provides not just the model fit but the full inferential toolkit (p-values, diagnostics, robust standard errors) that separates exploratory analysis from rigorous statistical conclusions.

pythonstatisticsdata-science