Statsmodels for Regression — Deep Dive

OLS regression in practice

Formula API vs. array API

Statsmodels provides two interfaces. The formula API (using R-style formulas) is more readable; the array API gives more control.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Generate sample data
np.random.seed(42)
n = 200
data = pd.DataFrame({
    "temperature": np.random.uniform(15, 40, n),
    "weekend": np.random.binomial(1, 0.3, n),
    "advertising": np.random.uniform(0, 500, n),
})
data["sales"] = (
    50 + 3.5 * data["temperature"] + 20 * data["weekend"]
    + 0.1 * data["advertising"] + np.random.normal(0, 15, n)
)

# Formula API — readable, automatic intercept
model_formula = smf.ols("sales ~ temperature + weekend + advertising", data=data)
results_formula = model_formula.fit()
print(results_formula.summary())

# Array API — explicit, more control
X = data[["temperature", "weekend", "advertising"]]
X = sm.add_constant(X)  # Must add intercept manually
y = data["sales"]

model_array = sm.OLS(y, X)
results_array = model_array.fit()

Extracting results programmatically

# Coefficients
print(results_formula.params)

# P-values
print(results_formula.pvalues)

# Confidence intervals
print(results_formula.conf_int(alpha=0.05))

# R-squared and adjusted R-squared
print(f"R²: {results_formula.rsquared:.4f}")
print(f"Adj R²: {results_formula.rsquared_adj:.4f}")

# F-statistic
print(f"F-statistic: {results_formula.fvalue:.2f}")
print(f"F p-value: {results_formula.f_pvalue:.4e}")

# AIC and BIC for model comparison
print(f"AIC: {results_formula.aic:.1f}")
print(f"BIC: {results_formula.bic:.1f}")

Interaction terms and polynomial regression

import statsmodels.formula.api as smf

# Interaction: does the effect of temperature depend on weekday/weekend?
model_interact = smf.ols(
    "sales ~ temperature * weekend + advertising", data=data
).fit()

print(model_interact.summary())
# The temperature:weekend coefficient tells you how much the temperature
# effect changes on weekends vs. weekdays

# Polynomial: capturing nonlinear relationships
data["temp_sq"] = data["temperature"] ** 2

model_poly = smf.ols(
    "sales ~ temperature + temp_sq + weekend + advertising", data=data
).fit()

# Or using the formula shorthand (I() for arithmetic in formulas)
model_poly2 = smf.ols(
    "sales ~ temperature + I(temperature**2) + weekend + advertising", data=data
).fit()

Diagnostic checks

Residual analysis

import matplotlib.pyplot as plt
import statsmodels.api as sm

results = results_formula

# 1. Residuals vs. fitted (check for patterns)
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

axes[0, 0].scatter(results.fittedvalues, results.resid, alpha=0.5, s=20)
axes[0, 0].axhline(y=0, color="red", linestyle="--")
axes[0, 0].set_xlabel("Fitted Values")
axes[0, 0].set_ylabel("Residuals")
axes[0, 0].set_title("Residuals vs Fitted")

# 2. Q-Q plot (check normality of residuals)
sm.qqplot(results.resid, line="45", ax=axes[0, 1])
axes[0, 1].set_title("Q-Q Plot")

# 3. Scale-location plot (check homoscedasticity)
standardized_resid = results.get_influence().resid_studentized_internal
axes[1, 0].scatter(results.fittedvalues, np.sqrt(np.abs(standardized_resid)), alpha=0.5, s=20)
axes[1, 0].set_xlabel("Fitted Values")
axes[1, 0].set_ylabel("√|Standardized Residuals|")
axes[1, 0].set_title("Scale-Location")

# 4. Residuals vs. leverage (identify influential points)
sm.graphics.influence_plot(results, ax=axes[1, 1], criterion="cooks")
axes[1, 1].set_title("Residuals vs Leverage")

plt.tight_layout()
plt.savefig("regression_diagnostics.png", dpi=150)

Formal diagnostic tests

from statsmodels.stats.diagnostic import (
    het_breuschpagan,
    het_white,
    acorr_breusch_godfrey,
)
from statsmodels.stats.stattools import durbin_watson

# Breusch-Pagan test for heteroscedasticity
bp_stat, bp_p, _, _ = het_breuschpagan(results.resid, results.model.exog)
print(f"Breusch-Pagan p-value: {bp_p:.4f}")
# p < 0.05 → heteroscedasticity detected

# White's test (more general)
white_stat, white_p, _, _ = het_white(results.resid, results.model.exog)
print(f"White test p-value: {white_p:.4f}")

# Durbin-Watson for autocorrelation
dw = durbin_watson(results.resid)
print(f"Durbin-Watson: {dw:.3f}")
# ≈ 2.0 → no autocorrelation
# < 1.5 → positive autocorrelation
# > 2.5 → negative autocorrelation

Handling heteroscedasticity

When variance of residuals is not constant, standard errors are biased. Two approaches:

Robust standard errors (HC)

import statsmodels.formula.api as smf

model = smf.ols("sales ~ temperature + weekend + advertising", data=data)

# HC3 robust standard errors (recommended for small samples)
results_robust = model.fit(cov_type="HC3")
print(results_robust.summary())

# Compare standard vs. robust standard errors
results_normal = model.fit()
comparison = pd.DataFrame({
    "Normal SE": results_normal.bse,
    "Robust SE": results_robust.bse,
    "Normal p": results_normal.pvalues,
    "Robust p": results_robust.pvalues,
})
print(comparison)

Weighted Least Squares (WLS)

import statsmodels.api as sm

# If you know (or estimate) the variance function
# Example: variance proportional to temperature
weights = 1.0 / data["temperature"]

X = sm.add_constant(data[["temperature", "weekend", "advertising"]])
model_wls = sm.WLS(data["sales"], X, weights=weights)
results_wls = model_wls.fit()
print(results_wls.summary())

Generalized Linear Models

Logistic regression

import statsmodels.formula.api as smf
import numpy as np
import pandas as pd

# Binary outcome: purchase or not
np.random.seed(42)
n = 500
purchase_data = pd.DataFrame({
    "price": np.random.uniform(10, 100, n),
    "age": np.random.uniform(18, 70, n),
    "returning_customer": np.random.binomial(1, 0.4, n),
})

logit_score = -2 + 0.05 * purchase_data["age"] - 0.03 * purchase_data["price"] + 0.8 * purchase_data["returning_customer"]
purchase_data["purchased"] = (np.random.random(n) < 1 / (1 + np.exp(-logit_score))).astype(int)

logit_model = smf.logit(
    "purchased ~ price + age + returning_customer", data=purchase_data
).fit()

print(logit_model.summary())

# Odds ratios (exponentiate coefficients)
odds_ratios = np.exp(logit_model.params)
print("\nOdds Ratios:")
print(odds_ratios)
# OR > 1: increases odds of purchase
# OR < 1: decreases odds of purchase

Poisson regression

import statsmodels.formula.api as smf
import numpy as np
import pandas as pd

np.random.seed(42)
n = 300
count_data = pd.DataFrame({
    "traffic_volume": np.random.uniform(1000, 10000, n),
    "speed_limit": np.random.choice([30, 50, 70], n),
    "has_signal": np.random.binomial(1, 0.5, n),
})

rate = np.exp(-3 + 0.0003 * count_data["traffic_volume"]
              + 0.02 * count_data["speed_limit"]
              - 0.5 * count_data["has_signal"])
count_data["accidents"] = np.random.poisson(rate)

poisson_model = smf.poisson(
    "accidents ~ traffic_volume + speed_limit + has_signal", data=count_data
).fit()

print(poisson_model.summary())

# Incidence rate ratios
irr = np.exp(poisson_model.params)
print("\nIncidence Rate Ratios:")
print(irr)

Time series regression

import statsmodels.api as sm
import pandas as pd
import numpy as np

# ARIMA-style analysis
np.random.seed(42)
dates = pd.date_range("2020-01-01", periods=365, freq="D")
trend = np.arange(365) * 0.1
seasonal = 10 * np.sin(2 * np.pi * np.arange(365) / 365)
noise = np.random.normal(0, 3, 365)
ts = pd.Series(50 + trend + seasonal + noise, index=dates)

# Decompose
decomposition = sm.tsa.seasonal_decompose(ts, period=30)

# OLS with Newey-West standard errors (corrects for autocorrelation)
X = sm.add_constant(np.arange(len(ts)))
model = sm.OLS(ts.values, X)
results_nw = model.fit(cov_type="HAC", cov_kwds={"maxlags": 10})
print(results_nw.summary())

Model comparison and selection

import statsmodels.formula.api as smf

# Nested model comparison using likelihood ratio test
model_full = smf.ols("sales ~ temperature + weekend + advertising", data=data).fit()
model_reduced = smf.ols("sales ~ temperature", data=data).fit()

# Likelihood ratio test
from scipy import stats

lr_stat = -2 * (model_reduced.llf - model_full.llf)
df_diff = model_full.df_model - model_reduced.df_model
p_value = 1 - stats.chi2.cdf(lr_stat, df_diff)
print(f"LR statistic: {lr_stat:.2f}, p-value: {p_value:.4e}")

# AIC/BIC comparison
models = {
    "temperature only": smf.ols("sales ~ temperature", data=data).fit(),
    "temp + weekend": smf.ols("sales ~ temperature + weekend", data=data).fit(),
    "full": smf.ols("sales ~ temperature + weekend + advertising", data=data).fit(),
}

comparison = pd.DataFrame({
    name: {"AIC": m.aic, "BIC": m.bic, "R²": m.rsquared_adj}
    for name, m in models.items()
}).T
print(comparison)

Multicollinearity detection

from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm

X = sm.add_constant(data[["temperature", "weekend", "advertising"]])

vif_data = pd.DataFrame({
    "Variable": X.columns,
    "VIF": [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
})
print(vif_data)
# VIF > 5: moderate multicollinearity
# VIF > 10: severe — consider dropping or combining variables

Performance patterns

Dataset sizeNotes
< 10k rowsAll methods fast, use freely
10k - 100kOLS still fast; GLMs may take seconds
100k - 1MUse array API over formula API; formula parsing has overhead
> 1MConsider regularized models or switch to scikit-learn for speed

Large dataset optimization

# For very large datasets, the formula API adds parsing overhead
# Use the array API directly
X = data[["temperature", "weekend", "advertising"]].values
X = np.column_stack([np.ones(len(X)), X])  # Add intercept
y = data["sales"].values

model = sm.OLS(y, X)
results = model.fit()

Common pitfalls

  1. Forgetting to add a constant. The array API does not include an intercept by default. Always use sm.add_constant() unless you deliberately want to force the line through the origin.
  2. Interpreting R-squared as model quality. A low R-squared does not mean the model is bad — it means the outcome has high natural variability. A high R-squared does not mean the model is correct — it might be overfitting or omitting a confounding variable.
  3. Using p-values for prediction. P-values answer “is this effect real?” not “is this prediction accurate?” For prediction, use cross-validation and out-of-sample metrics.
  4. Ignoring multicollinearity. When predictors are highly correlated, individual coefficients become unstable even if the model as a whole fits well. Check VIF before interpreting individual coefficients.

The one thing to remember: Statsmodels is Python’s answer to R’s regression capabilities — it provides not just the model fit but the full inferential toolkit (p-values, diagnostics, robust standard errors) that separates exploratory analysis from rigorous statistical conclusions.

pythonstatisticsdata-science

See Also