Statsmodels for Regression — Core Concepts

Understand how Statsmodels fits OLS regressions, and how to interpret the coefficients, evaluate model fit, and draw statistical inferences from the results.

Why Statsmodels matters

Python’s machine learning ecosystem (scikit-learn, XGBoost, PyTorch) excels at prediction — maximizing accuracy on new data. But prediction is not the only goal. Researchers, economists, and analysts often need to understand relationships: Does education affect income? Does advertising spend drive sales? Does a policy change reduce crime?

These questions require statistical inference — p-values, confidence intervals, hypothesis tests, and diagnostic checks. Statsmodels is the Python library built for this purpose. It provides the same statistical machinery available in R or Stata, wrapped in a Pythonic API that integrates with pandas and NumPy.

Core regression types

Ordinary Least Squares (OLS)

OLS finds the line (or hyperplane) that minimizes the sum of squared differences between predicted and actual values. It is the most common regression method and the starting point for nearly all regression analysis.

Key outputs:

Coefficients — how much the outcome is expected to change for a one-unit change in each predictor, holding the other predictors fixed
Standard errors — uncertainty around each coefficient
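P-values — probability of seeing an estimate at least this far from zero if the true coefficient were zero (no effect); this is not the probability that the coefficient is zero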
R-squared — fraction of variance explained by the model (0 to 1)
Confidence intervals — range of plausible values for each coefficient

Generalized Linear Models (GLM)

When the outcome is not continuous — counts, binary yes/no, proportions — OLS assumptions break down. GLMs extend regression to handle:

Logistic regression — binary outcomes (purchase/no purchase)
Poisson regression — count data (number of accidents per month)
Negative binomial — overdispersed count data

Robust regression

When data contains outliers, OLS estimates get pulled toward them. Robust regression methods (like RLM in Statsmodels) downweight outliers automatically, giving more reliable estimates with dirty data.

Reading a regression summary

Statsmodels produces a summary table similar to R or Stata output. Here is what the key numbers mean:

R-squared — 0.75 means the model explains 75% of the variation. Higher is better, but a high R-squared does not guarantee the model is correct.
Adj. R-squared — R-squared adjusted for the number of predictors. Penalizes adding useless variables.
F-statistic — tests whether the model as a whole is significant. A low p-value (shown as Prob (F-statistic)) is evidence that at least one predictor has a nonzero coefficient.
coef — the coefficient value. For a predictor “temperature,” a coef of 5.2 means each degree increase is associated with 5.2 more units of the outcome.
P>|t| — if below 0.05, the coefficient is statistically significant at the 5% level.
[0.025, 0.975] — 95% confidence interval for the coefficient.
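
The numbers in the summary table are also available as attributes, which helps when they need to go into a report. This sketch uses the formula interface so rows are labelled by name; the `temperature` and `sales` columns and the simulated relationship are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: sales rise by about 5.2 units per degree
rng = np.random.default_rng(3)
temperature = rng.uniform(10, 35, size=150)
sales = 20.0 + 5.2 * temperature + rng.normal(scale=15.0, size=150)
df = pd.DataFrame({"temperature": temperature, "sales": sales})

# The formula interface adds the intercept automatically
results = smf.ols("sales ~ temperature", data=df).fit()
print(results.summary())  # the full table

r2 = results.rsquared                      # R-squared
adj_r2 = results.rsquared_adj              # Adj. R-squared
f_stat = results.fvalue                    # F-statistic
f_pvalue = results.f_pvalue                # Prob (F-statistic)
coef = results.params["temperature"]       # coef
p_val = results.pvalues["temperature"]     # P>|t|
ci_low, ci_high = results.conf_int().loc["temperature"]  # [0.025, 0.975]
```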

OLS vs. scikit-learn LinearRegression

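Both fit the same mathematical model, provided the intercept is handled the same way: scikit-learn fits one by default, while `sm.OLS` needs an explicit constant column (for example via `sm.add_constant`) unless you use the formula interface. The difference is in what they give you back: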

Feature	Statsmodels OLS	scikit-learn LinearRegression
Coefficients	Yes	Yes
P-values	Yes	No
Confidence intervals	Yes	No
R-squared	Yes	Yes (via `.score()`)
Residual diagnostics	Yes	No
Prediction	Yes	Yes
Pipeline integration	Limited	Excellent

Use Statsmodels when understanding why matters. Use scikit-learn when prediction accuracy on new data is the primary goal.

A typical workflow

Explore — Plot the data. Check distributions, outliers, and potential nonlinear relationships.
Specify — Choose dependent and independent variables. Consider transformations (log, polynomial).
Fit — Run the regression and examine the summary.
Diagnose — Check residuals for normality, homoscedasticity, and independence. Look for influential observations.
Refine — Remove non-significant variables, add interaction terms, or switch to a different model if diagnostics reveal problems.
Report — Present coefficients, confidence intervals, and model fit statistics.

Common misconception

Many people equate a statistically significant coefficient with a large or important effect. Statistical significance only means the observed data would be unlikely if the true effect were zero; it says nothing about the size of the effect. With enough data, a coefficient can be too small to matter in practice and still be highly significant. Always report effect sizes alongside p-values.

The one thing to remember: Statsmodels bridges the gap between fitting a model and understanding it — providing the p-values, confidence intervals, and diagnostic tools that let you defend your conclusions with statistical rigor.
