Causal Inference — Core Concepts

Potential outcomes framework, confounder adjustment, difference-in-differences, instrumental variables, and why causal thinking is transforming how data scientists measure impact.

The Potential Outcomes Framework

Donald Rubin’s potential outcomes framework (1974) formalizes what we mean by “causal effect.”

For each unit $i$ (a person, a company, a city):

$Y_i(1)$: the outcome if they receive treatment
$Y_i(0)$: the outcome if they don’t receive treatment
$\tau_i = Y_i(1) - Y_i(0)$: the individual treatment effect

The fundamental problem of causal inference: you can never observe both $Y_i(1)$ and $Y_i(0)$ for the same unit. You either treat someone or you don’t. The counterfactual is never observed.

Average Treatment Effect (ATE): $\tau_{ATE} = E[Y(1) - Y(0)]$ — the average causal effect across the population.

ATT (Average Treatment Effect on the Treated): $E[Y(1) - Y(0) | T=1]$ — the effect only for those who actually received treatment. Often more relevant than ATE for policy questions.
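To make the definitions above concrete, here is a minimal simulation sketch. It assumes we can see both potential outcomes for every unit, which is only possible in simulated data; the distributions, effect sizes, and selection rule are all invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical potential outcomes; in real data only one of these is ever seen.
y0 = rng.normal(10.0, 2.0, size=n)        # Y_i(0)
tau = 1.0 + 0.5 * (y0 > 10.0)             # individual effect tau_i (1.0 or 1.5)
y1 = y0 + tau                             # Y_i(1)

# Assumed selection rule: units with the larger effect are more likely to be treated.
p_treat = np.where(tau > 1.0, 0.8, 0.2)
t = rng.random(n) < p_treat

# The analyst only ever observes one outcome per unit.
y_obs = np.where(t, y1, y0)

ate = (y1 - y0).mean()       # average over everyone
att = (y1 - y0)[t].mean()    # average over the treated only

print(f"ATE = {ate:.2f}, ATT = {att:.2f}")
```

Because treatment uptake is related to the size of the effect in this setup, the ATT comes out larger than the ATE.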

Confounders: Why Observational Studies Go Wrong

A confounder is a variable that causally affects both the treatment and the outcome. Failing to account for it produces biased causal estimates.

Classic example — does education cause higher earnings?

People with more education tend to earn more. But people with more education also tend to come from wealthier families (better resources, networks, expectations). Wealth affects both education access (treatment) and earnings (outcome). Wealth is a confounder.

A naive comparison (educated vs. not educated) conflates the causal effect of education with selection bias from family wealth.
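One way to see exactly what the naive comparison measures is to add and subtract $E[Y(0) | T=1]$, using the fact that observed outcomes equal $Y(1)$ for the treated and $Y(0)$ for the untreated. The decomposition below is a standard identity written in the notation of this page:

```latex
$$E[Y | T=1] - E[Y | T=0]
  = \underbrace{E[Y(1) - Y(0) | T=1]}_{\text{ATT}}
  + \underbrace{E[Y(0) | T=1] - E[Y(0) | T=0]}_{\text{selection bias}}$$
```

In the education example, the second term is nonzero if people who went on to more education would have earned more even without it, for instance because of family wealth.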

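Causal directed acyclic graphs (DAGs): A DAG is a graph whose arrows represent direct causal relationships. Confounders appear as common causes: nodes with directed paths into both treatment and outcome, which open a non-causal “backdoor” path between the two. Adjusting for these confounders (via regression, matching, or reweighting) blocks those paths and allows the causal effect to be estimated. Variables that are themselves affected by the treatment should not be adjusted for, since doing so can introduce bias.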

Adjustment Methods

Regression adjustment: Include confounders as covariates in a regression model. The coefficient on treatment estimates the causal effect conditional on confounders. $$Y = \alpha + \beta T + \gamma X + \epsilon$$

Valid if: all confounders are measured and included in $X$ (no unobserved confounding), and the functional form of the model is correct.
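The regression above can be sketched with ordinary least squares on simulated data. This is an illustrative setup, not a recipe: the data-generating process, the coefficient values, and the single confounder `x` are all assumptions, and the model is correctly specified by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
true_effect = 2.0                                   # assumed effect of T on Y

x = rng.normal(size=n)                              # confounder (e.g. family wealth)
t = (x + rng.normal(size=n) > 0).astype(float)      # treatment depends on x
y = 1.0 + true_effect * t + 3.0 * x + rng.normal(size=n)

# Naive comparison: difference in mean outcomes, ignoring x.
naive = y[t == 1].mean() - y[t == 0].mean()

# Regression adjustment: OLS of y on [1, T, X]; beta is the coefficient on T.
design = np.column_stack([np.ones(n), t, x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
adjusted = coef[1]

print(f"naive = {naive:.2f}, adjusted = {adjusted:.2f}, truth = {true_effect}")
```

The naive difference absorbs the effect of `x` on the outcome, while the adjusted coefficient lands close to the assumed effect. If `x` were unmeasured, the regression would have no way to correct for it.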

Propensity score matching: The propensity score $e(X) = P(T=1|X)$ is the probability of receiving treatment given covariates. Matching treated units to control units with similar propensity scores creates a “pseudo-randomized” comparison.

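Rosenbaum & Rubin (1983): if all confounders are captured in $X$ and every unit has some chance of either treatment status ($0 < e(X) < 1$), conditioning on $e(X)$ removes the confounding bias due to $X$. A high-dimensional covariate vector (say, 100 confounders) can therefore be balanced using a single one-dimensional score.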
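A minimal sketch of propensity score matching follows, assuming simulated data with two observed confounders. The helper `fit_logistic` is a hypothetical name for a small Newton's-method logistic regression written here to keep the example dependency-free; in practice a library implementation would normally be used. Matching is 1-nearest-neighbour with replacement on the estimated score and targets the ATT.

```python
import numpy as np

def fit_logistic(X, t, iters=25):
    """Logistic regression via Newton's method; X must include an intercept column."""
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (t - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        beta += np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(2)
n = 20_000
x = rng.normal(size=(n, 2))                                  # observed confounders
true_logit = 0.8 * x[:, 0] - 0.5 * x[:, 1]
t = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)
y = 2.0 * t + 2.0 * x[:, 0] - x[:, 1] + rng.normal(size=n)   # assumed true effect: 2.0

# Step 1: estimate the propensity score e(X) = P(T=1 | X).
X = np.column_stack([np.ones(n), x])
e_hat = 1.0 / (1.0 + np.exp(-X @ fit_logistic(X, t)))

# Step 2: for each treated unit, find the control unit with the closest score.
treated = np.flatnonzero(t == 1)
control = np.flatnonzero(t == 0)
order = np.argsort(e_hat[control])
sorted_scores = e_hat[control][order]
pos = np.searchsorted(sorted_scores, e_hat[treated])
left = np.clip(pos - 1, 0, len(control) - 1)
right = np.clip(pos, 0, len(control) - 1)
use_left = (np.abs(sorted_scores[left] - e_hat[treated])
            <= np.abs(sorted_scores[right] - e_hat[treated]))
match = control[order][np.where(use_left, left, right)]

# Step 3: average the treated-minus-matched-control differences.
att_matched = (y[treated] - y[match]).mean()
naive = y[treated].mean() - y[control].mean()

print(f"naive = {naive:.2f}, matched ATT = {att_matched:.2f}")
```

Real analyses usually add a caliper (a maximum allowed score distance) and check covariate balance after matching; both are omitted here for brevity.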

Inverse Probability Weighting (IPW): Weight treated units by $1/e(X)$ and control units by $1/(1-e(X))$. Creates a pseudo-population where treatment is independent of covariates. Doubly robust estimators combine regression and IPW — consistent if either model is correctly specified.
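The weighting scheme above can be sketched as follows. To keep the example short it assumes a single discrete confounder, so the propensity score can be estimated as the treated share within each level; the treatment probabilities and effect size are invented. Only plain IPW is shown, not a doubly robust estimator.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

x = rng.integers(0, 3, size=n)                   # discrete confounder with levels 0, 1, 2
p_true = np.array([0.2, 0.5, 0.8])[x]            # assumed treatment probabilities
t = (rng.random(n) < p_true).astype(float)
y = 1.0 * t + 2.0 * x + rng.normal(size=n)       # assumed true ATE: 1.0

# Estimate e(X) as the share of treated units within each level of x.
e_by_level = np.array([t[x == k].mean() for k in range(3)])
e_hat = e_by_level[x]

# IPW estimate of the ATE: treated weighted by 1/e, controls by 1/(1-e).
ipw_ate = np.mean(t * y / e_hat) - np.mean((1.0 - t) * y / (1.0 - e_hat))
naive = y[t == 1].mean() - y[t == 0].mean()

print(f"naive = {naive:.2f}, IPW = {ipw_ate:.2f}")
```

Weights become very large when `e_hat` is close to 0 or 1, which is why applied work often inspects or trims extreme propensity scores.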

Quasi-Experimental Methods

When randomization isn’t possible but natural variation creates “near-random” assignment:

Difference-in-Differences (DiD)

Compare outcomes before and after a policy change, using a group not affected by the change as a control.

$$\hat{\tau}_{DiD} = (\bar{Y}_{treated, post} - \bar{Y}_{treated, pre}) - (\bar{Y}_{control, post} - \bar{Y}_{control, pre})$$

Classic application: Card & Krueger (1994) studied the effect of New Jersey’s minimum wage increase on fast food employment, using neighboring Pennsylvania (no wage increase) as the control. They found no evidence that the increase reduced employment; if anything, employment in NJ rose slightly relative to PA, challenging the prediction of the standard competitive labor market model.

Key assumption: parallel trends. In the absence of treatment, treated and control groups would have followed the same trend. Pre-treatment trends can be compared as a plausibility check, but the assumption concerns the unobserved post-treatment counterfactual and cannot be tested directly.
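As a sketch of the DiD estimator, the code below computes it two ways on simulated unit-level data: from the four group means, and as the interaction coefficient in a regression of the outcome on group, period, and their product. The data-generating numbers are invented, and parallel trends holds by construction.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4_000
treated = rng.integers(0, 2, size=n)     # 1 = group exposed to the policy
post = rng.integers(0, 2, size=n)        # 1 = observed after the policy change
true_effect = 2.0

# Groups differ in level (+3) and share a common time trend (+1).
y = (5.0 + 3.0 * treated + 1.0 * post
     + true_effect * treated * post + rng.normal(size=n))

def cell_mean(group, period):
    """Mean outcome for one group/period cell."""
    return y[(treated == group) & (post == period)].mean()

# DiD from the four cell means.
did_means = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

# Equivalent regression: the coefficient on treated * post is the DiD estimate.
design = np.column_stack([np.ones(n), treated, post, treated * post])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
did_ols = coef[3]

print(f"DiD (means) = {did_means:.3f}, DiD (OLS) = {did_ols:.3f}")
```

The two numbers agree because the regression is saturated in group and period. The regression form is convenient when adding covariates or computing standard errors.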

Regression Discontinuity Design (RDD)

When treatment is assigned based on a threshold (test score ≥ 70 → scholarship; age ≥ 65 → Medicare), compare outcomes just above and just below the threshold.

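Provided units cannot precisely manipulate which side of the threshold they fall on, those just above and just below it are similar on average in everything except treatment assignment. The size of the discontinuity in outcomes at the threshold then estimates the causal effect for units near the threshold.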

Example: Angrist & Lavy (1999) used this idea to study class size effects in Israel, where a rule capping classes at 40 pupils (Maimonides’ rule) forces schools to split classes when enrollment passes a multiple of 40. The resulting sharp drops in class size around these thresholds let them estimate the effect of smaller classes on test scores.
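A minimal sketch of a sharp regression discontinuity estimate is shown below, using the scholarship threshold of 70 from the text. The data are simulated, the jump size is assumed, and the bandwidth of 10 points is an arbitrary illustrative choice rather than a recommended one.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 20_000
cutoff = 70.0
true_jump = 5.0                                        # assumed effect at the cutoff

score = rng.uniform(40.0, 100.0, size=n)               # running variable (test score)
treated = score >= cutoff                              # sharp assignment rule
y = 20.0 + 0.3 * score + true_jump * treated + rng.normal(0.0, 2.0, size=n)

bandwidth = 10.0                                       # assumed window around the cutoff
centered = score - cutoff
below = (centered < 0) & (centered >= -bandwidth)
above = (centered >= 0) & (centered <= bandwidth)

# Fit a separate line on each side; np.polyfit returns [slope, intercept].
# With the running variable centered, each intercept is the fitted value at the cutoff.
slope_below, intercept_below = np.polyfit(centered[below], y[below], 1)
slope_above, intercept_above = np.polyfit(centered[above], y[above], 1)

rdd_estimate = intercept_above - intercept_below
print(f"estimated jump at cutoff = {rdd_estimate:.2f}")
```

Fitting lines on each side, rather than comparing raw means, avoids attributing the underlying slope in the running variable to the treatment.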

Instrumental Variables (IV)

An instrument $Z$ is a variable that:

Affects treatment $T$ (relevance)
Affects outcome $Y$ only through $T$ (exclusion restriction)
Is as-good-as-randomly assigned (independence)

Classic instrument: Vietnam-era draft lottery numbers as an instrument for military service, used to estimate the effect of service on earnings (Angrist, 1990). The lottery is random (independence), affects the probability of serving (relevance), and arguably affects earnings only through service (exclusion restriction).

IV estimator: $\hat{\tau}_{IV} = \frac{\text{Cov}(Z, Y)}{\text{Cov}(Z, T)} = \frac{\text{Reduced form}}{\text{First stage}}$
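The ratio above can be sketched on simulated data with an unobserved confounder `u`. All coefficients are assumptions, the treatment effect is constant by construction, and the instrument `z` satisfies relevance, exclusion, and independence because the simulation is built that way.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000

u = rng.normal(size=n)                               # unobserved confounder
z = rng.integers(0, 2, size=n).astype(float)         # randomly assigned instrument
t = (1.0 * z + 0.5 * u + rng.normal(size=n) > 0.5).astype(float)
y = 1.5 * t + 2.0 * u + rng.normal(size=n)           # assumed true effect: 1.5

# OLS of y on t is biased because u drives both t and y.
ols = np.cov(t, y)[0, 1] / np.var(t, ddof=1)

# IV: effect of z on y (reduced form) divided by effect of z on t (first stage).
reduced_form = np.cov(z, y)[0, 1] / np.var(z, ddof=1)
first_stage = np.cov(z, t)[0, 1] / np.var(z, ddof=1)
iv = reduced_form / first_stage

print(f"OLS = {ols:.2f}, IV = {iv:.2f}")
```

The variance of `z` cancels in the ratio, which is why the estimator can be written directly as $\text{Cov}(Z, Y) / \text{Cov}(Z, T)$. A small first stage makes the ratio unstable, which is the weak-instrument problem.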

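LATE (Local Average Treatment Effect): When effects vary across units and the instrument never pushes anyone away from treatment (monotonicity), IV estimates the causal effect only for “compliers”: units whose treatment status changes when the instrument changes. This may differ from the ATE, which averages over everyone.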
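The difference between LATE and ATE can be illustrated with a simulation in which unit types are known. The shares of compliers, never-takers, and always-takers and their effect sizes are invented; in real data these types are not observed.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000

# Assumed population: 0 = complier (50%), 1 = never-taker (30%), 2 = always-taker (20%).
kind = rng.choice(3, size=n, p=[0.5, 0.3, 0.2])
z = rng.integers(0, 2, size=n)                       # randomly assigned instrument

# Compliers follow the instrument; the other types ignore it (no defiers).
t = np.where(kind == 2, 1, np.where(kind == 1, 0, z))

# Assumed effects differ by type.
effect = np.array([1.0, 4.0, 3.0])[kind]
y = effect * t + rng.normal(size=n)

# Wald / IV estimate with a binary instrument.
wald = (y[z == 1].mean() - y[z == 0].mean()) / (t[z == 1].mean() - t[z == 0].mean())

ate = effect.mean()
late = effect[kind == 0].mean()
print(f"IV = {wald:.2f}, LATE = {late:.2f}, ATE = {ate:.2f}")
```

The IV estimate tracks the complier effect rather than the population average, because never-takers and always-takers have the same treatment status whatever the instrument says and so contribute nothing to either the numerator or the denominator.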

Causal Inference in Tech Companies

Google, Lyft, DoorDash, and Netflix have all published work on using causal inference to measure business impact:

Observational A/B testing: When true A/B tests are impossible (e.g., evaluating the impact of a sales team’s activities), methods such as DiD and synthetic controls can estimate treatment effects from observational data.

Long-run effects: Standard A/B tests run 2–4 weeks. Long-run effects (subscription retention, habit formation) require causal inference from observational longitudinal data.

Heterogeneous treatment effects: Different users respond differently to features. Causal forests (Wager & Athey, 2018) estimate conditional average treatment effects, that is, how the effect varies with user characteristics. This is useful for targeting personalized interventions.
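The idea of heterogeneous effects can be illustrated without a causal forest. The sketch below is a much simpler stand-in: it assumes a randomized experiment and a single known segment variable, and estimates the effect separately within each segment by a difference in means. The segments and effect sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 60_000

segment = rng.integers(0, 3, size=n)            # hypothetical user segments 0, 1, 2
t = rng.integers(0, 2, size=n)                  # randomized treatment assignment
true_cate = np.array([0.0, 1.0, 3.0])           # assumed effect in each segment
y = 5.0 + true_cate[segment] * t + rng.normal(size=n)

# Difference in means within each segment estimates that segment's effect.
cate_hat = np.array([
    y[(segment == s) & (t == 1)].mean() - y[(segment == s) & (t == 0)].mean()
    for s in range(3)
])

overall = y[t == 1].mean() - y[t == 0].mean()
print("per-segment effects:", np.round(cate_hat, 2), "overall:", round(overall, 2))
```

The overall estimate hides the fact that one segment does not respond at all. Methods like causal forests aim to discover this kind of variation across many covariates at once instead of relying on a segment chosen in advance.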

One thing to remember: Causal inference is the discipline of answering “would things have been different if we had done X?” — and the tools (DiD, IV, RDD, propensity scores) are all clever ways to construct valid counterfactuals from observational data.
