Causal Inference — Deep Dive
Pearl’s Do-Calculus
Judea Pearl’s structural causal model (SCM) framework formalizes causal reasoning using DAGs and the do-operator.
The do-operator $P(Y | do(X=x))$ represents the distribution of $Y$ when we intervene to set $X=x$ — different from the conditional distribution $P(Y | X=x)$ which includes selection effects.
Example: $P(\text{recovery} | \text{drug}=1)$ is the probability of recovery for people who chose to take the drug (includes selection — sicker people may be more likely to take the drug). $P(\text{recovery} | do(\text{drug}=1))$ is the probability of recovery if everyone took the drug — the causal effect.
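A small worked example can make the gap between the two quantities concrete. The sketch below assumes a hypothetical three-variable model (illness severity, drug, recovery); every probability in it is made up for illustration and does not come from real data.

```python
# Hypothetical SCM: severity -> drug, severity -> recovery, drug -> recovery.
# All probabilities below are illustrative assumptions, not real data.
P_SEVERE = 0.5                            # P(severity = 1)
P_DRUG_GIVEN_SEV = {0: 0.2, 1: 0.8}       # sicker people take the drug more often
P_REC_GIVEN_DRUG_SEV = {                  # P(recovery = 1 | drug, severity)
    (0, 0): 0.7, (1, 0): 0.9,
    (0, 1): 0.2, (1, 1): 0.4,
}

def p_sev(s):
    return P_SEVERE if s == 1 else 1 - P_SEVERE

def p_recovery_given_drug(d):
    """Observational P(recovery=1 | drug=d): severity is weighted by P(severity | drug=d)."""
    def p_d_given_s(s):
        return P_DRUG_GIVEN_SEV[s] if d == 1 else 1 - P_DRUG_GIVEN_SEV[s]
    joint = sum(P_REC_GIVEN_DRUG_SEV[(d, s)] * p_d_given_s(s) * p_sev(s) for s in (0, 1))
    marginal = sum(p_d_given_s(s) * p_sev(s) for s in (0, 1))
    return joint / marginal

def p_recovery_do_drug(d):
    """Interventional P(recovery=1 | do(drug=d)): severity keeps its population distribution."""
    return sum(P_REC_GIVEN_DRUG_SEV[(d, s)] * p_sev(s) for s in (0, 1))

observed_gap = p_recovery_given_drug(1) - p_recovery_given_drug(0)
causal_effect = p_recovery_do_drug(1) - p_recovery_do_drug(0)
print(f"observational difference: {observed_gap:+.2f}")
print(f"interventional difference: {causal_effect:+.2f}")
```

Under these assumed numbers the observational difference is -0.10 while the interventional difference is +0.20: the drug helps within each severity group, yet it looks harmful in the raw data because sicker people take it more often.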
The three rules of do-calculus (Pearl, 1995):
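- Rule 1 (Insertion/deletion of observations): If $(Y \perp Z \mid X, W)$ in $G_{\overline{X}}$, the graph with all arrows into $X$ removed: $$P(Y | do(X), Z, W) = P(Y | do(X), W)$$
- Rule 2 (Action/observation exchange): If $(Y \perp Z \mid X, W)$ in $G_{\overline{X}\underline{Z}}$, the graph with all arrows into $X$ and all arrows out of $Z$ removed: $$P(Y | do(X), do(Z), W) = P(Y | do(X), Z, W)$$
- Rule 3 (Insertion/deletion of actions): If $(Y \perp Z \mid X, W)$ in $G_{\overline{X}\,\overline{Z(W)}}$, where $Z(W)$ is the set of $Z$-nodes that are not ancestors of any $W$-node in $G_{\overline{X}}$: $$P(Y | do(X), do(Z), W) = P(Y | do(X), W)$$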
Do-calculus is complete: any identifiable causal effect can be derived from observational data using these three rules applied to the DAG.
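As a worked illustration of how the rules combine, the derivation below sketches how a familiar adjustment formula follows from Rules 2 and 3. It assumes a set $Z$ that contains no descendants of $X$ and blocks every path between $X$ and $Y$ that starts with an arrow into $X$; $G_{\overline{X}}$ and $G_{\underline{X}}$ denote the DAG with the arrows into $X$ and out of $X$ removed, respectively.

$$\begin{aligned} P(y | do(x)) &= \sum_z P(y | do(x), z)\, P(z | do(x)) && \text{(law of total probability)} \\ &= \sum_z P(y | do(x), z)\, P(z) && \text{(Rule 3: } Z \perp X \text{ in } G_{\overline{X}}\text{)} \\ &= \sum_z P(y | x, z)\, P(z) && \text{(Rule 2: } Y \perp X \mid Z \text{ in } G_{\underline{X}}\text{)} \end{aligned}$$

The final line contains no do-operator, so under these assumptions it can be estimated from observational data alone.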
Backdoor and Frontdoor Criteria
Backdoor Criterion
A set of variables $Z$ satisfies the backdoor criterion for the causal effect of $X$ on $Y$ if:
- No node in $Z$ is a descendant of $X$
- $Z$ blocks all “backdoor paths” from $X$ to $Y$ (paths going through arrows pointing into $X$)
If the backdoor criterion is satisfied: $$P(Y | do(X)) = \sum_z P(Y | X, Z=z) P(Z=z)$$
This is the adjustment formula: condition on $Z$, then marginalize it out.
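The adjustment can be sketched for discrete variables as a short function over a joint probability table. The function name `backdoor_adjust`, the table layout `{(x, z, y): prob}`, and the toy probabilities are all assumptions made for this illustration.

```python
from collections import defaultdict
from itertools import product

def backdoor_adjust(joint, x, y):
    """Return sum_z P(Y=y | X=x, Z=z) P(Z=z) from a joint table {(x, z, y): prob}."""
    p_z = defaultdict(float)     # P(Z=z)
    p_xz = defaultdict(float)    # P(X=x, Z=z)
    for (xi, zi, yi), p in joint.items():
        p_z[zi] += p
        p_xz[(xi, zi)] += p
    total = 0.0
    for z, pz in p_z.items():
        if p_xz[(x, z)] == 0:
            raise ValueError(f"positivity violated: X={x} never occurs with Z={z}")
        total += joint.get((x, z, y), 0.0) / p_xz[(x, z)] * pz
    return total

# Toy joint built from assumed conditionals: Z -> X, Z -> Y, X -> Y.
p_z1 = 0.5
p_x1_given_z = {0: 0.2, 1: 0.8}
p_y1_given_xz = {(0, 0): 0.7, (1, 0): 0.9, (0, 1): 0.2, (1, 1): 0.4}
joint = {}
for x, z, y in product((0, 1), repeat=3):
    pz = p_z1 if z else 1 - p_z1
    px = p_x1_given_z[z] if x else 1 - p_x1_given_z[z]
    py = p_y1_given_xz[(x, z)] if y else 1 - p_y1_given_xz[(x, z)]
    joint[(x, z, y)] = pz * px * py

adjusted_effect = backdoor_adjust(joint, 1, 1) - backdoor_adjust(joint, 0, 1)
print(f"adjusted effect of X on Y: {adjusted_effect:.2f}")
```

The explicit positivity check matters in practice: if some stratum of $Z$ never contains $X=x$, the term $P(Y | X=x, Z=z)$ is undefined and the formula cannot be evaluated from data.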
Frontdoor Criterion
Sometimes all backdoor paths can’t be blocked (unobserved confounders). The frontdoor criterion provides an alternative when there’s an observed variable $M$ on the causal path $X \rightarrow M \rightarrow Y$:
- $M$ intercepts all directed paths from $X$ to $Y$
- There is no unblocked backdoor path from $X$ to $M$
- All backdoor paths from $M$ to $Y$ are blocked by $X$
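Frontdoor formula: $$P(Y | do(X=x)) = \sum_m P(M=m | X=x) \sum_{x'} P(Y | X=x', M=m) P(X=x')$$
Here $x'$ is a separate summation index that ranges over all values of $X$; it is distinct from the intervened value $x$.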
Classic example: smoking ($X$) → tar deposits ($M$) → cancer ($Y$), with unobserved genetic confounder for smoking and cancer. The frontdoor formula identifies the causal smoking effect through the mediator.
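The frontdoor formula can be checked numerically on a toy model. The sketch below assumes a hypothetical binary SCM with a hidden confounder `U` (affecting `X` and `Y`) and a mediator `M`; all probabilities are illustrative. The frontdoor estimate uses only the observed joint over `(X, M, Y)`, and it is compared against a ground truth that does use `U`.

```python
from itertools import product

# Illustrative assumptions: U -> X, U -> Y, X -> M, M -> Y, with U unobserved.
P_U1 = 0.5
P_X1_GIVEN_U = {0: 0.2, 1: 0.8}
P_M1_GIVEN_X = {0: 0.1, 1: 0.9}
P_Y1_GIVEN_MU = {(0, 0): 0.1, (1, 0): 0.5, (0, 1): 0.4, (1, 1): 0.8}

def bern(p_one, value):
    return p_one if value == 1 else 1 - p_one

# Observed joint P(X, M, Y): the confounder U is summed out and then discarded.
observed = {}
for x, m, y in product((0, 1), repeat=3):
    observed[(x, m, y)] = sum(
        bern(P_U1, u) * bern(P_X1_GIVEN_U[u], x) * bern(P_M1_GIVEN_X[x], m)
        * bern(P_Y1_GIVEN_MU[(m, u)], y)
        for u in (0, 1)
    )

def prob(**fixed):
    """Marginal probability of the given assignments, e.g. prob(x=1, m=0)."""
    return sum(p for (x, m, y), p in observed.items()
               if all({"x": x, "m": m, "y": y}[k] == v for k, v in fixed.items()))

def frontdoor(x, y=1):
    """sum_m P(m | x) * sum_x' P(y | x', m) P(x'), using only the observed joint."""
    total = 0.0
    for m in (0, 1):
        p_m_given_x = prob(x=x, m=m) / prob(x=x)
        inner = sum(prob(x=xp, m=m, y=y) / prob(x=xp, m=m) * prob(x=xp) for xp in (0, 1))
        total += p_m_given_x * inner
    return total

def truth(x, y=1):
    """Ground-truth P(y | do(x)) computed from the full model, including U."""
    return sum(bern(P_M1_GIVEN_X[x], m) * bern(P_U1, u) * bern(P_Y1_GIVEN_MU[(m, u)], y)
               for m in (0, 1) for u in (0, 1))

frontdoor_effect = frontdoor(1) - frontdoor(0)
true_effect = truth(1) - truth(0)
naive_effect = prob(x=1, y=1) / prob(x=1) - prob(x=0, y=1) / prob(x=0)
print(f"frontdoor {frontdoor_effect:.2f}, truth {true_effect:.2f}, naive {naive_effect:.2f}")
```

With these assumed numbers the frontdoor estimate matches the true effect (0.32), while the naive conditional difference (0.50) overstates it because of the hidden confounder.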
Causal Forests and Heterogeneous Treatment Effects
Wager & Athey (2018) “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests” introduced causal forests — a non-parametric method for estimating individualized treatment effects $\tau(x) = E[Y(1) - Y(0) | X=x]$.
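Robustness via orthogonalization: Before the forest is grown, outcomes and treatments are centered on their conditional means: $$\tilde{Y}_i = Y_i - \hat{\mu}(X_i), \quad \tilde{T}_i = T_i - \hat{e}(X_i)$$
Where $\hat{\mu}(x)$ estimates $E[Y | X=x]$ and $\hat{e}(x)$ estimates the propensity $E[T | X=x]$, each fitted so that observation $i$ does not influence its own prediction (for example, out-of-bag or cross-fitted predictions). The forest then estimates $\tau(x)$ from a locally weighted regression of $\tilde{Y}$ on $\tilde{T}$. Errors in $\hat{\mu}$ and $\hat{e}$ affect $\hat{\tau}(x)$ only at second order (through products and squares of those errors), so the estimate tolerates moderately inaccurate nuisance models. This is weaker than classical double robustness: a badly misspecified propensity model still biases the result. The centering step comes from the generalized random forest formulation (Athey, Tibshirani & Wager, 2019).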
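Honest trees: Each tree uses one half of its subsample to choose the splits and the other half to estimate the leaf-level effects. Keeping the two roles separate stops the tree from fitting sample noise in the treatment effect estimates, and it underpins the asymptotic normality results behind the confidence intervals.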
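The honesty idea can be illustrated with a single split. The sketch below is a toy "honest stump" on simulated randomized data, not the grf algorithm: the data-generating process, the split criterion, and the helper names (`simulate`, `effect`, `choose_split`) are assumptions chosen for brevity.

```python
import random

random.seed(0)

def simulate(n):
    """Randomized treatment; the true effect is 2 when x > 0.5 and 0 otherwise (an assumption)."""
    rows = []
    for _ in range(n):
        x = random.random()
        t = random.randint(0, 1)
        tau = 2.0 if x > 0.5 else 0.0
        y = x + tau * t + random.gauss(0, 1)
        rows.append((x, t, y))
    return rows

def effect(rows):
    """Difference in mean outcome between treated and control rows."""
    treated = [y for _, t, y in rows if t == 1]
    control = [y for _, t, y in rows if t == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

def choose_split(rows, candidates):
    """Pick the threshold whose two leaves have the most different estimated effects."""
    def score(c):
        left = [r for r in rows if r[0] <= c]
        right = [r for r in rows if r[0] > c]
        return len(left) * len(right) * (effect(left) - effect(right)) ** 2
    return max(candidates, key=score)

data = simulate(4000)
structure_half, estimation_half = data[:2000], data[2000:]

# Structure half: decides where to split. Estimation half: supplies the leaf estimates.
threshold = choose_split(structure_half, [i / 10 for i in range(1, 10)])
tau_left = effect([r for r in estimation_half if r[0] <= threshold])
tau_right = effect([r for r in estimation_half if r[0] > threshold])
print(f"split at x <= {threshold}: effect {tau_left:.2f} (left), {tau_right:.2f} (right)")
```

Because the leaf estimates come from rows that played no part in choosing the threshold, they are not inflated by the search over candidate splits. A real causal forest repeats this over many subsampled trees with recursive splits.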
The grf (Generalized Random Forests) R package implements causal forests with confidence intervals. Typical applications include:
- Heterogeneous effects of promotions (which user segments respond to discounts?)
- Personalized medicine (which patients benefit most from a treatment?)
- Optimal policy design (when should we intervene?)
Double Machine Learning
Chernozhukov et al. (2018) “Double/Debiased Machine Learning” provides a general framework for semiparametric causal estimation using modern ML.
For a partially linear model $Y = \tau T + g(X) + \epsilon$:
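Naive approach: Regress $Y$ on $T$ and $X$ with a regularized learner such as the lasso. Problem: with high-dimensional $X$, regularization bias leaks into $\hat{\tau}$. The $\ell_1$ penalty can shrink the $\tau$ coefficient toward zero, and errors in the estimate of $g$ are absorbed into it.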
DML procedure:
- Fit $\hat{\mu}_T = E[T|X]$ using any ML method (cross-fitted)
- Fit $\hat{\mu}_Y = E[Y|X]$ using any ML method (cross-fitted)
- Partial out $X$: $\tilde{T}_i = T_i - \hat{\mu}_{T,-i}(X_i)$, $\tilde{Y}_i = Y_i - \hat{\mu}_{Y,-i}(X_i)$
- OLS regression of $\tilde{Y}$ on $\tilde{T}$: $\hat{\tau} = \frac{\tilde{T}^\top \tilde{Y}}{\tilde{T}^\top \tilde{T}}$
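Cross-fitting (the $-i$ subscript) means each nuisance prediction for observation $i$ comes from a model trained on folds that exclude $i$. This removes the overfitting bias that arises when the same data are used both to fit the nuisance models and to compute residuals. Partialling out $X$ from both $T$ and $Y$ (orthogonalization) is what removes the regularization bias. Together they make the final OLS step $\sqrt{n}$-consistent and asymptotically normal despite the high-dimensional nuisance parameters, provided the nuisance estimators converge fast enough (roughly faster than $n^{-1/4}$).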
DML enables causal effect estimation with provably valid confidence intervals even when the control variables $X$ have thousands of dimensions and ML models are used for nuisance parameter estimation.
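The procedure can be sketched end to end in plain Python. Everything specific here is an assumption made for illustration: the simulated data, the one-dimensional $X$, the fold assignment, and the binned-mean learner `fit_binned_mean`, which stands in for "any ML method".

```python
import math
import random

random.seed(0)
TAU = 1.5                      # true effect in the simulated model (an assumption)
N, K, BINS = 4000, 5, 20

xs = [random.random() for _ in range(N)]
ts = [math.sin(2 * math.pi * x) + random.gauss(0, 1) for x in xs]
ys = [TAU * t + 2 * math.sin(2 * math.pi * x) + random.gauss(0, 1) for x, t in zip(xs, ts)]

def fit_binned_mean(x_train, v_train):
    """Toy nuisance learner: mean of v within equal-width bins of x on [0, 1)."""
    sums, counts = [0.0] * BINS, [0] * BINS
    for x, v in zip(x_train, v_train):
        b = min(int(x * BINS), BINS - 1)
        sums[b] += v
        counts[b] += 1
    overall = sum(v_train) / len(v_train)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * BINS), BINS - 1)]

def slope(u, v):
    """No-intercept OLS slope of v on u."""
    return sum(a * b for a, b in zip(u, v)) / sum(a * a for a in u)

t_res, y_res = [0.0] * N, [0.0] * N
for k in range(K):
    test_idx = range(k, N, K)                        # fold k: every K-th observation
    train_idx = [i for i in range(N) if i % K != k]
    mu_t = fit_binned_mean([xs[i] for i in train_idx], [ts[i] for i in train_idx])
    mu_y = fit_binned_mean([xs[i] for i in train_idx], [ys[i] for i in train_idx])
    for i in test_idx:
        t_res[i] = ts[i] - mu_t(xs[i])               # models never saw row i
        y_res[i] = ys[i] - mu_y(xs[i])

tau_dml = slope(t_res, y_res)
t_bar, y_bar = sum(ts) / N, sum(ys) / N
tau_omit_x = slope([t - t_bar for t in ts], [y - y_bar for y in ys])  # comparison: ignores X
print(f"DML estimate {tau_dml:.2f}, regression omitting X {tau_omit_x:.2f}, truth {TAU}")
```

The comparison regression that omits $X$ is biased upward in this simulation because $X$ drives both $T$ and $Y$, while the residual-on-residual estimate should land close to the true value. In practice the binned mean would be replaced by a learner suited to high-dimensional $X$.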
Causal Discovery
All methods above assume the causal structure (DAG) is known. Causal discovery attempts to learn the DAG from observational data.
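Constraint-based methods (PC algorithm, FCI): Use conditional independence tests to decide which edges to keep. If $X \perp Y | Z$ for some conditioning set $Z$, the edge between $X$ and $Y$ is removed, since no direct causal link is needed to explain their dependence. These methods are asymptotically correct under the causal Markov and faithfulness assumptions; PC additionally assumes no unobserved confounders, while FCI allows them.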
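A single conditional independence check of the kind these algorithms rely on can be sketched with partial correlation. The simulated chain $X \rightarrow Z \rightarrow Y$ and the helper names are assumptions for this illustration, and partial correlation is only a valid independence test under roughly linear-Gaussian relationships.

```python
import math
import random

random.seed(0)
n = 20000
# Hypothetical chain X -> Z -> Y with Gaussian noise.
x = [random.gauss(0, 1) for _ in range(n)]
z = [xi + random.gauss(0, 1) for xi in x]
y = [zi + random.gauss(0, 1) for zi in z]

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    var_a = sum((p - ma) ** 2 for p in a)
    var_b = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)

def partial_corr(a, b, c):
    """Correlation of a and b after linearly removing c from both."""
    r_ab, r_ac, r_bc = corr(a, b), corr(a, c), corr(b, c)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

r_xy = corr(x, y)
r_xy_given_z = partial_corr(x, y, z)
print(f"corr(X, Y) = {r_xy:.3f}, partial corr(X, Y | Z) = {r_xy_given_z:.3f}")
```

Here $X$ and $Y$ are clearly correlated marginally, but the partial correlation given $Z$ is close to zero, so a constraint-based method would drop the $X$–$Y$ edge. A full implementation would turn the partial correlation into a formal test (commonly a Fisher z-test) and repeat it over many conditioning sets.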
Score-based methods (GES, NOTEARS): Score different DAGs and search for the highest-scoring one. NOTEARS (Zheng et al., 2018) formulated DAG learning as a continuous optimization problem: $$\min_W \ell(W) + \lambda \|W\|_1 \text{ s.t. } h(W) = 0$$
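Where $W$ is the $d \times d$ weighted adjacency matrix, $\circ$ is the elementwise product, and $h(W) = \text{tr}(e^{W \circ W}) - d$ equals zero exactly when $W$ contains no directed cycles. Because $h$ is smooth, this acyclicity constraint enables gradient-based optimization for causal discovery.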
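The acyclicity function can be evaluated directly on small matrices to see how it separates DAGs from cyclic graphs. The sketch below assumes a truncated power series for the matrix exponential to stay dependency-free; a practical implementation would use a dedicated routine such as `scipy.linalg.expm`. The example matrices are made up.

```python
import math

def matmul(a, b):
    d = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(d)) for j in range(d)] for i in range(d)]

def acyclicity(w, terms=30):
    """h(W) = tr(exp(W ∘ W)) - d, with exp computed by a truncated power series."""
    d = len(w)
    a = [[w[i][j] ** 2 for j in range(d)] for i in range(d)]       # Hadamard square W ∘ W
    power = [[float(i == j) for j in range(d)] for i in range(d)]  # A^0 = I
    trace = float(d)                                               # tr(I)
    for k in range(1, terms):
        power = matmul(power, a)                                   # A^k
        trace += sum(power[i][i] for i in range(d)) / math.factorial(k)
    return trace - d

dag = [[0.0, 1.5, -0.7],
       [0.0, 0.0, 2.0],
       [0.0, 0.0, 0.0]]      # edges 0->1, 0->2, 1->2: acyclic
cyclic = [[0.0, 1.0],
          [1.0, 0.0]]        # edges 0->1 and 1->0: a 2-cycle

h_dag = acyclicity(dag)
h_cyclic = acyclicity(cyclic)
print(f"h(dag) = {h_dag:.4f}, h(cyclic) = {h_cyclic:.4f}")
```

The intuition is that the diagonal of $(W \circ W)^k$ accumulates the weight of closed walks of length $k$. A DAG has none, so every term beyond the identity contributes zero to the trace, while any cycle makes $h$ strictly positive.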
The identification problem: Different DAGs can imply the same conditional independence structure (a “Markov equivalence class”). Without additional assumptions or interventional data, you can identify the equivalence class but not the exact DAG.
LLMs for causal discovery: Recent work (2023–2024) uses LLM domain knowledge to constrain causal discovery:
- Ask LLM which causal relationships are plausible given domain knowledge
- Use these as soft priors in the causal discovery algorithm
- The algorithm searches within LLM-constrained equivalence classes
This combination shows promise for accelerating causal model construction in domain areas where LLMs have strong prior knowledge.
One thing to remember: Do-calculus revealed that causal inference has a complete formal logic — you can determine from a DAG exactly which causal effects are identifiable and how to compute them — transforming causal inference from an art into a systematic procedure.
See Also
- A/B Testing: How tech companies run thousands of experiments at once to improve their products, applying the scientific method to everything from button colors to recommendation algorithms.
- Time Series Forecasting: How AI predicts the future from patterns in the past, the technology behind weather forecasts, stock predictions, electricity demand, and your iPhone's battery charge estimate.
- Feature Engineering: Why the way you describe your data to a machine learning model matters more than which model you choose, and the art of turning raw data into something AI can actually learn from.