Causal Inference — Deep Dive

Pearl's do-calculus, backdoor and frontdoor criteria, causal forests for heterogeneous effects, double machine learning, and why causal discovery is the next frontier.

Pearl’s Do-Calculus

Judea Pearl’s structural causal model (SCM) framework formalizes causal reasoning using DAGs and the do-operator.

The do-operator $P(Y | do(X=x))$ represents the distribution of $Y$ when we intervene to set $X=x$ — different from the conditional distribution $P(Y | X=x)$ which includes selection effects.

Example: $P(\text{recovery} | \text{drug}=1)$ is the probability of recovery among people who chose to take the drug, so it includes selection (sicker people may be more likely to take the drug). $P(\text{recovery} | do(\text{drug}=1))$ is the probability of recovery if everyone were made to take the drug. The causal effect of the drug is the contrast between this quantity and $P(\text{recovery} | do(\text{drug}=0))$.
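The gap between conditioning and intervening can be checked with a small simulation. The sketch below is illustrative only: the severity variable, the probabilities, and the function names are all made-up assumptions, not taken from any real study.

```python
import random

random.seed(0)

def simulate(n, intervene=None):
    """Draw n patients; if intervene is 0 or 1, force the drug choice (the do-operator)."""
    rows = []
    for _ in range(n):
        sick = random.random() < 0.5                      # confounder: illness severity
        if intervene is None:
            # Observational regime: sicker people are more likely to take the drug
            drug = random.random() < (0.8 if sick else 0.2)
        else:
            drug = bool(intervene)                        # do(drug = intervene)
        # Hypothetical outcome model: drug adds 0.2, being sick subtracts 0.4
        p_recover = 0.3 + 0.2 * drug + (0.0 if sick else 0.4)
        rows.append((sick, drug, random.random() < p_recover))
    return rows

def recovery_rate(rows, drug_value):
    """Share of recoveries among rows with the given drug status."""
    picked = [rec for (_, d, rec) in rows if d == drug_value]
    return sum(picked) / len(picked)

p_cond = recovery_rate(simulate(200_000), True)               # P(recovery | drug=1)
p_do = recovery_rate(simulate(200_000, intervene=1), True)    # P(recovery | do(drug=1))
print(round(p_cond, 3), round(p_do, 3))
```

Under these assumed probabilities the exact values are 0.58 for the conditional and 0.70 for the interventional quantity: drug takers are disproportionately sick, so conditioning understates how well the drug works.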

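The three rules of do-calculus (Pearl, 1995) are stated for disjoint sets of variables $X$, $Y$, $Z$, $W$ in a DAG $G$. Here $G_{\overline{X}}$ denotes $G$ with all arrows into $X$ removed, and $G_{\underline{X}}$ denotes $G$ with all arrows out of $X$ removed:

Rule 1 (Insertion/deletion of observations): If $(Y \perp Z | X, W)$ in $G_{\overline{X}}$: $$P(Y | do(X), Z, W) = P(Y | do(X), W)$$
Rule 2 (Action/observation exchange): If $(Y \perp Z | X, W)$ in $G_{\overline{X}\underline{Z}}$: $$P(Y | do(X), do(Z), W) = P(Y | do(X), Z, W)$$
Rule 3 (Insertion/deletion of actions): If $(Y \perp Z | X, W)$ in $G_{\overline{X}\,\overline{Z(W)}}$, where $Z(W)$ is the set of $Z$-nodes that are not ancestors of any $W$-node in $G_{\overline{X}}$: $$P(Y | do(X), do(Z), W) = P(Y | do(X), W)$$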

Do-calculus is complete (shown independently by Shpitser & Pearl and by Huang & Valtorta in 2006): if a causal effect is identifiable from observational data given the DAG, some sequence of these three rules derives its expression.
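As a worked illustration of how the rules combine, the following sketch derives an adjustment formula. It assumes a covariate set $Z$ that contains no descendant of $X$ and blocks every path from $X$ to $Y$ that starts with an arrow into $X$. $G_{\underline{X}}$ is the graph with arrows out of $X$ removed, and $G_{\overline{X}}$ is the graph with arrows into $X$ removed.

$$\begin{aligned}
P(y | do(x)) &= \sum_z P(y | do(x), z)\, P(z | do(x)) && \text{(law of total probability)} \\
&= \sum_z P(y | x, z)\, P(z | do(x)) && \text{(Rule 2, since } (Y \perp X | Z) \text{ in } G_{\underline{X}}\text{)} \\
&= \sum_z P(y | x, z)\, P(z) && \text{(Rule 3, since } (Z \perp X) \text{ in } G_{\overline{X}}\text{)}
\end{aligned}$$

The Rule 2 step holds because removing the arrows out of $X$ leaves only the paths that $Z$ blocks. The Rule 3 step holds because $Z$ contains no descendants of $X$, so intervening on $X$ cannot change the distribution of $Z$.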

Backdoor and Frontdoor Criteria

Backdoor Criterion

A set of variables $Z$ satisfies the backdoor criterion for the causal effect of $X$ on $Y$ if:

No node in $Z$ is a descendant of $X$
$Z$ blocks all “backdoor paths” from $X$ to $Y$ (paths going through arrows pointing into $X$)

If the backdoor criterion is satisfied: $$P(Y | do(X)) = \sum_z P(Y | X, Z=z) P(Z=z)$$

This is the adjustment formula: condition on $Z$, then marginalize it out.
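A minimal numerical sketch of the adjustment formula, using a made-up discrete distribution over a binary confounder $Z$, treatment $X$, and outcome $Y$. All probabilities and names are hypothetical.

```python
# Hypothetical model: Z -> X, Z -> Y, X -> Y, all binary
p_z = {0: 0.5, 1: 0.5}                                   # P(Z=z)
p_x1_given_z = {0: 0.2, 1: 0.8}                          # P(X=1 | Z=z)
p_y1_given_xz = {(0, 0): 0.7, (1, 0): 0.9,               # P(Y=1 | X=x, Z=z), keyed by (x, z)
                 (0, 1): 0.3, (1, 1): 0.5}

def p_y1_do_x(x):
    """Adjustment formula: sum_z P(Y=1 | X=x, Z=z) P(Z=z)."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)

def p_y1_given_x(x):
    """Plain conditioning: sum_z P(Y=1 | X=x, Z=z) P(Z=z | X=x)."""
    p_x_given_z = {z: p_x1_given_z[z] if x == 1 else 1 - p_x1_given_z[z] for z in p_z}
    p_x = sum(p_x_given_z[z] * p_z[z] for z in p_z)
    return sum(p_y1_given_xz[(x, z)] * p_x_given_z[z] * p_z[z] / p_x for z in p_z)

adjusted_effect = p_y1_do_x(1) - p_y1_do_x(0)            # causal contrast
naive_effect = p_y1_given_x(1) - p_y1_given_x(0)         # confounded contrast
print(adjusted_effect, naive_effect)
```

With these numbers the adjusted effect is 0.2 while the naive difference is -0.04. The only difference between the two functions is whether $Z$ is weighted by $P(Z)$ or by $P(Z | X)$.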

Frontdoor Criterion

Sometimes the backdoor paths cannot all be blocked because a confounder is unobserved. The frontdoor criterion provides an alternative when an observed variable $M$ lies on the causal path $X \rightarrow M \rightarrow Y$ and satisfies:

$M$ intercepts all directed paths from $X$ to $Y$
There is no unblocked backdoor path from $X$ to $M$
All backdoor paths from $M$ to $Y$ are blocked by $X$

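Frontdoor formula: $$P(Y | do(X=x)) = \sum_m P(M=m | X=x) \sum_{x'} P(Y | X=x', M=m) P(X=x')$$

The inner sum runs over all values $x'$ of $X$, independently of the intervened value $x$.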

Classic example: smoking ($X$) → tar deposits ($M$) → cancer ($Y$), with unobserved genetic confounder for smoking and cancer. The frontdoor formula identifies the causal smoking effect through the mediator.
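The frontdoor formula can be sanity-checked on a small discrete model. The sketch below assumes a hypothetical structure with an unobserved binary confounder `U` of `X` and `Y`. It applies the formula to the observational joint over `(X, M, Y)` only, then compares the result with the interventional probability computed from the full model. All numbers are invented.

```python
from itertools import product

p_u = {0: 0.5, 1: 0.5}                        # unobserved confounder U
p_x1_u = {0: 0.2, 1: 0.8}                     # P(X=1 | U)
p_m1_x = {0: 0.1, 1: 0.9}                     # P(M=1 | X)
p_y1_mu = {(0, 0): 0.1, (1, 0): 0.5,          # P(Y=1 | M, U), keyed by (m, u)
           (0, 1): 0.4, (1, 1): 0.8}

def bern(p1, v):
    """Probability that a Bernoulli(p1) variable equals v."""
    return p1 if v == 1 else 1 - p1

# Observational joint P(x, m, y): U is summed out, as it would be in real data
joint = {}
for x, m, y in product((0, 1), repeat=3):
    joint[(x, m, y)] = sum(
        p_u[u] * bern(p_x1_u[u], x) * bern(p_m1_x[x], m) * bern(p_y1_mu[(m, u)], y)
        for u in p_u
    )

def prob(**fixed):
    """Marginal probability of an assignment to any subset of x, m, y."""
    return sum(p for (x, m, y), p in joint.items()
               if all({"x": x, "m": m, "y": y}[k] == v for k, v in fixed.items()))

def frontdoor(x):
    """sum_m P(m | x) sum_x' P(Y=1 | x', m) P(x'), from observational quantities only."""
    total = 0.0
    for m in (0, 1):
        p_m_given_x = prob(x=x, m=m) / prob(x=x)
        inner = sum(prob(x=xp, m=m, y=1) / prob(x=xp, m=m) * prob(x=xp) for xp in (0, 1))
        total += p_m_given_x * inner
    return total

def truth(x):
    """P(Y=1 | do(X=x)) computed from the full model, which uses U."""
    return sum(p_u[u] * bern(p_m1_x[x], m) * p_y1_mu[(m, u)]
               for u in p_u for m in (0, 1))

print(frontdoor(1), truth(1), frontdoor(0), truth(0))
```

The two computations agree even though `frontdoor` never touches `U`, which is the point of the criterion.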

Causal Forests and Heterogeneous Treatment Effects

Wager & Athey (2018) “Estimation and Inference of Heterogeneous Treatment Effects using Random Forests” introduced causal forests — a non-parametric method for estimating individualized treatment effects $\tau(x) = E[Y(1) - Y(0) | X=x]$.

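Orthogonalization (local centering): In the generalized random forest formulation (Athey, Tibshirani & Wager, 2019), outcomes and treatments are residualized before the causal forest is grown: $$\tilde{Y}_i = Y_i - \hat{\mu}(X_i), \quad \tilde{T}_i = T_i - \hat{e}(X_i)$$

Here $\hat{\mu}(x)$ estimates the conditional mean outcome $E[Y | X=x]$ and $\hat{e}(x)$ estimates the propensity score $E[T | X=x]$, with the prediction for observation $i$ made without using observation $i$ itself (for example, via out-of-bag predictions). The forest then estimates $\tau(x)$ from a locally weighted regression of $\tilde{Y}$ on $\tilde{T}$. Errors in $\hat{\mu}$ and $\hat{e}$ affect the estimate only through their product, so moderate errors in the two nuisance models have a second-order effect on $\hat{\tau}(x)$.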

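Honest trees: Each tree uses one half of its subsample to choose the splits and the other half to estimate the treatment effect within each leaf. Because no observation influences both the partition and the leaf estimate, the leaf estimates are not overfit to sample noise.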
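The idea of honesty can be sketched with a single split rather than a full forest. The example below is a deliberately simplified assumption-laden illustration: treatment is randomized, there is one covariate, and the split rule (maximize the squared gap between leaf effects) is a stand-in rather than the criterion used by Wager & Athey or by grf. All names and numbers are hypothetical.

```python
import random

random.seed(1)

def make_data(n):
    """Simulated experiment where the effect is 2.0 for x > 0.5 and 0.0 otherwise."""
    data = []
    for _ in range(n):
        x = random.random()
        t = random.random() < 0.5                  # randomized treatment
        tau = 2.0 if x > 0.5 else 0.0
        y = tau * t + random.gauss(0, 1)
        data.append((x, t, y))
    return data

def effect(rows):
    """Difference in mean outcome, treated minus control."""
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)

def best_split(rows, candidates):
    """Pick the threshold whose two leaves differ most in estimated effect."""
    def gap(c):
        left = [r for r in rows if r[0] <= c]
        right = [r for r in rows if r[0] > c]
        return (effect(left) - effect(right)) ** 2
    return max(candidates, key=gap)

data = make_data(4000)
structure_half, estimation_half = data[:2000], data[2000:]

# Honesty: the split is chosen on one half, leaf effects are estimated on the other
threshold = best_split(structure_half, [i / 10 for i in range(2, 9)])
tau_left = effect([r for r in estimation_half if r[0] <= threshold])
tau_right = effect([r for r in estimation_half if r[0] > threshold])
print(threshold, round(tau_left, 2), round(tau_right, 2))
```

Estimating `tau_left` and `tau_right` on `structure_half` instead would reuse the same noise that selected the split, which tends to exaggerate the difference between leaves.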

The grf (Generalized Random Forests) R package implements causal forests with confidence intervals. Typical applications include:

Heterogeneous effects of promotions (which user segments respond to discounts?)
Personalized medicine (which patients benefit most from a treatment?)
Optimal policy design (when should we intervene?)

Double Machine Learning

Chernozhukov et al. (2018) “Double/Debiased Machine Learning” provides a general framework for semiparametric causal estimation using modern ML.

For a partially linear model $Y = \tau T + g(X) + \epsilon$:

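Naive approach: Regress $Y$ on $T$ and $X$ with a regularized learner such as the lasso. Problem: with high-dimensional $X$, regularization biases the estimate of $g$, and that bias carries over into $\hat{\tau}$. If $T$ itself is penalized, the $\ell_1$ penalty also shrinks $\hat{\tau}$ toward zero.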

DML procedure:

Fit $\hat{\mu}_T = E[T|X]$ using any ML method (cross-fitted)
Fit $\hat{\mu}_Y = E[Y|X]$ using any ML method (cross-fitted)
Partial out $X$: $\tilde{T}_i = T_i - \hat{\mu}_{T,-i}(X_i)$, $\tilde{Y}_i = Y_i - \hat{\mu}_{Y,-i}(X_i)$
OLS regression of $\tilde{Y}$ on $\tilde{T}$: $\hat{\tau} = \frac{\tilde{T}^\top \tilde{Y}}{\tilde{T}^\top \tilde{T}}$

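Cross-fitting (the $-i$ subscript) means that each nuisance prediction for observation $i$ comes from a model fit on data that excludes $i$. In practice the sample is split into $K$ folds and each fold is predicted by models trained on the other $K-1$ folds. This removes the overfitting bias that arises when the same observations are used both to fit the nuisance models and to estimate $\tau$. The regularization bias is handled by the partialling-out step itself: the residual-on-residual moment is Neyman orthogonal, so first-order errors in $\hat{\mu}_T$ and $\hat{\mu}_Y$ do not affect $\hat{\tau}$. Together, the two ingredients produce $\sqrt{n}$-consistent, asymptotically normal estimates despite the high-dimensional nuisance parameters.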
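The DML procedure above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions: the data are simulated with a known $\tau = 1.5$, $X$ is one-dimensional, and a piecewise-constant bin-mean regressor (`fit_bin_means`, a hypothetical helper) stands in for "any ML method". It is not the reference implementation of Chernozhukov et al.

```python
import math
import random

random.seed(2)
TAU = 1.5                                          # true effect in this simulation

def make_data(n):
    rows = []
    for _ in range(n):
        x = random.random()
        s = math.sin(2 * math.pi * x)
        t = s + random.gauss(0, 1)                 # treatment depends on x
        y = TAU * t + 2 * s + random.gauss(0, 1)   # g(x) = 2 sin(2 pi x) confounds T and Y
        rows.append((x, t, y))
    return rows

def fit_bin_means(xs, targets, bins=20):
    """Stand-in nuisance learner: piecewise-constant regression on x in [0, 1)."""
    sums, counts = [0.0] * bins, [0] * bins
    for x, v in zip(xs, targets):
        b = min(int(x * bins), bins - 1)
        sums[b] += v
        counts[b] += 1
    overall = sum(targets) / len(targets)
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * bins), bins - 1)]

def ols_slope(ts, ys):
    """Slope from a simple regression of ys on ts (with intercept)."""
    mt, my = sum(ts) / len(ts), sum(ys) / len(ys)
    return sum((t - mt) * (y - my) for t, y in zip(ts, ys)) / sum((t - mt) ** 2 for t in ts)

def dml(rows, folds=5):
    n = len(rows)
    t_res, y_res = [0.0] * n, [0.0] * n
    for k in range(folds):
        train = [rows[i] for i in range(n) if i % folds != k]
        xs = [r[0] for r in train]
        mu_t = fit_bin_means(xs, [r[1] for r in train])    # estimate of E[T | X]
        mu_y = fit_bin_means(xs, [r[2] for r in train])    # estimate of E[Y | X]
        for i in range(k, n, folds):                       # held-out fold: out-of-fold residuals
            x, t, y = rows[i]
            t_res[i] = t - mu_t(x)
            y_res[i] = y - mu_y(x)
    # Residual-on-residual OLS without intercept
    return sum(a * b for a, b in zip(t_res, y_res)) / sum(a * a for a in t_res)

rows = make_data(5000)
tau_naive = ols_slope([r[1] for r in rows], [r[2] for r in rows])   # ignores X
tau_dml = dml(rows)
print(round(tau_naive, 2), round(tau_dml, 2))
```

In this simulated setup the regression that ignores $X$ overstates the effect because $g(X)$ moves with $T$, while the cross-fitted residual regression lands close to the true 1.5.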

DML enables causal effect estimation with asymptotically valid confidence intervals even when the control variables $X$ have thousands of dimensions and ML models are used for nuisance parameter estimation, provided the nuisance estimators converge fast enough (roughly, faster than $n^{-1/4}$).

Causal Discovery

All methods above assume the causal structure (DAG) is known. Causal discovery attempts to learn the DAG from observational data.

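Constraint-based methods (PC algorithm, FCI): Use conditional independence tests to decide which edges to keep. If $X \perp Y | Z$ for some conditioning set $Z$, the edge between $X$ and $Y$ is removed. These methods are asymptotically correct under the causal Markov and faithfulness assumptions. PC additionally assumes there are no unobserved confounders, while FCI allows for them.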
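The conditional independence test at the heart of constraint-based methods can be illustrated with partial correlation, which is a valid test statistic in the linear-Gaussian case. The sketch below simulates a hypothetical chain $X \rightarrow Z \rightarrow Y$; the helper names are made up, and a real PC implementation would wrap this in a formal test such as Fisher's z rather than eyeballing the value.

```python
import math
import random

random.seed(3)

def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

def partial_corr(a, b, c):
    """Correlation of a and b after linearly removing c from both."""
    r_ab, r_ac, r_bc = corr(a, b), corr(a, c), corr(b, c)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

# Chain X -> Z -> Y with unit-variance Gaussian noise at each step
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
z = [xi + random.gauss(0, 1) for xi in x]
y = [zi + random.gauss(0, 1) for zi in z]

r_xy = corr(x, y)                       # marginal dependence
r_xy_given_z = partial_corr(x, y, z)    # dependence after conditioning on Z
print(round(r_xy, 3), round(r_xy_given_z, 3))
```

$X$ and $Y$ are clearly correlated marginally, but the partial correlation given $Z$ is near zero, so a constraint-based algorithm would drop the direct $X$ to $Y$ edge.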

Score-based methods (GES, NOTEARS): Score different DAGs and search for the highest-scoring one. NOTEARS (Zheng et al., 2018) formulated DAG learning as a continuous optimization problem: $$\min_W \ell(W) + \lambda \|W\|_1 \text{ s.t. } h(W) = 0$$

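Here $W$ is the $d \times d$ weighted adjacency matrix, $\circ$ is the elementwise (Hadamard) product, and $e^{(\cdot)}$ is the matrix exponential. The function $h(W) = \text{tr}(e^{W \circ W}) - d$ is smooth and equals zero exactly when $W$ encodes an acyclic graph, which enables gradient-based optimization for causal discovery.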
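The acyclicity function is easy to evaluate directly. The sketch below computes $\text{tr}(e^{W \circ W}) - d$ with a truncated power series for the matrix exponential. The truncation length, helper names, and example matrices are illustrative choices, and a practical implementation would use a library matrix exponential instead.

```python
def matmul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def h(w, terms=30):
    """tr(exp(W∘W)) - d, using the series exp(A) = sum_k A^k / k!."""
    d = len(w)
    sq = [[v * v for v in row] for row in w]               # Hadamard square W∘W
    term = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]   # A^0 / 0!
    trace = float(d)
    for k in range(1, terms):
        term = matmul(term, sq)
        term = [[v / k for v in row] for row in term]      # now A^k / k!
        trace += sum(term[i][i] for i in range(d))
    return trace - d

# w[i][j] != 0 means an edge i -> j
dag = [[0, 1.5, 0], [0, 0, -2.0], [0, 0, 0]]               # 0 -> 1 -> 2
cyclic = [[0, 1.5, 0], [0, 0, -2.0], [0.5, 0, 0]]          # adds 2 -> 0, closing a cycle
print(h(dag), h(cyclic))
```

The diagonal of $(W \circ W)^k$ accumulates the weights of closed walks of length $k$, so any cycle makes the trace exceed $d$. For the acyclic example every such term is zero.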

The identification problem: Different DAGs can imply the same set of conditional independencies. The DAGs that do so form a Markov equivalence class, characterized by a shared skeleton and shared v-structures. Without additional assumptions or interventional data, you can identify the equivalence class but not the exact DAG.
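Markov equivalence can be checked mechanically with the classical graphical criterion: two DAGs are equivalent exactly when they share the same skeleton (undirected edges) and the same v-structures (colliders $a \rightarrow c \leftarrow b$ with $a$ and $b$ non-adjacent). A minimal sketch, with hypothetical function names and edges written as `(parent, child)` pairs:

```python
from itertools import combinations

def skeleton(edges):
    """Undirected version of the edge set."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """Colliders a -> c <- b where a and b are not adjacent."""
    skel = skeleton(edges)
    parents = {}
    for a, c in edges:
        parents.setdefault(c, set()).add(a)
    found = set()
    for c, ps in parents.items():
        for a, b in combinations(sorted(ps), 2):
            if frozenset((a, b)) not in skel:
                found.add((a, c, b))
    return found

def markov_equivalent(e1, e2):
    return skeleton(e1) == skeleton(e2) and v_structures(e1) == v_structures(e2)

chain = [("X", "Z"), ("Z", "Y")]        # X -> Z -> Y
reverse = [("Y", "Z"), ("Z", "X")]      # X <- Z <- Y
fork = [("Z", "X"), ("Z", "Y")]         # X <- Z -> Y
collider = [("X", "Z"), ("Y", "Z")]     # X -> Z <- Y

print(markov_equivalent(chain, reverse),
      markov_equivalent(chain, fork),
      markov_equivalent(chain, collider))
```

The chain, its reversal, and the fork all imply only $X \perp Y | Z$, so observational data cannot tell them apart. The collider implies $X \perp Y$ marginally instead and sits in a class of its own.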

LLMs for causal discovery: Recent work (2023–2024) uses LLM domain knowledge to constrain causal discovery:

Ask LLM which causal relationships are plausible given domain knowledge
Use these as soft priors in the causal discovery algorithm
The algorithm searches within LLM-constrained equivalence classes

This combination shows promise for accelerating causal model construction in domain areas where LLMs have strong prior knowledge.

One thing to remember: Do-calculus revealed that causal inference has a complete formal logic — you can determine from a DAG exactly which causal effects are identifiable and how to compute them — transforming causal inference from an art into a systematic procedure.
