AI Ethics — Deep Dive

The Impossibility Theorem: Formal Treatment

Chouldechova (2017) and Kleinberg et al. (2016) proved that three fairness criteria are simultaneously satisfiable only when base rates are equal across groups or the classifier is perfect.

Let $Y$ be the true label, $\hat{Y}$ the prediction (a risk score $s$, or a 0/1 label in the binary case), and $A$ the sensitive attribute (group membership). Define:

  • Calibration: $P(Y=1|\hat{Y}=s, A=a)$ is the same for all groups $a$ at each score $s$
  • Balance for positive class: $E[\hat{Y}|Y=1, A=0] = E[\hat{Y}|Y=1, A=1]$ (equal TPR)
  • Balance for negative class: $E[\hat{Y}|Y=0, A=0] = E[\hat{Y}|Y=0, A=1]$ (equal FPR)
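The three quantities above can be estimated directly from scored records. The following is a minimal sketch, assuming each record is a hypothetical `(score, label, group)` tuple with a discrete score; the function names and the toy data are illustrative and not taken from any library or from the cited papers.

```python
from collections import defaultdict

def calibration_table(records):
    """Observed P(Y=1 | score=s, group=a) for each (score, group) pair."""
    counts = defaultdict(lambda: [0, 0])  # (score, group) -> [positives, total]
    for score, label, group in records:
        counts[(score, group)][0] += label
        counts[(score, group)][1] += 1
    return {key: pos / total for key, (pos, total) in counts.items()}

def class_balance(records, label_value):
    """Mean score per group among records whose true label equals label_value."""
    sums = defaultdict(lambda: [0.0, 0])  # group -> [score sum, count]
    for score, label, group in records:
        if label == label_value:
            sums[group][0] += score
            sums[group][1] += 1
    return {group: total / n for group, (total, n) in sums.items()}

# Hypothetical toy data: (score, true label, group).
# Group 0 has base rate 4/8; group 1 has base rate 5/12.
records = (
    [(0.75, 1, 0)] * 3 + [(0.75, 0, 0)] * 1 +
    [(0.25, 1, 0)] * 1 + [(0.25, 0, 0)] * 3 +
    [(0.75, 1, 1)] * 3 + [(0.75, 0, 1)] * 1 +
    [(0.25, 1, 1)] * 2 + [(0.25, 0, 1)] * 6
)

calibration = calibration_table(records)
positive_balance = class_balance(records, 1)
negative_balance = class_balance(records, 0)

print("calibration:", calibration)
print("mean score among positives:", positive_balance)
print("mean score among negatives:", negative_balance)
```

In this toy data the scores are calibrated identically in both groups, yet the mean score among positives and among negatives differs by group, because the base rates differ.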

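Theorem (Kleinberg et al.): A classifier cannot simultaneously satisfy calibration, balance for the positive class, and balance for the negative class when $P(Y=1|A=0) \neq P(Y=1|A=1)$ and the classifier is not perfect. Chouldechova proves the analogous result for binary classifiers, with predictive parity (equal PPV) in place of calibration and equal FPR and FNR in place of the balance conditions.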

The intuition: if two groups have different base rates (e.g., different historical recidivism rates), and the classifier is calibrated, then equating true positive rates across groups forces false positive rates to differ, and vice versa.
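A quick numerical check of this intuition, as a sketch. For a binary classifier, solving the definition of PPV for the false positive rate gives $FPR = \frac{p}{1-p} \cdot \frac{1-PPV}{PPV} \cdot TPR$, where $p$ is the group's base rate (this is the identity used in Chouldechova, 2017). The code uses equal PPV as the binary analogue of calibration, which is an assumption of this example; the helper name `implied_fpr` and the specific rates are hypothetical.

```python
def implied_fpr(base_rate, ppv, tpr):
    """False positive rate forced by a given base rate, PPV, and TPR.

    Derived from PPV = p*TPR / (p*TPR + (1-p)*FPR), solved for FPR.
    """
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * tpr

# Hypothetical numbers: identical PPV and TPR in both groups,
# but different base rates.
ppv, tpr = 0.7, 0.6
fpr_group0 = implied_fpr(0.5, ppv, tpr)
fpr_group1 = implied_fpr(0.3, ppv, tpr)

print(f"group 0 implied FPR: {fpr_group0:.3f}")
print(f"group 1 implied FPR: {fpr_group1:.3f}")
```

With these numbers the implied false positive rates are roughly 0.26 and 0.11: once PPV and TPR are held equal, the gap in base rates has nowhere to go except into the FPR.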

In the COMPAS case: Black defendants had higher historical recidivism rates than white defendants (a result of systemic inequities in policing, sentencing, and opportunity). A calibrated classifier will necessarily have asymmetric false positive or false negative rates. ProPublica highlighted the false positive disparity; Northpointe highlighted calibration. Neither was wrong.

Which fairness criterion matters most depends on the use case and values:

  • For medical screening: sensitivity (true positive rate) parity may be paramount — you don’t want to miss disease in any group
  • For criminal justice: false positive parity may be paramount — wrongly labeling someone as high-risk is a severe harm
  • For correcting historical underrepresentation: demographic parity (equal selection rates across groups) may be paramount, since it holds regardless of group base rates

Causal Fairness

Statistical fairness criteria are silent about why a model makes different predictions for different groups. Causal approaches ask: is the sensitive attribute itself causally affecting the prediction, or is the model using a proxy that’s causally downstream of the attribute?

Counterfactual fairness (Kusner et al., 2017): A decision $\hat{Y}$ is counterfactually fair if, in a counterfactual world where the person had a different value of the sensitive attribute $A$ (with the background factors held fixed), they would receive the same decision.

$$P(\hat{Y}_{A \leftarrow a}(U) = y \mid X = x, A = a) = P(\hat{Y}_{A \leftarrow a'}(U) = y \mid X = x, A = a)$$

Where $U$ represents unobservable background factors. This requires a causal model — a DAG specifying how $A$, $X$, and $U$ causally relate.
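To make the definition concrete, here is a toy sketch under a fully assumed causal model: a hypothetical structural equation $X = U + 2A$ with binary $A$. Because this model is deterministic, the probability statement reduces to checking that the decision is unchanged when $A$ is flipped and the change is propagated to $X$. All function names and thresholds are illustrative.

```python
def generate_x(a, u):
    """Hypothetical structural equation: X is caused by both A and U."""
    return u + 2.0 * a

def counterfactual_x(x, a, a_new):
    """Abduction, action, prediction for the toy model above."""
    u = x - 2.0 * a              # abduction: recover U from the observed (X, A)
    return generate_x(a_new, u)  # action and prediction: set A, recompute X

def predictor_on_x(x, a):
    """Thresholds X directly, so it inherits the effect of A on X."""
    return int(x > 1.5)

def predictor_on_u(x, a):
    """Thresholds only the inferred background factor U = X - 2A."""
    return int(x - 2.0 * a > 0.5)

def is_counterfactually_fair(predictor, individuals):
    """True if flipping A (and propagating to X) never changes the decision."""
    for x, a in individuals:
        a_flipped = 1 - a
        x_cf = counterfactual_x(x, a, a_flipped)
        if predictor(x, a) != predictor(x_cf, a_flipped):
            return False
    return True

# Four hypothetical individuals covering both groups and two values of U.
individuals = [(generate_x(a, u), a) for a in (0, 1) for u in (0.0, 1.0)]

fair_on_x = is_counterfactually_fair(predictor_on_x, individuals)
fair_on_u = is_counterfactually_fair(predictor_on_u, individuals)
print("predictor on X is counterfactually fair:", fair_on_x)
print("predictor on U is counterfactually fair:", fair_on_u)
```

The predictor that thresholds $X$ fails the check, while the one that depends only on the recovered $U$ passes. The check is only as good as the assumed structural equation, which is the practical weakness of this family of methods.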

Path-specific fairness: block only the causal paths from $A$ to $\hat{Y}$ that flow through “illegitimate” variables (directly or through proxies), while allowing effects through “legitimate” variables (e.g., a test score affected by education quality, not by race directly).

In practice, causal fairness requires strong assumptions about causal structure that are rarely verifiable from observational data alone — limiting practical application.

Explanation Methods: Theory and Limitations

SHAP values (Shapley values): For a prediction function $f$ and input $x$, the Shapley value of feature $i$ is:

$$\phi_i = \sum_{S \subseteq N \setminus \{i\}} \frac{|S|!\,(|N|-|S|-1)!}{|N|!} \left[ f(S \cup \{i\}) - f(S) \right]$$

Where $N$ is the set of all features and $f(S)$ represents the model output using only feature subset $S$ (other features marginalized out).

Properties satisfied:

  • Efficiency: $\sum_i \phi_i = f(x) - E[f(X)]$ (contributions sum to the prediction’s deviation from the baseline, the expected model output)
  • Symmetry: Features with equal marginal contributions have equal Shapley values
  • Dummy: Features with no marginal contribution receive zero Shapley value
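The formula and the properties above can be checked by brute force on a small model. This is a sketch only: it approximates $f(S)$ by replacing features outside $S$ with a fixed baseline value, which is an assumption of this example rather than the marginalization over background data described above. The toy `model` and the input values are hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values by enumerating every feature subset.

    f(S) is evaluated with features in S taken from x and all other
    features taken from baseline (a simplifying assumption).
    """
    n = len(x)

    def value(subset):
        mixed = [x[j] if j in subset else baseline[j] for j in range(n)]
        return f(mixed)

    phis = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            # Weight |S|! (|N| - |S| - 1)! / |N|! from the formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi += weight * (value(s | {i}) - value(s))
        phis.append(phi)
    return phis

def model(features):
    """Toy model with an interaction term; the third feature is ignored."""
    x0, x1, x2 = features
    return 2 * x0 + x0 * x1

x = [1.0, 2.0, 5.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)

print("Shapley values:", phi)
print("sum of values:", sum(phi))
print("f(x) - f(baseline):", model(x) - model(baseline))
```

Here the ignored feature receives zero (dummy), the interaction term is split equally between the two features involved, and the values sum to the difference between the prediction and the baseline output (efficiency). The nested loops visit every subset, which is why the exact computation scales exponentially.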

Computing Shapley values exactly requires evaluating every feature subset, so the cost is exponential in the number of features. Tree SHAP (Lundberg et al., 2018) computes exact Shapley values for tree-based models in polynomial time. For neural networks, approximations (KernelSHAP, DeepSHAP) are used.

Limitations of SHAP: Shapley values are defined with respect to the model, not the data-generating process. A model that achieves high accuracy via a spurious correlation will explain its decision in terms of that spurious feature — the explanation is faithful to the model but misleading about the causal process.

Rudin (2019) argues strongly for intrinsically interpretable models (decision trees, sparse linear models) for high-stakes decisions rather than explaining black boxes. The explanation of a black box is inherently an approximation; an interpretable model’s explanation is exact.

Algorithmic Auditing in Practice

Pre-deployment auditing: Measure performance disparities across demographic groups on held-out test sets before deployment. Requires labeled demographic data (often not available) or proxy estimation.
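A pre-deployment disparity check can be as simple as the following sketch. It assumes binary labels and predictions plus a group label for each held-out example; the function names, group names, and data are hypothetical, and a real audit would also need confidence intervals and adequate sample sizes per group.

```python
def audit_by_group(y_true, y_pred, groups):
    """Per-group true positive and false positive rates."""
    counts = {}
    for yt, yp, g in zip(y_true, y_pred, groups):
        c = counts.setdefault(g, {"tp": 0, "fp": 0, "tn": 0, "fn": 0})
        # "t"/"f": prediction correct or not; "p"/"n": predicted class.
        key = ("t" if yt == yp else "f") + ("p" if yp == 1 else "n")
        c[key] += 1
    report = {}
    for g, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        report[g] = {
            "tpr": c["tp"] / positives if positives else float("nan"),
            "fpr": c["fp"] / negatives if negatives else float("nan"),
        }
    return report

def max_gap(report, metric):
    """Largest difference in a metric between any two groups."""
    values = [rates[metric] for rates in report.values()]
    return max(values) - min(values)

# Hypothetical held-out test set with two groups.
y_true = [1, 1, 1, 1, 0, 0, 0, 0] + [1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0] + [1, 0, 1, 1, 0, 0]
groups = ["a"] * 8 + ["b"] * 6

report = audit_by_group(y_true, y_pred, groups)
tpr_gap = max_gap(report, "tpr")
fpr_gap = max_gap(report, "fpr")

print(report)
print("TPR gap:", tpr_gap, "FPR gap:", fpr_gap)
```

Reporting the gap per metric, rather than a single aggregate score, keeps visible which type of error falls more heavily on which group.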

Post-deployment monitoring: Track outcomes in production. Requires outcome data — often only available months after deployment (e.g., loan default rates, recidivism, hiring outcomes).

Red teaming for bias: Systematically probe model outputs for stereotypical associations. Widely used for language models. Stanford’s HELM benchmark (Holistic Evaluation of Language Models) includes bias and toxicity metrics.

Third-party audits: Independent assessment by researchers without conflicts of interest. AJL (Algorithmic Justice League), CSET, and AI Now Institute have conducted influential audits. The EU AI Act will require third-party assessments for some high-risk AI systems.

Value Alignment Approaches

Beyond technical fairness metrics, there’s a broader question of how to encode complex human values into AI systems.

Value learning (e.g., inverse reinforcement learning and Inverse Reward Design; Russell et al.): Infer human values from behavior rather than specifying them explicitly. Humans reveal preferences through their choices; the AI learns a distribution over utility functions consistent with observed behavior. Advantage: avoids the specification problem. Disadvantage: behavior shows only revealed preferences, which may differ from what people would endorse on reflection, and human choices are often inconsistent.

Moral uncertainty (MacAskill et al.): AI systems should maintain calibrated uncertainty across ethical frameworks rather than committing to one. An action that’s clearly permissible under both consequentialist and deontological frameworks is safer than one that’s permissible under only one.

Participatory design: Rather than ethicists deciding what values AI should encode, involve affected communities in defining those values. Organizations like the Ada Lovelace Institute have developed methodologies for participatory AI design, especially for high-stakes applications in healthcare and criminal justice.

The fundamental challenge: value specification is a political process. Who gets to decide what’s fair? Whose historical data defines the base rates? Which communities’ participation is solicited? Technical approaches to fairness can obscure these political choices behind mathematical formalism.

One thing to remember: The mathematical impossibility of satisfying all fairness criteria simultaneously isn’t a technical problem to be solved — it’s a reflection of genuinely conflicting values in society, which means AI ethics inevitably requires making political choices about who bears which costs.

