Bias-Variance Tradeoff — Core Concepts

The Bias-Variance Decomposition

For a regression problem with true function $f(x)$, noisy observations $y = f(x) + \varepsilon$ (where $\varepsilon$ has mean zero and variance $\sigma^2$), and model $\hat{f}(x)$ trained on dataset $D$:

$$\mathbb{E}_{D,\varepsilon}[(y - \hat{f}(x))^2] = \underbrace{[\mathbb{E}_D[\hat{f}(x)] - f(x)]^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_D[(\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Noise}}$$

Here the expectation is taken over all possible training datasets of a given size and over the noise in the test observation $y$.

Bias: How far is the average prediction from the truth? Systematic error from model assumptions.

Variance: How much do predictions vary across different training datasets? Sensitivity to noise in training data.

Irreducible noise $\sigma^2$: Inherent randomness in the data-generating process. Cannot be reduced regardless of model.

The goal: minimize bias² + variance (you can’t reduce irreducible noise). These two terms trade off as model complexity changes.
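The decomposition can be checked numerically. The following is a minimal sketch, not a definitive experiment: it assumes a true function $f(x) = \sin(2\pi x)$, Gaussian noise, and polynomial models fitted with NumPy, all chosen here purely for illustration. The helper name `bias_variance_at_point` is hypothetical. It repeatedly draws training sets, refits the model, and estimates bias² and variance of the prediction at a single point.

```python
import numpy as np

def true_fn(x):
    # Assumed ground-truth function f(x); chosen only for illustration.
    return np.sin(2 * np.pi * x)

def bias_variance_at_point(degree, x0=0.25, n_train=30, n_datasets=500,
                           sigma=0.5, seed=0):
    """Monte Carlo estimate of bias^2 and variance of a polynomial fit at x0."""
    rng = np.random.default_rng(seed)
    preds = np.empty(n_datasets)
    for i in range(n_datasets):
        # Draw a fresh training set D of fixed size from the same process.
        x = rng.uniform(0.0, 1.0, n_train)
        y = true_fn(x) + rng.normal(0.0, sigma, n_train)
        coeffs = np.polyfit(x, y, degree)
        preds[i] = np.polyval(coeffs, x0)
    bias_sq = (preds.mean() - true_fn(x0)) ** 2  # [E_D f_hat(x0) - f(x0)]^2
    variance = preds.var()                       # E_D[(f_hat - E_D f_hat)^2]
    return bias_sq, variance, sigma ** 2

for degree in (1, 9):
    b, v, noise = bias_variance_at_point(degree)
    print(f"degree={degree}: bias^2={b:.4f}  variance={v:.4f}  noise={noise:.4f}")
```

Under these assumptions the linear fit should show the larger bias² and the degree-9 fit the larger variance; the exact numbers depend on the seed and on the chosen test point.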

The Classical Complexity Tradeoff

As model complexity increases, a model passes through three regimes:

Underfitting (high bias): Low-degree polynomial, linear model on non-linear data.

  • High bias: systematically wrong predictions
  • Low variance: predictions stable across different training sets (the model structure prevents over-adapting)
  • Training error ≈ validation error (both high)

Optimal complexity: The “sweet spot” where the sum of bias² and variance is smallest (the two terms need not be equal there).

  • Training error < validation error (some overfitting)
  • Validation error minimized

Overfitting (high variance): High-degree polynomial, deep neural network on small dataset.

  • Low bias: can fit the training data very well
  • High variance: predictions change wildly with different training sets (sensitive to which noise was sampled)
  • Training error << validation error (large generalization gap)
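The three regimes can be seen by fitting polynomials of increasing degree to one small dataset. This is a sketch under assumed conditions: the data-generating process ($\sin(2\pi x)$ plus Gaussian noise), the sample sizes, and the degrees 1, 3, and 10 are illustrative choices, not values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, sigma=0.3):
    # Assumed data-generating process: y = sin(2*pi*x) + Gaussian noise.
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, n)

x_train, y_train = make_data(30)
x_val, y_val = make_data(500)

def train_val_mse(degree):
    """Fit a polynomial on the training set; report train and validation MSE."""
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    return train_mse, val_mse

results = {d: train_val_mse(d) for d in (1, 3, 10)}
for d, (tr, va) in results.items():
    print(f"degree={d:2d}  train MSE={tr:.3f}  validation MSE={va:.3f}")
```

Training error can only go down as the degree grows, because each polynomial family contains the previous one. Validation error is the quantity that reveals which regime a model is in.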

Regularization: Directly Trading Bias for Variance

Regularization techniques explicitly add bias in exchange for reduced variance:

L2 regularization (Ridge): Add $\lambda \sum_i w_i^2$ to the loss. This shrinks all weights toward zero.

For linear regression, the L2-regularized solution is: $$\hat{w} = (X^TX + \lambda I)^{-1} X^T y$$

Compared to OLS $\hat{w} = (X^TX)^{-1} X^T y$, the $\lambda I$ term adds $\lambda$ to every eigenvalue of $X^TX$, which stabilizes the inverse. Along an eigen-direction with eigenvalue $d^2$, the OLS coefficient is multiplied by $d^2 / (d^2 + \lambda)$, so weights along low-variance directions in the data (where the fit is most sensitive to noise) are shrunk most aggressively.
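A small NumPy sketch of the closed form follows. It also computes the per-direction shrinkage factors $d^2 / (d^2 + \lambda)$, where $d$ are the singular values of $X$ (so $d^2$ are the eigenvalues of $X^TX$). The function names and the synthetic design matrix are assumptions made for illustration.

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Solve (X^T X + lam * I) w = X^T y without forming an explicit inverse."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

def ridge_shrinkage_factors(X, lam):
    """Factors d^2 / (d^2 + lam) applied along each singular direction of X."""
    d = np.linalg.svd(X, compute_uv=False)  # singular values, descending
    return d ** 2 / (d ** 2 + lam)

rng = np.random.default_rng(0)
# Hypothetical design matrix whose columns have very different scales.
X = rng.normal(size=(100, 3)) * np.array([5.0, 1.0, 0.1])
y = X @ np.array([1.0, -2.0, 3.0]) + rng.normal(scale=0.5, size=100)

w_ols = ridge_closed_form(X, y, lam=0.0)
w_ridge = ridge_closed_form(X, y, lam=10.0)
factors = ridge_shrinkage_factors(X, lam=10.0)

print("OLS weights:      ", w_ols)
print("Ridge weights:    ", w_ridge)
print("Shrinkage factors:", factors)
```

The factor for the smallest singular value is far below 1, while the factor for the largest stays close to 1, which is the "low-variance directions are shrunk most" behavior described above.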

L1 regularization (LASSO): Add $\lambda \sum_i |w_i|$ to the loss. Unlike L2, L1 drives some weights to exactly zero — feature selection.
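Why L1 produces exact zeros is easiest to see in a special case. The sketch below assumes an orthonormal design ($X^TX = I$) and the loss convention $\tfrac{1}{2}\lVert y - Xw \rVert^2 + \lambda \sum_i |w_i|$; under those assumptions the LASSO solution is the soft-thresholded OLS coefficient, while the ridge closed form reduces to a uniform rescaling by $1/(1+\lambda)$. The coefficient values are made up for the example.

```python
import numpy as np

def soft_threshold(z, lam):
    """Shrink each entry toward zero by lam; entries with |z| <= lam become 0."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ridge_shrink(z, lam):
    """Ridge solution for an orthonormal design: uniform rescaling, no zeros."""
    return z / (1.0 + lam)

z = np.array([3.0, -0.5, 0.2, -2.0])  # hypothetical OLS coefficients
w_l1 = soft_threshold(z, lam=1.0)
w_l2 = ridge_shrink(z, lam=1.0)

print("L1 (soft threshold):", w_l1)
print("L2 (rescaling):     ", w_l2)
```

Here the two small coefficients are set exactly to zero by L1, whereas L2 only halves every coefficient.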

Dropout: Randomly zero out neurons during training. Training this way approximates averaging an ensemble of many thinned sub-networks that share weights, which reduces variance by preventing co-adaptation between neurons.
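A minimal NumPy sketch of a dropout layer is shown here. It uses the common "inverted dropout" convention, which rescales the surviving activations at training time so that nothing changes at inference. That convention, and the function signature, are assumptions of this example rather than something stated above.

```python
import numpy as np

def dropout(activations, drop_prob, rng, training=True):
    """Inverted dropout: zero units with probability drop_prob, rescale the rest."""
    if not training or drop_prob == 0.0:
        return activations  # inference: identity, no rescaling needed
    keep_prob = 1.0 - drop_prob
    mask = rng.random(activations.shape) < keep_prob
    # Dividing by keep_prob keeps the expected activation unchanged.
    return activations * mask / keep_prob

rng = np.random.default_rng(0)
h = np.ones(8)
print("train:", dropout(h, drop_prob=0.5, rng=rng))
print("eval: ", dropout(h, drop_prob=0.5, rng=rng, training=False))
```

Each call in training mode samples a different mask, which corresponds to training a different sub-network on each step.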

Early stopping: Stop training once validation error stops improving. This acts as implicit regularization: gradient descent first reduces bias, then starts fitting noise, so stopping near the validation minimum approximates the best tradeoff point along the training path.
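Early stopping can be sketched as a training loop with a patience counter. The example below assumes plain gradient descent on a linear model with mean squared error. The synthetic data, learning rate, and patience value are illustrative choices, and `train_with_early_stopping` is a hypothetical helper name.

```python
import numpy as np

def train_with_early_stopping(X_tr, y_tr, X_val, y_val,
                              lr=0.01, max_steps=5000, patience=50):
    """Gradient descent on MSE; keep the weights with the best validation loss."""
    w = np.zeros(X_tr.shape[1])
    best_w, best_val, best_step = w.copy(), np.inf, 0
    history = []
    for step in range(max_steps):
        grad = 2.0 * X_tr.T @ (X_tr @ w - y_tr) / len(y_tr)
        w = w - lr * grad
        val = np.mean((X_val @ w - y_val) ** 2)
        history.append(val)
        if val < best_val:
            best_w, best_val, best_step = w.copy(), val, step
        elif step - best_step >= patience:
            break  # no improvement for `patience` steps: stop training
    return best_w, best_val, best_step, history

rng = np.random.default_rng(0)
n_features = 30
w_true = rng.normal(size=n_features)      # assumed ground-truth weights
X_tr = rng.normal(size=(40, n_features))  # few samples relative to features
y_tr = X_tr @ w_true + rng.normal(scale=2.0, size=40)
X_val = rng.normal(size=(200, n_features))
y_val = X_val @ w_true + rng.normal(scale=2.0, size=200)

best_w, best_val, best_step, history = train_with_early_stopping(
    X_tr, y_tr, X_val, y_val)
print(f"best validation MSE {best_val:.3f} at step {best_step}, "
      f"ran {len(history)} steps")
```

Returning the best weights seen so far, rather than the final ones, is a design choice: it makes the result insensitive to how large the patience window is.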

Double Descent: Modern Deep Learning Breaks the Rule

Classical bias-variance theory predicts a U-shaped generalization curve: as complexity increases, test error first falls, then rises.

Belkin et al. (2019) “Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-off” observed a double descent curve:

At the interpolation threshold (model just large enough to fit training data exactly): high test error (classical overfitting peak).

Beyond the interpolation threshold: test error decreases again as model size continues increasing — even for models that fit training data exactly.

This seems paradoxical, but one widely discussed resolution is that overparameterized models generalize well when training selects a low-norm interpolant. Among all parameter settings that fit the training data, gradient descent tends to find a low-norm one: for linear least squares started from zero it converges exactly to the minimum-norm solution, while for neural networks this implicit bias is only partly understood. A minimum-norm interpolant is typically smoother, and generalizes better, than the jagged solution at the interpolation threshold, where there is essentially only one way to fit the data.
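The minimum-norm claim can be verified directly in the linear case. The sketch below assumes an underdetermined least-squares problem (more parameters than samples) with random data, and gradient descent initialized at zero; it says nothing about deep networks, where the analogous statement is at best approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 10, 50  # more parameters than data points
X = rng.normal(size=(n_samples, n_features))
y = rng.normal(size=n_samples)

# Minimum-norm interpolant via the Moore-Penrose pseudoinverse.
w_min_norm = np.linalg.pinv(X) @ y

# Plain gradient descent on 0.5 * ||Xw - y||^2, started from zero.
w_gd = np.zeros(n_features)
step_size = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / largest eigenvalue of X^T X
for _ in range(5000):
    w_gd -= step_size * X.T @ (X @ w_gd - y)

# Another interpolant: add a component from the null space of X.
v = rng.normal(size=n_features)
null_component = v - np.linalg.pinv(X) @ (X @ v)
w_other = w_min_norm + null_component

print("GD matches min-norm solution:", np.allclose(w_gd, w_min_norm, atol=1e-6))
print("norm of min-norm interpolant:", np.linalg.norm(w_min_norm))
print("norm of other interpolant:   ", np.linalg.norm(w_other))
```

Both `w_min_norm` and `w_other` fit the training data exactly, but gradient descent from zero never leaves the row space of $X$, so it ends at the one with the smallest norm.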

Practical implication: For modern large neural networks, increasing model size often improves generalization even when training loss is already near zero. The classical advice “smaller models generalize better” does not reliably hold in this overparameterized regime.

Benign overfitting: Bartlett et al. (2020) proved that overparameterized linear models can achieve near-optimal prediction accuracy while interpolating noisy training data, under specific conditions on data and model structure. This provides part of a theoretical foundation for why large models can generalize.

Diagnosing Bias vs. Variance

High bias symptoms:

  • Training error is high (model can’t fit the training data well)
  • Validation error ≈ training error (they’re both high)
  • Adding more training data doesn’t help much

High variance symptoms:

  • Training error is very low
  • Validation error >> training error (large generalization gap)
  • Adding more training data significantly reduces validation error

Learning curves: Plot training and validation error vs. training set size. High bias: both curves plateau at a similar, high error. High variance: training error is low, validation error is high, and the two curves converge only slowly as the training set grows.
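Learning curves can be computed with a few lines of NumPy. The sketch below assumes the same kind of synthetic setup as a textbook example ($\sin(2\pi x)$ plus Gaussian noise) and uses a degree-1 polynomial as the high-bias model and a degree-9 polynomial as the high-variance model. These choices, and the training sizes, are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n, sigma=0.3):
    # Assumed data-generating process: y = sin(2*pi*x) + Gaussian noise.
    x = rng.uniform(0.0, 1.0, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0.0, sigma, n)

x_pool, y_pool = sample(200)  # training pool; prefixes give growing training sets
x_val, y_val = sample(500)    # fixed validation set

def learning_curve(degree, sizes):
    """Train and validation MSE of a polynomial fit for each training size."""
    train_err, val_err = [], []
    for n in sizes:
        coeffs = np.polyfit(x_pool[:n], y_pool[:n], degree)
        train_err.append(np.mean((np.polyval(coeffs, x_pool[:n]) - y_pool[:n]) ** 2))
        val_err.append(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))
    return np.array(train_err), np.array(val_err)

sizes = [15, 30, 60, 120, 200]
train_lo, val_lo = learning_curve(1, sizes)  # high-bias model
train_hi, val_hi = learning_curve(9, sizes)  # high-variance model

for n, a, b, c, d in zip(sizes, train_lo, val_lo, train_hi, val_hi):
    print(f"n={n:3d}  deg1 train={a:.3f} val={b:.3f}  deg9 train={c:.3f} val={d:.3f}")
```

Plotting the four arrays against `sizes` (for example with matplotlib) gives the curves described above: the degree-1 pair flattens out at a high error, while the degree-9 validation curve keeps dropping as data is added.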

One thing to remember: The bias-variance tradeoff is the foundation of why regularization, early stopping, and ensemble methods work — but the double descent phenomenon shows that the classical tradeoff is an incomplete story, especially for the overparameterized neural networks that dominate modern AI.

