Bias-Variance Tradeoff — Core Concepts

The mathematical decomposition of expected error, regularization as variance reduction, the double descent phenomenon in deep learning, and why the classical tradeoff doesn't hold for large models.

The Bias-Variance Decomposition

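For a regression problem with true function $f(x)$, noisy targets $y = f(x) + \varepsilon$ (where $\varepsilon$ has mean zero and variance $\sigma^2$), and model $\hat{f}(x)$ trained on dataset $D$: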

$$\mathbb{E}_D[(y - \hat{f}(x))^2] = \underbrace{[\mathbb{E}_D[\hat{f}(x)] - f(x)]^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}_D[(\hat{f}(x) - \mathbb{E}_D[\hat{f}(x)])^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible Noise}}$$

Here the expectation is taken over all possible training datasets of a given size, and over the noise in $y$.

Bias: How far is the average prediction from the truth? Systematic error from model assumptions.

Variance: How much do predictions vary across different training datasets? Sensitivity to noise in training data.

Irreducible noise $\sigma^2$: Inherent randomness in the data-generating process. Cannot be reduced regardless of model.
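A sketch of where the three terms come from, assuming $y = f(x) + \varepsilon$ with $\mathbb{E}[\varepsilon] = 0$, $\mathrm{Var}(\varepsilon) = \sigma^2$, and $\varepsilon$ independent of the training set. The symbols $\varepsilon$ and $\bar{f}(x) = \mathbb{E}_D[\hat{f}(x)]$ are shorthand introduced here for illustration. First, separate the noise:

$$\mathbb{E}[(y - \hat{f}(x))^2] = \mathbb{E}_D[(f(x) - \hat{f}(x))^2] + 2\,\mathbb{E}[\varepsilon]\,\mathbb{E}_D[f(x) - \hat{f}(x)] + \mathbb{E}[\varepsilon^2] = \mathbb{E}_D[(f(x) - \hat{f}(x))^2] + \sigma^2$$

Then add and subtract $\bar{f}(x)$ inside the remaining square:

$$\mathbb{E}_D[(f(x) - \hat{f}(x))^2] = (f(x) - \bar{f}(x))^2 + 2\,(f(x) - \bar{f}(x))\,\mathbb{E}_D[\bar{f}(x) - \hat{f}(x)] + \mathbb{E}_D[(\bar{f}(x) - \hat{f}(x))^2]$$

The cross term vanishes because $\mathbb{E}_D[\hat{f}(x)] = \bar{f}(x)$, which leaves the squared bias and the variance.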

The goal: minimize bias² + variance (you can’t reduce irreducible noise). These two terms trade off as model complexity changes.
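The decomposition can be checked empirically by refitting the same model on many simulated training sets. The following is a minimal sketch, not a definitive procedure: the target function `true_f`, the noise level, the polynomial model family, and the helper name `estimate_bias_variance` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # Hypothetical ground-truth function for the simulation.
    return np.sin(2 * np.pi * x)

def estimate_bias_variance(degree, n_train=30, n_datasets=500, sigma=0.3):
    """Monte Carlo estimate of bias^2 and variance, averaged over a test grid."""
    x_test = np.linspace(0.05, 0.95, 50)
    preds = np.empty((n_datasets, x_test.size))
    for i in range(n_datasets):
        # Draw a fresh training set and fit a polynomial of the given degree.
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, sigma, n_train)
        coefs = np.polyfit(x, y, degree)
        preds[i] = np.polyval(coefs, x_test)
    mean_pred = preds.mean(axis=0)                     # estimate of E_D[f_hat(x)]
    bias_sq = np.mean((mean_pred - true_f(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))              # spread across datasets
    return bias_sq, variance

results = {d: estimate_bias_variance(d) for d in (1, 4, 9)}
for d, (b, v) in results.items():
    print(f"degree={d}: bias^2={b:.4f}, variance={v:.4f}")
```

Under these assumptions the linear fit should show the largest squared bias and the degree-9 fit the largest variance.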

The Classical Complexity Tradeoff

As model complexity increases:

Underfitting (high bias): Low-degree polynomial, linear model on non-linear data.

High bias: systematically wrong predictions
Low variance: predictions stable across different training sets (the model structure prevents over-adapting)
Training error ≈ validation error (both high)

Optimal complexity: The “sweet spot” where bias and variance are balanced.

Training error < validation error (some overfitting)
Validation error minimized

Overfitting (high variance): High-degree polynomial, deep neural network on a small dataset.

Low bias: can fit the training data very well
High variance: predictions change wildly with different training sets (sensitive to which noise was sampled)
Training error << validation error (large generalization gap)
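The three regimes above can be reproduced with a small complexity sweep. This is a hedged sketch on synthetic data: the sine target, sample sizes, and degree range are illustrative assumptions, and polynomial degree stands in for "model complexity".

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n, sigma=0.3):
    # Synthetic regression data: noisy samples of an assumed sine target.
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, sigma, n)

x_tr, y_tr = make_data(25)     # small training set
x_va, y_va = make_data(1000)   # large validation set

def mse(coefs, x, y):
    return float(np.mean((np.polyval(coefs, x) - y) ** 2))

errors = {}
for degree in range(0, 13):
    coefs = np.polyfit(x_tr, y_tr, degree)
    errors[degree] = (mse(coefs, x_tr, y_tr), mse(coefs, x_va, y_va))
    print(f"degree={degree:2d}  train={errors[degree][0]:.3f}  val={errors[degree][1]:.3f}")

# Degree with the lowest validation error: the empirical "sweet spot".
best = min(errors, key=lambda d: errors[d][1])
print("best degree by validation error:", best)
```

Training error can only go down as the degree grows, while validation error is expected to bottom out at a moderate degree and then rise.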

Regularization: Directly Trading Bias for Variance

Regularization techniques explicitly add bias in exchange for reduced variance:

L2 regularization (Ridge): Add $\lambda \sum_i w_i^2$ to the loss. This shrinks all weights toward zero.

For linear regression, the L2-regularized solution: $$\hat{w} = (X^TX + \lambda I)^{-1} X^T y$$

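Compared to OLS $\hat{w} = (X^TX)^{-1} X^T y$, the $\lambda I$ term adds $\lambda$ to every eigenvalue of $X^TX$ before inversion. Along a principal direction of the data with singular value $d_j$, the OLS coefficient is multiplied by $d_j^2 / (d_j^2 + \lambda)$, so weights along low-variance directions in the data (small $d_j$, where the OLS estimate is least stable) are shrunk most aggressively.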
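The closed-form solution and its per-direction shrinkage can be sketched in NumPy as follows. The data here is random and purely illustrative; `lam` plays the role of $\lambda$, and the SVD route is one assumed way of exposing the shrinkage factors $d_j^2 / (d_j^2 + \lambda)$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, lam = 50, 5, 10.0
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

# Closed-form ridge solution: solve (X^T X + lam I) w = X^T y.
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
# OLS for comparison (lam = 0).
w_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Same ridge solution via the SVD X = U diag(d) V^T.
# OLS is V diag(1/d) U^T y; ridge multiplies each direction by shrink_j.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)
w_svd = Vt.T @ (shrink * (U.T @ y) / d)

print("shrinkage factors (largest to smallest singular value):", np.round(shrink, 3))
print("||w_ols|| =", np.linalg.norm(w_ols), " ||w_ridge|| =", np.linalg.norm(w_ridge))
```

Because `np.linalg.svd` returns singular values in descending order, the shrinkage factors decrease from left to right: the weakest directions are shrunk hardest.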

L1 regularization (LASSO): Add $\lambda \sum_i |w_i|$ to the loss. Unlike L2, L1 drives some weights to exactly zero — feature selection.
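Why L1 produces exact zeros is easiest to see in a special case. The sketch below assumes an orthonormal design ($X^TX = I$) and a loss of the form $\|y - Xw\|^2 + \text{penalty}$; under those assumptions the LASSO solution is the soft-thresholded OLS solution with threshold $\lambda/2$, while ridge simply rescales by $1/(1+\lambda)$. The OLS coefficients are made-up numbers for illustration.

```python
import numpy as np

def soft_threshold(w, t):
    """Shrink each entry toward zero by t, clipping at exactly zero."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

# Hypothetical OLS coefficients under an orthonormal design.
w_ols = np.array([3.0, -1.5, 0.4, -0.2, 0.05])
lam = 1.0

w_lasso = soft_threshold(w_ols, lam / 2)  # L1: subtract a constant, clip at 0
w_ridge = w_ols / (1 + lam)               # L2: multiply by a constant factor

print("lasso:", w_lasso)
print("ridge:", w_ridge)
```

The small coefficients are set exactly to zero by the L1 solution, whereas the L2 solution keeps every coefficient nonzero and only scales it down.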

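Dropout: Randomly zero out neurons during training. Each training step updates a different thinned sub-network, and using the full network at test time approximates averaging an ensemble of these weight-sharing sub-networks. This reduces variance by preventing co-adaptation of neurons.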
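A minimal sketch of the masking step, using the "inverted dropout" variant (an assumption here; it rescales surviving activations during training so that nothing changes at test time). The function name and the toy activations are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def dropout(activations, p_drop, training=True):
    """Inverted dropout: zero units with probability p_drop, rescale the rest."""
    if not training or p_drop == 0.0:
        return activations                      # identity at evaluation time
    keep = rng.random(activations.shape) >= p_drop
    # Dividing by the keep probability preserves the expected activation.
    return activations * keep / (1.0 - p_drop)

h = np.ones((10000, 8))                         # toy layer activations
h_train = dropout(h, p_drop=0.5)
h_eval = dropout(h, p_drop=0.5, training=False)

print("mean activation during training:", h_train.mean())
print("fraction of units dropped:", np.mean(h_train == 0))
```

Each call samples a different mask, which is what makes every training step act on a different sub-network.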

Early stopping: Stop training once validation error stops improving. This acts as implicit regularization: gradient descent first reduces bias, then begins fitting noise in the training set, and stopping early approximates the best tradeoff point along the training path.
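One common way to implement this is a patience rule: keep the best weights seen so far and stop after a fixed number of steps without improvement. The sketch below applies it to gradient descent on a linear model with synthetic data; the problem sizes, learning rate, and `patience` value are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 40, 30                                   # few samples, many features
w_true = np.zeros(p)
w_true[:3] = [2.0, -1.0, 0.5]
X_tr, X_va = rng.normal(size=(n, p)), rng.normal(size=(500, p))
y_tr = X_tr @ w_true + rng.normal(scale=1.0, size=n)
y_va = X_va @ w_true + rng.normal(scale=1.0, size=500)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

w = np.zeros(p)
initial_val = mse(w, X_va, y_va)
best_w, best_val, best_step = w.copy(), initial_val, 0
patience, lr = 50, 0.005

for step in range(1, 5001):
    grad = 2 * X_tr.T @ (X_tr @ w - y_tr) / n   # gradient of training MSE
    w -= lr * grad
    val = mse(w, X_va, y_va)
    if val < best_val:
        # New best checkpoint on the validation set.
        best_w, best_val, best_step = w.copy(), val, step
    elif step - best_step >= patience:
        break                                   # no improvement for `patience` steps

print(f"stopped at step {step}, best step {best_step}, best val MSE {best_val:.3f}")
```

The weights to keep are `best_w`, not the final `w`, since the last `patience` steps have already moved past the best validation point.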

Double Descent: Modern Deep Learning Breaks the Rule

Classical bias-variance theory predicts a U-shaped generalization curve: test error first falls, then rises, as model complexity increases.

Belkin et al. (2019) “Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-off” observed a double descent curve:

At the interpolation threshold (model just large enough to fit training data exactly): high test error (classical overfitting peak).

Beyond the interpolation threshold: test error decreases again as model size continues increasing — even for models that fit training data exactly.

This seems paradoxical, but a leading explanation is that over-parameterized models generalize well when training selects a minimum-norm interpolant. Among all functions that fit the training data, gradient descent tends toward a simple one: for linear least squares started from zero it converges exactly to the minimum-norm solution, and a similar implicit bias toward low-complexity solutions is conjectured for deep networks. This minimum-norm interpolant generalizes better than the jagged, overfit solution at the interpolation threshold, where there is essentially only one way to fit the data.
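For linear least squares the minimum-norm claim can be verified directly. The sketch below uses random, illustrative data with more parameters than samples, and compares the pseudoinverse solution with gradient descent started from zero; it says nothing about deep networks, where the analogous behavior is not guaranteed.

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 100                                  # more parameters than samples
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Minimum-norm interpolant via the Moore-Penrose pseudoinverse.
w_min = np.linalg.pinv(X) @ y

# Gradient descent on 0.5 * ||Xw - y||^2, started at zero.
w_gd = np.zeros(p)
lr = 0.5 / np.linalg.norm(X, 2) ** 2            # safe step size from the spectral norm
for _ in range(2000):
    w_gd -= lr * X.T @ (X @ w_gd - y)

# A different interpolant: add a component from the null space of X.
null_dir = rng.normal(size=p)
null_dir -= np.linalg.pinv(X) @ (X @ null_dir)  # remove the row-space part
w_other = w_min + null_dir

print("GD matches min-norm solution:", np.allclose(w_gd, w_min, atol=1e-6))
print("norms:", np.linalg.norm(w_min), "<", np.linalg.norm(w_other))
```

Both `w_min` and `w_other` fit the training data exactly, but gradient descent from zero never leaves the row space of `X`, so it lands on the smaller-norm one.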

Practical implication: For modern large neural networks, increasing model size often improves generalization even when training loss is already near zero. The classical advice “smaller models generalize better” does not always hold at large scale.

Benign overfitting: Bartlett et al. (2020) proved that overparameterized linear models using the minimum-norm interpolant can achieve near-optimal prediction accuracy while fitting noisy training data exactly, under specific conditions on the covariance structure of the data. This gives a partial theoretical foundation for why large models can generalize.

Diagnosing Bias vs. Variance

High bias symptoms:

Training error is high (model can’t fit the training data well)
Validation error ≈ training error (they’re both high)
Adding more training data doesn’t help much

High variance symptoms:

Training error is very low
Validation error >> training error (large generalization gap)
Adding more training data significantly reduces validation error

Learning curves: Plot training and validation error vs. training set size. High bias: both curves plateau at a similar, high error. High variance: training error is low, validation error is high, and the two converge only slowly as training size increases.
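The two learning-curve shapes can be generated numerically. In this sketch a linear fit stands in for a high-bias model and a degree-9 polynomial for a high-variance one; the synthetic sine data, the sizes, and the helper name `learning_curve` are assumptions for illustration. Medians are used across repeats because small-sample polynomial fits occasionally produce extreme errors.

```python
import numpy as np

rng = np.random.default_rng(6)

def make_data(n, sigma=0.3):
    # Synthetic regression data: noisy samples of an assumed sine target.
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, sigma, n)

x_va, y_va = make_data(2000)                    # fixed validation set

def learning_curve(degree, sizes, repeats=30):
    """Median train/validation MSE of a polynomial fit at each training size."""
    rows = []
    for n in sizes:
        tr, va = [], []
        for _ in range(repeats):
            x, y = make_data(n)
            c = np.polyfit(x, y, degree)
            tr.append(np.mean((np.polyval(c, x) - y) ** 2))
            va.append(np.mean((np.polyval(c, x_va) - y_va) ** 2))
        rows.append((n, float(np.median(tr)), float(np.median(va))))
    return rows

sizes = [20, 50, 200, 1000]
curve_bias = learning_curve(degree=1, sizes=sizes)   # high-bias model
curve_var = learning_curve(degree=9, sizes=sizes)    # high-variance model

for name, curve in (("degree 1", curve_bias), ("degree 9", curve_var)):
    for n, tr, va in curve:
        print(f"{name}  n={n:4d}  train={tr:.3f}  val={va:.3f}")
```

For the linear model the two errors should meet at a high plateau, while for the degree-9 model the gap should start wide and close as the training set grows.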

One thing to remember: The bias-variance tradeoff is the foundation of why regularization, early stopping, and ensemble methods work — but the double descent phenomenon shows that the classical tradeoff is an incomplete story, especially for the overparameterized neural networks that dominate modern AI.
