Bias-Variance Tradeoff — Deep Dive

Formal Proof of the Bias-Variance Decomposition

For any estimator $\hat{f}$ and data point $(x, y)$ with $y = f(x) + \epsilon$, $\epsilon \sim \mathcal{N}(0, \sigma^2)$:

$$\mathbb{E}[(y - \hat{f})^2] = \mathbb{E}[(f(x) + \epsilon - \hat{f})^2]$$ $$= \mathbb{E}[f(x)^2 + \epsilon^2 + \hat{f}^2 + 2f(x)\epsilon - 2f(x)\hat{f} - 2\epsilon\hat{f}]$$

Since $\mathbb{E}[\epsilon] = 0$ and $\epsilon$ is independent of $\hat{f}$: $$= f(x)^2 + \sigma^2 + \mathbb{E}[\hat{f}^2] - 2f(x)\mathbb{E}[\hat{f}]$$

Adding and subtracting $\mathbb{E}[\hat{f}]^2$: $$= (f(x) - \mathbb{E}[\hat{f}])^2 + \mathbb{E}[\hat{f}^2] - \mathbb{E}[\hat{f}]^2 + \sigma^2$$ $$= \underbrace{(f(x) - \mathbb{E}[\hat{f}])^2}{\text{Bias}^2} + \underbrace{\text{Var}(\hat{f})}{\text{Variance}} + \underbrace{\sigma^2}_{\text{Irreducible noise}}$$

For classification (0/1 loss), the decomposition is different and more complex (Kohavi & Wolpert, 1996). The bias-variance decomposition is exact only for squared error loss.

Bias-Variance in Kernel Methods

Kernel regression provides clean geometric intuition. For kernel $K$, the minimum-norm interpolant in reproducing kernel Hilbert space (RKHS) is:

$$\hat{f} = \sum_i \alpha_i K(x, x_i), \quad \text{where} \quad K\alpha = y$$

The test error decomposes as: $$\mathbb{E}[|\hat{f} - f^|^2] = \underbrace{|P_\perp f^|^2}{\text{Bias}} + \underbrace{\sigma^2 \text{tr}(K^{-1})}{\text{Variance}}$$

Where $P_\perp f^$ is the component of $f^$ orthogonal to the span of the kernel basis.

As the regularization parameter $\lambda$ in kernel ridge regression increases:

  • Bias² increases (model fits data less precisely, orthogonal component grows)
  • Variance decreases (tr($K + \lambda I)^{-1}$) shrinks)

This is the exact mathematical expression of the bias-variance tradeoff in kernel methods.

Neural Tangent Kernel and Infinite-Width Networks

Jacot et al. (2018) “Neural Tangent Kernel: Convergence and Generalization in Neural Networks” showed that infinitely wide neural networks are equivalent to kernel regression with a specific kernel — the NTK.

For a neural network $f(x; \theta)$ with infinitely wide hidden layers, define:

$$\Theta(x, x’) = \nabla_\theta f(x;\theta)^T \nabla_\theta f(x’;\theta)$$

In the infinite width limit, $\Theta$ converges to a deterministic kernel $\Theta^$ that doesn’t change during training. The network trained by gradient descent converges to the same solution as kernel regression with kernel $\Theta^$.

Implications for bias-variance:

In the kernel regime (wide, early training): The network is effectively a linear model in the tangent space. Bias = projection of true function onto RKHS orthogonal complement; variance = model complexity determined by kernel.

This gives a rigorous basis for analyzing when wide neural networks generalize: they generalize well when the true function $f^*$ has low complexity under the NTK — i.e., can be well-represented as a combination of the NTK kernel functions.

SGD Noise as Variance Inducer

Standard gradient descent has zero variance (deterministic updates). Stochastic gradient descent introduces mini-batch variance.

For loss $\mathcal{L}(\theta) = \frac{1}{n}\sum_i \ell(f(x_i; \theta), y_i)$:

True gradient: $\nabla \mathcal{L}(\theta) = \frac{1}{n}\sum_i \nabla \ell_i(\theta)$

Stochastic gradient (batch size $b$): $\hat{g} = \frac{1}{b}\sum_{i \in B} \nabla \ell_i(\theta)$

The variance of the stochastic gradient: $\text{Var}(\hat{g}) = \frac{\sigma_g^2}{b}$ where $\sigma_g^2$ is the per-sample gradient variance.

SGD noise as regularization (Hoffer et al., 2017; Mandt et al., 2017): Under certain conditions, SGD with learning rate $\eta$ and batch size $b$ behaves like Langevin dynamics — effectively sampling from a posterior distribution with effective temperature $T_{eff} = \eta \sigma_g^2 / (2b)$.

This means SGD implicitly adds a bias-variance tradeoff: large learning rates and small batches (high effective temperature) encourage exploration and reduce variance in the model ensemble sense. Small learning rates and large batches (low effective temperature) reduce variance but can get stuck in local optima.

The “generalization of large batch training” problem: Large batch training reduces gradient variance and is faster per epoch, but often generalizes worse than small batch training. Explanations:

  1. Less noise → converges to sharper local minima that generalize worse
  2. Less noise → equivalent to fewer exploration steps before convergence
  3. Batch normalization statistics differ between large and small batch settings

Mitigations: learning rate warmup, ghost batch norm, higher learning rates for large batches.

Double Descent: Formal Understanding

Bartlett et al. (2020) “Benign Overfitting in Linear Regression” proved that overparameterized linear models can achieve near-optimal test error while perfectly interpolating noisy training data under the following conditions:

Setting: $y = Xw^* + \epsilon$ with $X \in \mathbb{R}^{n \times d}$, $d > n$ (overparameterized).

Minimum norm estimator: $\hat{w} = X^+ y$ where $X^+$ is the Moore-Penrose pseudoinverse.

Test error decomposition: $$\mathbb{E}[|\hat{w} - w^|^2] = \underbrace{|P_0 w^|^2}{\text{Bias}} + \underbrace{\sigma^2 \text{tr}(P_1)}{\text{Variance}}$$

Where $P_0 = I - X^+X$ (projection onto null space of $X$) and $P_1 = X^+ (X^+)^T$.

For high-dimensional data with specific distribution properties (effectively random $X$ with controlled spectrum), both bias and variance vanish as $d \rightarrow \infty$ with $d/n \rightarrow \infty$ — benign overfitting.

The key condition: the data matrix $X$ must have enough signal in the right subspace that the minimum norm estimator can “spread” the interpolation error across many dimensions, keeping any individual dimension’s contribution small.

One thing to remember: The bias-variance decomposition is exact and rigorous, but modern deep learning’s double descent regime shows that the classical assumption — more complexity means more variance — breaks down when models are sufficiently overparameterized and trained with implicit regularization through gradient descent.

bias-variancedouble-descentneural-tangent-kernelkernel-methodssgd-noise

See Also

  • Ai Hallucinations ChatGPT sometimes makes up facts with total confidence. Here's the weird reason why — and why it's not as simple as 'the AI lied.'
  • Artificial Intelligence What is AI really? Think of it as a dog that learned tricks — impressive, but it doesn't know why it's doing them.
  • Deep Learning Why your phone can spot your face in a messy photo album — and why that trick comes from practice, not magic.
  • Embeddings How do computers know that 'dog' and 'puppy' mean almost the same thing? They don't read definitions — they turn words into secret map coordinates, and nearby coordinates mean nearby meanings.
  • Generative Ai Generative AI doesn't look things up — it makes things up. Here's why that's either impressive or terrifying, depending on what you ask it to make.