Activation Functions — Deep Dive

Deep vs. Wide: The Depth Advantage

The Universal Approximation Theorem says one hidden layer is sufficient — in principle. In practice, deep networks are dramatically more efficient than shallow ones.

Expressiveness: A deep network with $k$ layers, each of width $n$, can represent functions that a single-hidden-layer network could match only with exponentially more units. Specifically, for piecewise-linear functions (which ReLU networks implement), a $k$-layer ReLU network can produce exponentially more linear regions than a single-hidden-layer network with the same number of parameters.

Count of linear regions (Montufar et al., 2014): A deep ReLU network with $n_0$ inputs, $k$ hidden layers of width $n \geq n_0$, and 1 output can compute functions with at least:

$$\left\lfloor \frac{n}{n_0} \right\rfloor^{n_0 (k-1)} \sum_{j=0}^{n_0} \binom{n}{j}$$

linear regions, which is exponential in depth $k$. A single hidden layer with the same total number of units, $kn$, achieves at most $\sum_{j=0}^{n_0} \binom{kn}{j} = O((kn)^{n_0})$ regions, which grows only polynomially in $k$.
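To get a feel for the size of the gap, here is a minimal sketch that evaluates two bounds in the form usually attributed to Montufar et al. (2014): a lower bound $\lfloor n/n_0 \rfloor^{n_0(k-1)} \sum_{j=0}^{n_0} \binom{n}{j}$ for the deep network, and an upper bound $\sum_{j=0}^{n_0} \binom{kn}{j}$ for a single hidden layer with the same total number of units. The function names and the choice $n_0 = 2$, $n = 8$ are illustrative assumptions, and matching on unit count (rather than exact parameter count) is a simplification.

```python
from math import comb


def deep_lower_bound(n0: int, n: int, k: int) -> int:
    """Lower bound on the maximal region count for k hidden layers of width n >= n0."""
    if n < n0:
        raise ValueError("the bound assumes n >= n0")
    # Each of the first k-1 layers can fold the input space floor(n/n0)^n0 times.
    folding = (n // n0) ** (n0 * (k - 1))
    # The last layer contributes the region count of n hyperplanes in n0 dimensions.
    last_layer = sum(comb(n, j) for j in range(n0 + 1))
    return folding * last_layer


def shallow_upper_bound(n0: int, units: int) -> int:
    """Maximal number of regions cut out by `units` hyperplanes in n0 dimensions."""
    return sum(comb(units, j) for j in range(n0 + 1))


n0, n = 2, 8  # illustrative sizes, not taken from the text
for k in (1, 2, 4, 8):
    deep = deep_lower_bound(n0, n, k)
    shallow = shallow_upper_bound(n0, k * n)
    print(f"k={k}: deep >= {deep}, shallow (same unit count) <= {shallow}")
```

With these sizes the two bounds coincide at $k = 1$ (37 regions), and each extra layer multiplies the deep bound by 16 while the shallow bound grows only quadratically in $k$.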

This depth advantage is often cited as one reason deep networks can generalize well with fewer parameters: by composing layers they represent complex functions compactly, instead of spending separate parameters on every linear piece.

ReLU Networks as Polyhedral Functions

A ReLU network implements a continuous piecewise-linear (CPWL) function. Each neuron’s ReLU activation creates a “bend” in the function — above the threshold, it’s linear; below, it’s zero.

For a network with $n$ hidden units, the input space is divided into up to $2^n$ polyhedral regions. In each region, the network implements a different linear function (the specific subset of neurons that are “on” determines the linear piece).

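The arrangement: The boundaries between regions are hyperplanes for first-layer neurons (each neuron’s zero-activation boundary); boundaries contributed by deeper layers are piecewise-linear, bending wherever an earlier neuron switches on or off. Each input falls into exactly one region of this polyhedral decomposition, and the network applies that region’s linear map to it.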
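The decomposition can be made concrete by enumerating activation patterns on a small example. The sketch below uses a hypothetical layer of three ReLU units on a 2-D input (the weights and biases are chosen purely for illustration) and records which units are "on" at each point of a sampling grid. It also shows that $2^n$ is an upper bound rather than the typical count.

```python
from itertools import product

# Hypothetical toy layer: 3 ReLU units on a 2-D input, z_i = w_i . x + b_i.
WEIGHTS = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
BIASES = [0.0, 0.0, -1.0]


def activation_pattern(x):
    """Return a tuple of 0/1 flags marking which units are on (z > 0) at input x."""
    return tuple(
        int(w[0] * x[0] + w[1] * x[1] + b > 0)
        for w, b in zip(WEIGHTS, BIASES)
    )


# Sample a grid over [-3, 3]^2 and collect the distinct on/off patterns.
grid = [i / 4 for i in range(-12, 13)]
patterns = {activation_pattern(x) for x in product(grid, grid)}

print(len(patterns), "patterns observed out of", 2 ** len(WEIGHTS))
```

Three lines in general position split the plane into 7 regions, so one of the 8 conceivable patterns never occurs: no point has $x_1 \leq 0$, $x_2 \leq 0$ and $x_1 + x_2 > 1$ at the same time.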

This geometric interpretation has practical implications:

  • Initialization: If weights are poorly initialized, most neurons might be in the same (off or on) state, reducing effective regions and expressiveness
  • Network collapse: When all paths through the network produce the same linear piece, the network behaves like a single linear transform
  • Generalization: Functions that require few linear pieces generalize better — they’re smoother and less overfit to training data

Dying ReLU: Quantitative Analysis

How severe is the dying ReLU problem? For a network initialized with Kaiming (He) initialization:

A neuron receives input $z = \sum_i w_i x_i + b$. With Kaiming initialization, $w_i \sim \mathcal{N}(0, 2/\text{fan_in})$ and $b = 0$.

For a random input $x$ with components $\sim \mathcal{N}(0, 1)$: $$\text{Var}(z) = \text{fan_in} \cdot \text{Var}(w) \cdot \text{Var}(x) = \text{fan_in} \cdot \frac{2}{\text{fan_in}} \cdot 1 = 2$$

So $z \sim \mathcal{N}(0, 2)$ approximately, and by symmetry $P(z > 0) = 0.5$: roughly half of the neurons produce a non-zero ReLU output at initialization.
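The variance calculation is easy to check by simulation. The following is a minimal Monte Carlo sketch, assuming a fresh weight vector and a fresh standard-normal input are drawn for every sample; `fan_in=64`, the sample count, and the seed are arbitrary illustrative choices.

```python
import random


def simulate_preactivations(fan_in=64, samples=5000, seed=0):
    """Draw z = sum_i w_i * x_i with w_i ~ N(0, 2/fan_in), x_i ~ N(0, 1), b = 0."""
    rng = random.Random(seed)
    w_std = (2.0 / fan_in) ** 0.5  # random.gauss takes a standard deviation
    zs = []
    for _ in range(samples):
        z = sum(rng.gauss(0.0, w_std) * rng.gauss(0.0, 1.0) for _ in range(fan_in))
        zs.append(z)
    return zs


zs = simulate_preactivations()
mean = sum(zs) / len(zs)
var = sum((z - mean) ** 2 for z in zs) / len(zs)
frac_active = sum(z > 0 for z in zs) / len(zs)
print(f"Var(z) ~ {var:.2f}, fraction with z > 0 ~ {frac_active:.2f}")
```

The estimates should land close to 2 and 0.5 respectively, up to sampling noise.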

After a large gradient step $\Delta w = -\eta \nabla_w \mathcal{L}$, weights can shift such that $P(z < 0)$ approaches 1 for some neurons. Once a neuron outputs zero for every training input, no gradient reaches its weights, so it typically stays dead for the rest of training.
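A single neuron is enough to illustrate the mechanism. The sketch below is a toy setup under stated assumptions: four hypothetical non-negative inputs, an upstream gradient of 1 for every example, and two step sizes chosen only to contrast a small and an oversized update (they are not meant as realistic learning rates).

```python
def relu_neuron_grad(w, b, xs, upstream):
    """Gradient of sum_i upstream_i * relu(w . x_i + b) with respect to (w, b)."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    active = 0
    for x, g in zip(xs, upstream):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        if z > 0:  # ReLU passes gradient only where z > 0
            active += 1
            grad_w = [gw + g * xi for gw, xi in zip(grad_w, x)]
            grad_b += g
    return grad_w, grad_b, active


# Hypothetical non-negative inputs, e.g. outputs of an earlier ReLU layer.
XS = [(1.0, 0.5), (0.2, 1.5), (2.0, 0.1), (0.7, 0.7)]
UPSTREAM = [1.0, 1.0, 1.0, 1.0]  # assumed dL/d(output) for each example


def step_then_inspect(lr, w=(0.5, 0.5), b=0.0):
    """Take one gradient step, then report the active count and the next weight gradient."""
    grad_w, grad_b, _ = relu_neuron_grad(w, b, XS, UPSTREAM)
    w_new = [wi - lr * gi for wi, gi in zip(w, grad_w)]
    b_new = b - lr * grad_b
    next_grad_w, _, active = relu_neuron_grad(w_new, b_new, XS, UPSTREAM)
    return active, next_grad_w


for lr in (0.01, 1.0):
    active, next_grad_w = step_then_inspect(lr)
    print(f"lr={lr}: active examples after step = {active}, next grad_w = {next_grad_w}")
```

After the small step all four examples still activate the neuron. After the oversized step none do, the next gradient is exactly zero, and plain gradient descent has no way to move the weights back.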

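With a small learning rate such as $\eta = 0.01$, the probability of a neuron dying on any given training step is low; with a larger rate such as $\eta = 0.1$, the risk can become significant. The exact thresholds depend on the optimizer, the architecture, and the scale of the data.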

Mitigation via initialization: He initialization (He et al., 2015), also called Kaiming initialization, was specifically designed to maintain variance through ReLU layers: $$\text{Var}(w) = \frac{2}{\text{fan_in}}$$

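This keeps the second moment of the activations approximately constant through deep networks: ReLU zeroes the negative half of $z$, so $\mathbb{E}[\text{ReLU}(z)^2] = \text{Var}(z)/2 = 1$ when $\text{Var}(z) = 2$, which is exactly what the factor of 2 in the weight variance compensates for.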
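The effect of the factor 2 shows up clearly when activations are pushed through a stack of ReLU layers. The sketch below is a small pure-Python experiment, not a benchmark: it compares $\text{Var}(w) = 2/\text{fan_in}$ against a baseline of $\text{Var}(w) = 1/\text{fan_in}$, which is introduced here only for contrast. Width, depth, batch size, and seed are arbitrary assumptions.

```python
import random


def relu_stack_second_moment(width=64, depth=6, batch=20, seed=0):
    """Mean squared activation after `depth` ReLU layers, for two weight scales.

    The same standard-normal draws are reused for both schemes, so the only
    difference is the scale: Var(w) = 2/fan_in (He) vs. Var(w) = 1/fan_in.
    """
    rng = random.Random(seed)
    xs = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    layers = [
        [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(width)]
        for _ in range(depth)
    ]
    results = {}
    for name, var_scale in (("he", 2.0), ("naive", 1.0)):
        std = (var_scale / width) ** 0.5
        acts = xs
        for base in layers:
            # One fully connected layer (no bias) followed by ReLU.
            acts = [
                [max(0.0, std * sum(w * a for w, a in zip(row, x))) for row in base]
                for x in acts
            ]
        total = sum(a * a for x in acts for a in x)
        results[name] = total / (batch * width)
    return results


moments = relu_stack_second_moment()
print(moments)
```

Because ReLU is positively homogeneous and both schemes share the same random draws, the two results differ by a factor of $2^{\text{depth}}$: the baseline loses about half of its second moment at every layer, while the He-scaled stack stays on the order of 1.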

Smooth Activations: Mish and Variants

Mish (Misra, 2019): $f(x) = x \cdot \tanh(\text{softplus}(x)) = x \cdot \tanh(\ln(1 + e^x))$

Properties:

  • Smooth, non-monotonic
  • Unbounded above, bounded below (~-0.31)
  • Self-regularizing (small magnitude for negative inputs without hard clamping)
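The formula and the quoted lower bound can be checked numerically. This is a minimal scalar sketch using only the standard library; the stable `softplus` helper and the grid resolution are implementation choices made here, not part of the original definition.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable ln(1 + e^x)."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def mish(x: float) -> float:
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(softplus(x))


# Scan a fine grid on [-5, 5] for the minimum of the function.
grid = [i / 1000 for i in range(-5000, 5001)]
x_min = min(grid, key=mish)
print(f"min mish ~ {mish(x_min):.3f} at x ~ {x_min:.2f}")

# Non-monotonic: the value at x = -3 lies above the minimum near x = -1.2.
print(mish(-3.0) > mish(x_min))
```

The scan locates the minimum of roughly -0.31 near $x \approx -1.2$. To the left of that point the function rises back toward zero, which is the non-monotonic part of the curve.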

Mish was reported to improve accuracy over ReLU and Swish on many image classification benchmarks. The YOLOv4 object detector uses Mish in its backbone; YOLOv5 defaults to SiLU (Swish) instead.

Why non-monotonicity might help: the slope stays small but non-zero for negative inputs (and turns negative below $x \approx -1.2$), so some gradient still flows there, creating richer gradient flow than the hard zero of ReLU.

Comparison of smooth activations on ImageNet top-1 (approximate, ResNet-50):

  • ReLU: 76.1%
  • ELU: 76.4%
  • Swish: 77.0%
  • Mish: 77.2%
  • GELU: ~77%

Differences are small but consistent across architectures and tasks.

GLU Variants: The Mathematics of Gating

Dauphin et al. (2017) introduced Gated Linear Units: $$\text{GLU}(x, W, V, b, c) = \sigma(xW + b) \otimes (xV + c)$$

Where $\otimes$ is element-wise multiplication and $\sigma$ is sigmoid. The sigmoid gate $\sigma(xW + b)$ learns which features to pass through.

Generalized GLU family (Shazeer, 2020): Replace sigmoid with other activations:

| Name | Formula | Gate activation |
|---|---|---|
| GLU | $\sigma(xW) \otimes (xV)$ | Sigmoid |
| Bilinear | $(xW) \otimes (xV)$ | Linear (identity) |
| ReGLU | $\text{ReLU}(xW) \otimes (xV)$ | ReLU |
| GEGLU | $\text{GELU}(xW) \otimes (xV)$ | GELU |
| SwiGLU | $\text{Swish}(xW) \otimes (xV)$ | Swish |

All GLU variants add a multiplication between two projections — this creates a learnable “gate” that controls information flow. The gate is context-dependent: different inputs activate different gates.
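The whole family reduces to one function with a swappable gate activation. The following is a minimal pure-Python sketch under a few assumptions: biases are omitted, Swish is taken with $\beta = 1$, GELU uses the exact erf form, and the tiny weight matrices are hypothetical values chosen for readability.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    return max(0.0, z)


def swish(z):
    return z * sigmoid(z)  # Swish with beta = 1 (also called SiLU)


def gelu(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))  # exact (erf) form


def identity(z):
    return z


GATE_ACTIVATIONS = {
    "GLU": sigmoid,
    "Bilinear": identity,
    "ReGLU": relu,
    "GEGLU": gelu,
    "SwiGLU": swish,
}


def matvec(x, M):
    """Row vector x (length d_in) times matrix M (d_in x d_out)."""
    return [sum(xi * M[i][j] for i, xi in enumerate(x)) for j in range(len(M[0]))]


def glu_variant(name, x, W, V):
    """Compute act(xW) * (xV) element-wise, with biases omitted."""
    act = GATE_ACTIVATIONS[name]
    gate = [act(g) for g in matvec(x, W)]
    value = matvec(x, V)
    return [g * v for g, v in zip(gate, value)]


x = [1.0, -2.0]
W = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical gate projection
V = [[2.0, 0.0], [0.0, -3.0]]  # hypothetical value projection
for name in GATE_ACTIVATIONS:
    print(name, glu_variant(name, x, W, V))
```

Here $xW = [1, -2]$ and $xV = [2, 6]$, so ReGLU zeroes the second feature entirely while the sigmoid gate of GLU only attenuates it. The same input would open or close different gates under a different $W$, which is the context dependence described above.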

Why gates beat simple non-linearities in FFN layers: The gate’s multiplicative interaction creates a second-order (multiplicative) feature interaction rather than first-order (additive). This is more expressive for the same number of parameters — critical in the FFN layers of transformers where most parameters live.

One thing to remember: The activation function’s most important property for modern deep learning isn’t its specific shape — it’s whether it enables gradient flow (no vanishing gradients), allows the network to express sparse or selective computation (gating), and works consistently across the full range of inputs encountered during training.

