Activation Functions — Core Concepts
The Universal Approximation Theorem
Why do activation functions matter mathematically? The Universal Approximation Theorem (Cybenko, 1989; Hornik, 1991) provides the answer.
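Theorem: A feedforward neural network with at least one hidden layer and a suitable non-linear activation function (sigmoidal in Cybenko's version; more generally, any non-polynomial activation) can approximate any continuous function on a compact domain to arbitrary precision, given sufficient hidden units. The theorem guarantees that such a network exists; it does not say how many units are needed or that training will find it.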
Without non-linear activations, stacking layers provides no benefit. A linear transformation of a linear transformation is still linear: $$W_2(W_1 x + b_1) + b_2 = (W_2 W_1) x + (W_2 b_1 + b_2)$$
This is mathematically equivalent to a single affine layer with weight $W_2 W_1$ and bias $W_2 b_1 + b_2$. No amount of depth helps.
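The collapse can be checked numerically. The following is a minimal NumPy sketch; the layer sizes, the random seed, and the variable names are arbitrary choices made for illustration, and inputs are stored one per row, so the weights appear transposed.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Two layers with illustrative shapes: 4 -> 5 -> 3
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(3, 5)), rng.normal(size=3)
X = rng.normal(size=(100, 4))   # 100 input vectors, one per row

# Two stacked layers with no activation in between
H = X @ W1.T + b1
two_layer = H @ W2.T + b2

# The single affine layer predicted by the identity above
W_eff = W2 @ W1
b_eff = W2 @ b1 + b2
one_layer = X @ W_eff.T + b_eff

collapsed = np.allclose(two_layer, one_layer)

# Putting a ReLU between the layers breaks the equivalence
with_relu = np.maximum(0.0, H) @ W2.T + b2
differs = not np.allclose(with_relu, one_layer)

print(collapsed, differs)
```

The first flag confirms that the two-layer linear stack and the single layer agree on every input; the second shows that the same weights with a ReLU in between no longer reduce to that single layer.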
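With non-linear activations, each layer composes new non-linear features on top of the previous ones, and for some function classes a deep network needs far fewer units than a shallow one to reach the same accuracy. This is why depth matters.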
ReLU: What Makes It Special
ReLU: $f(x) = \max(0, x)$
Properties that made it transformative:
- Sparse activation: When pre-activations are roughly zero-mean, about half of the neurons output 0 for a given input, creating sparse, efficient representations
- Non-saturating gradients: For positive inputs, gradient = 1. No vanishing gradient problem for the active neurons
- Computational efficiency: Just a single comparison against zero, much cheaper than sigmoid’s exponential
- Biological plausibility: More similar to how neurons fire (sparse, threshold-based)
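The first two properties are easy to see in code. This is a small NumPy sketch, assuming pre-activations drawn from a standard normal distribution (an illustrative assumption, not a property of any particular network); the function names are chosen here for the example.

```python
import numpy as np

def relu(x):
    """ReLU: element-wise max(0, x)."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise.

    At exactly x == 0 the derivative is undefined; 0 is used by convention.
    """
    return (x > 0).astype(x.dtype)

rng = np.random.default_rng(0)
pre_activations = rng.normal(size=100_000)  # assumed zero-mean and symmetric

# Fraction of units that output exactly zero
sparsity = np.mean(relu(pre_activations) == 0.0)
print(f"fraction of zero outputs: {sparsity:.3f}")
```

Under the zero-mean assumption the reported fraction sits close to one half; in a trained network the actual figure depends on the learned weights and biases.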
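ReLU in AlexNet (2012): Krizhevsky et al. reported that a four-layer convolutional network with ReLUs reached 25% training error on CIFAR-10 about six times faster than an equivalent network with tanh units. That speedup was one of the ingredients of the ImageNet breakthrough.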
The Dying ReLU Problem
ReLU’s zero-for-negative-input property creates a problem: dying ReLU.
If a ReLU neuron consistently receives negative inputs, it will always output 0. During backpropagation, the gradient through a ReLU unit is 0 when the input is negative — so the weights feeding into this neuron receive zero gradient. They never update. The neuron is permanently “dead.”
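A dead neuron can be reproduced with a single unit. The sketch below is illustrative only: the weights, the strongly negative bias, the bounded random inputs, and the squared-error loss are all assumptions picked so that the pre-activation is negative for every input.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(256, 3))  # inputs bounded in [-1, 1]
y = rng.normal(size=256)                   # arbitrary regression targets

w = np.array([0.5, -0.3, 0.2])
b = -10.0  # bias far below zero, e.g. after an oversized update

z = X @ w + b            # |X @ w| <= 1.0, so z <= -9 for every sample
a = np.maximum(0.0, z)   # ReLU output: zero everywhere

# Backpropagation for the loss L = mean((a - y)^2)
dL_da = 2.0 * (a - y) / len(y)
dL_dz = dL_da * (z > 0)  # ReLU passes no gradient where z <= 0
grad_w = X.T @ dL_dz
grad_b = dL_dz.sum()

print(grad_w, grad_b)
```

Both gradients are exactly zero even though the loss is not, so no gradient step can move `w` or `b` back into the active region.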
This happens more often with:
- High learning rates (a large gradient update can push the weights into a regime where the pre-activation is negative for every input)
- Poor initialization (weights that produce negative inputs from the start)
- Batch normalization issues
Solutions:
Leaky ReLU: For negative inputs, output a small non-zero value $f(x) = \max(\alpha x, x)$ with $\alpha \approx 0.01$. Small gradient for negative inputs prevents dying.
Parametric ReLU (PReLU): $\alpha$ is a learnable parameter. The network learns the optimal slope for negative inputs.
ELU (Exponential Linear Unit): $f(x) = x$ for $x > 0$; $f(x) = \alpha(e^x - 1)$ for $x \leq 0$. Smooth, non-zero derivative everywhere, produces negative outputs (helps normalization).
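The three variants differ only in how they treat non-positive inputs. Below is a minimal NumPy sketch of the formulas above and their derivatives; the function names and the default `alpha` values are choices made for this example. PReLU uses the same forward pass as Leaky ReLU, with `alpha` updated by gradient descent.

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: x for x > 0, alpha * x otherwise (PReLU learns alpha)."""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    """Derivative w.r.t. x: 1 for x > 0, alpha otherwise (never zero)."""
    return np.where(x > 0, 1.0, alpha)

def prelu_grad_alpha(x):
    """Derivative of PReLU w.r.t. alpha: x for x <= 0, 0 otherwise."""
    return np.where(x > 0, 0.0, x)

def elu(x, alpha=1.0):
    """ELU: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    # The clamp keeps exp() from overflowing on the unused positive branch.
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))

def elu_grad(x, alpha=1.0):
    """Derivative of ELU w.r.t. x: 1 for x > 0, alpha * exp(x) otherwise."""
    return np.where(x > 0, 1.0, alpha * np.exp(np.minimum(x, 0.0)))

x = np.array([-3.0, -1.0, 0.5, 2.0])
print(leaky_relu_grad(x))  # non-zero for the negative entries
print(elu_grad(x))         # non-zero for the negative entries
```

In both cases the derivative on the negative side is small but non-zero, which is what lets a neuron recover instead of staying dead.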
GELU: The Modern Standard
GELU (Gaussian Error Linear Unit, Hendrycks & Gimpel, 2016) became the default activation for many transformer models, including BERT and the GPT series.
$$\text{GELU}(x) = x \cdot \Phi(x)$$
Where $\Phi(x)$ is the standard Gaussian CDF. Intuitively: GELU weights each input by the probability that a Gaussian random variable is less than or equal to $x$. At large positive $x$: $\Phi(x) \approx 1$, so GELU ≈ identity. At large negative $x$: $\Phi(x) \approx 0$, so GELU ≈ 0. But unlike ReLU, the transition is smooth.
Approximation (fast computation): $$\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}(x + 0.044715x^3)\right]\right)$$
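The exact form and the tanh approximation can be compared directly. This sketch uses only the Python standard library and expresses $\Phi(x)$ through the error function, $\Phi(x) = \tfrac{1}{2}\left(1 + \operatorname{erf}(x/\sqrt{2})\right)$; the function names and the test grid from -5 to 5 are choices made for illustration.

```python
import math

def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi written via the error function."""
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation of GELU, as in the formula above."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x**3)
    return 0.5 * x * (1.0 + math.tanh(inner))

# Compare the two on a grid from -5.0 to 5.0 in steps of 0.1
xs = [i / 10.0 for i in range(-50, 51)]
max_abs_error = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in xs)
print(f"max |exact - approx| on the grid: {max_abs_error:.2e}")
```

On this grid the two curves stay very close, which is why the approximation was a practical substitute where evaluating `erf` was slow.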
Why GELU instead of ReLU for transformers? Empirically better performance on language tasks. The smooth, stochastic interpretation may help with gradient flow in very deep models.
SwiGLU: The LLM Feed-Forward Layer
Modern LLMs (LLaMA, Mistral, Gemma) use SwiGLU (Swish-Gated Linear Unit) in their feed-forward layers. Noam Shazeer (Google, 2020) proposed this variant.
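Standard FFN: $$\text{FFN}(x) = W_2\,\phi(W_1 x + b_1) + b_2$$ where $\phi$ is the activation function (ReLU or GELU).
SwiGLU replaces the activation with a gated unit, followed by the same output projection: $$\text{SwiGLU}(x) = \text{Swish}(W_1 x) \otimes (W_3 x), \qquad \text{FFN}_{\text{SwiGLU}}(x) = W_2\,\text{SwiGLU}(x)$$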
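Where $\text{Swish}(x) = x \cdot \sigma(x)$, with $\sigma$ the logistic sigmoid (smooth, non-monotonic), $\otimes$ is the element-wise product, and $W_3$ is an additional weight matrix that the standard FFN does not have.
The key innovation: gating. One branch of the computation scales the other element-wise. By convention the activated branch $\text{Swish}(W_1 x)$ is called the gate and $W_3 x$ carries the values, but because the product is element-wise, either branch can be read as a learned filter on what the other passes through.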
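A gated feed-forward layer of this kind can be sketched in a few lines of NumPy. The sizes, the random weights, and the function names are illustrative assumptions; inputs are stored one token per row, so products are written `x @ W` rather than $Wx$. The output projection `W2` applied after the gated unit follows the usual feed-forward layout and is likewise an assumption of this sketch, and biases are omitted.

```python
import numpy as np

def swish(x):
    """Swish (also called SiLU): x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, W1, W3, W2):
    """Gated feed-forward layer: (Swish(x W1) * (x W3)) W2."""
    gate = swish(x @ W1)         # activated branch
    value = x @ W3               # linear branch
    return (gate * value) @ W2   # element-wise product, then projection

rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16        # illustrative sizes
W1 = rng.normal(size=(d_model, d_hidden)) * 0.1
W3 = rng.normal(size=(d_model, d_hidden)) * 0.1
W2 = rng.normal(size=(d_hidden, d_model)) * 0.1

x = rng.normal(size=(4, d_model))  # 4 tokens
out = swiglu_ffn(x, W1, W3, W2)
print(out.shape)
```

Wherever an entry of `gate` is close to zero, the matching entry of `value` contributes almost nothing to the output, which is the filtering behaviour the gating description refers to.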
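Because SwiGLU uses three weight matrices instead of two, its hidden dimension is usually set to 2/3 of the standard FFN width so that the parameter count stays the same. Shazeer (2020) reported better results for this variant at equal parameter count, and LLaMA and most modern open-source LLMs use this configuration.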
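The 2/3 factor can be checked with simple arithmetic. The sketch below counts weights only (no biases); the model width of 4096 and the 4x expansion of the standard FFN are illustrative assumptions, and the helper names are hypothetical.

```python
def ffn_params(d_model, d_hidden):
    """Standard FFN: two matrices, d_model x d_hidden and d_hidden x d_model."""
    return 2 * d_model * d_hidden

def swiglu_params(d_model, d_hidden):
    """SwiGLU FFN: three matrices (W1 and W3 up, W2 down)."""
    return 3 * d_model * d_hidden

d_model = 4096                   # illustrative model width
d_ffn = 4 * d_model              # assumed 4x expansion for the standard FFN
d_swiglu = (2 * d_ffn) // 3      # 2/3 of that width, rounded down

standard = ffn_params(d_model, d_ffn)
gated = swiglu_params(d_model, d_swiglu)
print(standard, gated, gated / standard)
```

With three matrices at 2/3 of the width, the total is $3 \cdot \tfrac{2}{3} = 2$ matrices' worth of weights, matching the standard layer up to rounding. Real implementations may round the hidden width to a hardware-friendly multiple, which shifts the count slightly.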
Why gating helps: The gating mechanism allows the layer to be selective about which features to amplify, providing a form of learned sparsity — features not needed for the current token can be gated out.
One thing to remember: In a transformer, the choice of activation function is made in the FFN layers (where SwiGLU now dominates); the attention mechanism uses softmax instead. The progression from sigmoid to ReLU to GELU to SwiGLU tracks the growing empirical understanding of what actually works at scale.