Activation Functions — Core Concepts

The Universal Approximation Theorem

Why do activation functions matter mathematically? The Universal Approximation Theorem (Cybenko, 1989; Hornik, 1991) provides the answer.

Theorem: A feedforward neural network with at least one hidden layer and a suitable non-linear (non-polynomial) activation function can approximate any continuous function on a compact domain to arbitrary precision, given sufficiently many hidden units.
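To make the statement concrete, here is a minimal sketch (not a proof) that hand-builds a one-hidden-layer ReLU network to interpolate $\sin(x)$ on $[0, \pi]$. The helper names `build_relu_interpolant` and `max_error` are hypothetical, and the weights are set in closed form rather than trained; the point is only that the worst-case error shrinks as hidden units are added.

```python
import math

def relu(z):
    return max(0.0, z)

def build_relu_interpolant(f, a, b, n_units):
    """One-hidden-layer ReLU network matching f at n_units + 1 evenly spaced knots."""
    knots = [a + (b - a) * i / n_units for i in range(n_units + 1)]
    values = [f(k) for k in knots]
    # Slope of each linear piece between consecutive knots.
    slopes = [(values[i + 1] - values[i]) / (knots[i + 1] - knots[i])
              for i in range(n_units)]
    # Output weight of hidden unit i is the change in slope at knot i.
    out_weights = [slopes[0]] + [slopes[i] - slopes[i - 1] for i in range(1, n_units)]
    bias = values[0]

    def network(x):
        # Hidden unit i computes relu(1 * x - knots[i]).
        return bias + sum(w * relu(x - k) for w, k in zip(out_weights, knots[:-1]))

    return network

def max_error(f, net, a, b, samples=1000):
    """Largest absolute error on an evenly spaced sample grid."""
    return max(abs(f(x) - net(x))
               for x in (a + (b - a) * i / samples for i in range(samples + 1)))

errors = {
    n: max_error(math.sin, build_relu_interpolant(math.sin, 0.0, math.pi, n), 0.0, math.pi)
    for n in (4, 16, 64)
}
for n, err in errors.items():
    print(f"{n:3d} hidden units: max error {err:.6f}")
```

The error falls as the number of hidden units grows, which is the "given sufficient hidden units" part of the theorem in miniature.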

Without non-linear activations, stacking layers provides no benefit. A linear transformation of a linear transformation is still linear: $$W_2(W_1 x + b_1) + b_2 = (W_2 W_1) x + (W_2 b_1 + b_2)$$

This is mathematically equivalent to a single affine layer with weight $W_2 W_1$ and bias $W_2 b_1 + b_2$. No amount of depth helps.
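The collapse can be checked numerically. The sketch below uses small, arbitrarily chosen matrices (illustrative values only) and plain-Python helpers to show that two stacked linear layers give the same output as the single merged layer from the identity above.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Illustrative weights: a 2 -> 3 layer followed by a 3 -> 2 layer.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
b1 = [0.5, -1.0, 2.0]
W2 = [[1.0, 0.0, -2.0], [2.0, 1.0, 1.0]]
b2 = [1.0, -3.0]
x = [4.0, -2.0]

# Two linear layers applied in sequence, with no activation in between.
two_layer = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# The equivalent single layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(two_layer, one_layer)
```

Inserting any non-linear function between the two layers breaks this equality, which is exactly what gives depth its value.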

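With non-linear activations, each added layer composes new non-linearities on top of the previous ones, so some functions that a shallow network needs exponentially many units to represent can be represented compactly by a deeper one. This is why depth matters.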

ReLU: What Makes It Special

ReLU: $f(x) = \max(0, x)$

Properties that made it transformative:

  1. Sparse activation: Roughly 50% of neurons output 0 for typical inputs — creating sparse, efficient representations
  2. Non-saturating gradients: For positive inputs, gradient = 1. No vanishing gradient problem for the active neurons
  3. Computational efficiency: Just a single comparison with zero, which is much cheaper than sigmoid’s exponential
  4. Biological plausibility: More similar to how neurons fire (sparse, threshold-based)
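The first two properties are easy to see in code. This is a minimal sketch that assumes pre-activations drawn from a zero-mean Gaussian (an illustrative assumption; real networks will differ) and uses the convention that the gradient at exactly zero is 0.

```python
import random

def relu(x):
    return max(0.0, x)

def relu_grad(x):
    # Derivative of ReLU; the value at x == 0 is a convention (0 here).
    return 1.0 if x > 0 else 0.0

rng = random.Random(0)  # fixed seed so the run is repeatable
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
outputs = [relu(z) for z in pre_activations]

# Fraction of units that output exactly zero.
sparsity = sum(1 for y in outputs if y == 0.0) / len(outputs)
print(f"fraction of zero outputs: {sparsity:.3f}")
print("gradient at +2.5:", relu_grad(2.5), "gradient at -2.5:", relu_grad(-2.5))
```

With symmetric inputs about half of the outputs are zero, and every active unit passes its gradient through unchanged.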

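In the AlexNet paper (Krizhevsky et al., 2012), a four-layer convolutional network with ReLUs reached 25% training error on CIFAR-10 about six times faster than an equivalent network with tanh units. That speedup contributed directly to the ImageNet breakthrough.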

The Dying ReLU Problem

ReLU’s zero-for-negative-input property creates a problem: dying ReLU.

If a ReLU neuron consistently receives negative inputs, it will always output 0. During backpropagation, the gradient through a ReLU unit is 0 when the input is negative — so the weights feeding into this neuron receive zero gradient. They never update. The neuron is permanently “dead.”
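A small sketch of the mechanism, using a single hypothetical neuron whose bias has been pushed far negative. The weights, inputs, and learning rate are illustrative values, and the upstream gradient is fixed at 1 for simplicity.

```python
def relu(z):
    return max(0.0, z)

def neuron_grads(w, b, x, upstream=1.0):
    """Gradients of upstream * relu(w . x + b) with respect to w and b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    gate = 1.0 if z > 0 else 0.0  # ReLU derivative: zero for negative input
    return [upstream * gate * xi for xi in x], upstream * gate

# Hypothetical neuron stuck in the negative regime for every input below.
w, b = [0.5, -0.3], -10.0
inputs = [[0.1, 0.9], [0.8, 0.2], [1.0, 1.0], [0.0, 0.0]]
lr = 0.1

for x in inputs:
    grad_w, grad_b = neuron_grads(w, b, x)
    # Plain SGD step; with zero gradients nothing moves.
    w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
    b -= lr * grad_b

print(w, b)  # parameters are unchanged after all updates
```

Because the pre-activation is negative for every input, every gradient is zero and the parameters never leave their starting values.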

This happens more often with:

  • High learning rates (large gradient updates push the weights into a regime where the neuron's input is consistently negative)
  • Poor initialization (weights that produce negative inputs from the start)
  • Batch normalization issues

Solutions:

Leaky ReLU: For negative inputs, output a small non-zero value: $f(x) = \max(\alpha x, x)$ with $\alpha \approx 0.01$. The small but non-zero gradient for negative inputs keeps the neuron trainable.

Parametric ReLU (PReLU): $\alpha$ is a learnable parameter. The network learns the optimal slope for negative inputs.

ELU (Exponential Linear Unit): $f(x) = x$ for $x > 0$; $f(x) = \alpha(e^x - 1)$ for $x \leq 0$. It is smooth (for $\alpha = 1$), has a non-zero derivative everywhere, and produces negative outputs, which pushes mean activations closer to zero.
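The variants above can be sketched directly from their formulas. The default values of `alpha` below are common choices assumed for illustration; for PReLU the same `leaky_relu` function applies, with `alpha` treated as a trained parameter instead of a constant.

```python
import math

def leaky_relu(x, alpha=0.01):
    # Equivalent to max(alpha * x, x) for 0 < alpha < 1.
    return x if x > 0 else alpha * x

def leaky_relu_grad(x, alpha=0.01):
    return 1.0 if x > 0 else alpha

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def elu_grad(x, alpha=1.0):
    return 1.0 if x > 0 else alpha * math.exp(x)

for x in (-3.0, -1.0, 0.5, 2.0):
    print(f"x={x:+.1f}  leaky={leaky_relu(x):+.4f} (grad {leaky_relu_grad(x):.2f})"
          f"  elu={elu(x):+.4f} (grad {elu_grad(x):.4f})")
```

In both cases the gradient for negative inputs is small but never exactly zero, so a neuron that drifts into the negative regime can still recover.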

GELU: The Modern Standard

GELU (Gaussian Error Linear Unit, Hendrycks & Gimpel, 2016) became the standard activation for transformer models such as BERT and the GPT series.

$$\text{GELU}(x) = x \cdot \Phi(x)$$

Where $\Phi(x)$ is the standard Gaussian CDF. Intuitively: GELU weights each input by the probability that a Gaussian random variable is less than or equal to $x$. At large positive $x$: $\Phi(x) \approx 1$, so GELU ≈ identity. At large negative $x$: $\Phi(x) \approx 0$, so GELU ≈ 0. But unlike ReLU, the transition is smooth.

Approximation (fast computation): $$\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}(x + 0.044715x^3)\right]\right)$$
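As a quick check, the exact form and the tanh approximation can be compared side by side. This sketch computes $\Phi(x)$ from the error function via $\Phi(x) = \tfrac{1}{2}\left(1 + \text{erf}(x/\sqrt{2})\right)$; the grid range of $[-5, 5]$ is an arbitrary choice for illustration.

```python
import math

def gelu_exact(x):
    # x * Phi(x), with the Gaussian CDF written in terms of erf.
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    # The tanh-based approximation given above.
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

grid = [i / 100.0 for i in range(-500, 501)]
max_gap = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)

print(f"largest gap on [-5, 5]: {max_gap:.2e}")
for x in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):+.5f}  tanh={gelu_tanh(x):+.5f}")
```

The two curves stay close across the grid, and both approach the identity for large positive inputs and zero for large negative inputs.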

Why GELU instead of ReLU for transformers? Mainly because it performs better empirically on language tasks. Its smoothness and its stochastic interpretation may also help gradient flow in very deep models.

SwiGLU: The LLM Feed-Forward Layer

Modern LLMs (LLaMA, Mistral, Gemma) use SwiGLU (Swish-Gated Linear Unit) in their feed-forward layers. Noam Shazeer (Google, 2020) proposed this variant.

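Standard FFN (with $\phi$ an elementwise activation such as ReLU or GELU): $$\text{FFN}(x) = W_2\,\phi(W_1 x + b_1) + b_2$$

SwiGLU replaces the activation $\phi(W_1 x + b_1)$ with a gated unit: $$\text{SwiGLU}(x) = \text{Swish}(W_1 x) \otimes (W_3 x)$$

Where $\text{Swish}(x) = x \cdot \sigma(x)$ (smooth, non-monotonic), $\sigma$ is the logistic sigmoid, $\otimes$ is elementwise multiplication, and $W_3$ is an additional weight matrix that forms the second branch of the gate. The output projection $W_2$ is then applied to the gated result, and biases are typically omitted.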

The key innovation: gating — one branch of the computation gates the other. The $W_3 x$ branch acts as a learned filter for which information to pass through.
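The gated unit itself is short to write down. The sketch below implements only $\text{Swish}(W_1 x) \otimes (W_3 x)$ in plain Python with tiny illustrative matrices; it leaves out any projection applied afterwards, and the weight values are chosen by hand so that one feature is passed through and the other is gated out.

```python
import math

def sigmoid(z):
    # Fine for the moderate values used here; very negative z would overflow exp.
    return 1.0 / (1.0 + math.exp(-z))

def swish(z):
    return z * sigmoid(z)

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu(x, W1, W3):
    """Elementwise product of the Swish branch and the linear branch."""
    swish_branch = [swish(z) for z in matvec(W1, x)]
    linear_branch = matvec(W3, x)
    return [s * v for s, v in zip(swish_branch, linear_branch)]

x = [1.0, 2.0]
W1 = [[1.0, 0.0], [-10.0, -5.0]]  # second row drives its pre-activation to -20
W3 = [[0.5, 0.5], [1.0, 1.0]]

out = swiglu(x, W1, W3)
print(out)  # first feature passes through, second is close to zero
```

The second feature has a non-zero value on the linear branch, yet the product is nearly zero because the other branch has shut it off.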

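Because SwiGLU uses three weight matrices instead of two, its hidden dimension is usually scaled to 2/3 of the standard FFN's so that the parameter count stays the same. At that equal parameter count, Shazeer (2020) reported better performance than the standard FFN. LLaMA and most modern open-source LLMs use this configuration.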
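The 2/3 factor follows from counting weights. This sketch assumes a standard FFN with two matrices (up and down projection) and a SwiGLU layer with three (the two gated branches plus a projection back to model width), ignores biases, and uses a hypothetical model width of 768 with the common 4x hidden expansion.

```python
def ffn_params(d_model, d_hidden):
    # Up projection (d_model x d_hidden) plus down projection (d_hidden x d_model).
    return 2 * d_model * d_hidden

def swiglu_params(d_model, d_hidden):
    # Two input branches plus one output projection, all d_model x d_hidden in size.
    return 3 * d_model * d_hidden

d_model = 768                       # hypothetical model width
d_ff = 4 * d_model                  # common 4x expansion for a standard FFN
d_ff_swiglu = 2 * d_ff // 3         # hidden width scaled by 2/3

standard = ffn_params(d_model, d_ff)
gated = swiglu_params(d_model, d_ff_swiglu)
print(d_ff, d_ff_swiglu, standard, gated)
```

With the hidden width cut to 2/3, the third matrix is paid for exactly, so both layers hold the same number of weights.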

Why gating helps: The gating mechanism allows the layer to be selective about which features to amplify, providing a form of learned sparsity — features not needed for the current token can be gated out.

One thing to remember: Activation function choice matters more in transformer FFN layers (where SwiGLU dominates) than in the attention mechanism itself — and the progression from sigmoid to ReLU to GELU to SwiGLU tracks the increasing empirical understanding of what actually works at scale.

