Activation Functions — Explain Like I'm 5

Why neural networks need these tiny mathematical functions, and how ReLU's simplicity helped make deep learning practical.

The On/Off Switch for Neurons

Imagine a light switch. Without it, electricity flows equally regardless of whether you want light or not. The switch lets you make decisions — turn on, turn off, or in some cases, dim.

Neurons in a biological brain work similarly — they fire or they don’t, depending on whether the incoming signals are strong enough. Artificial neurons in neural networks need the same ability: to decide whether to “pass through” a signal or to suppress it.

That’s what activation functions do. They’re applied after each neuron’s calculation and decide how strongly to “fire.”
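As a rough sketch of that idea, here is one artificial neuron in plain Python. The names `neuron`, `step`, and `sigmoid` are made up for this illustration and are not taken from any particular library. `step` plays the plain on/off switch and `sigmoid` plays the dimmer.

```python
import math

def step(z):
    # Plain on/off switch: fire fully if the signal is positive, else stay silent.
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    # Dimmer switch: a smooth value between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))

def neuron(inputs, weights, bias, activation):
    # Step 1: the neuron's calculation is a weighted sum of its inputs plus a bias.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Step 2: the activation function decides how strongly to fire.
    return activation(z)

# Illustrative numbers only: the weighted sum here is 0.5 - 2.0 = -1.5.
inputs, weights, bias = [1.0, 2.0], [0.5, -1.0], 0.0
switch_output = neuron(inputs, weights, bias, step)
dimmer_output = neuron(inputs, weights, bias, sigmoid)
```

With a negative weighted sum, the switch version stays off, while the dimmer version fires weakly.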

Why You Can’t Just Stack Math Without Them

Here’s the problem: if a neural network’s layers are just multiplication and addition — which is what they are by default — then stacking multiple layers is mathematically identical to having just one layer. No matter how many layers you add, you’re still doing linear math.
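To see the collapse concretely, here is a minimal sketch with one-number layers. The weights and biases are arbitrary values picked for illustration, and the function names are hypothetical. Real layers use matrices, but the same algebra applies.

```python
def layer(w, b, x):
    # A linear layer with one input and one output: multiply, then add.
    return w * x + b

def two_stacked_layers(x):
    # Layer 1 (w=2, b=1) feeds straight into layer 2 (w=3, b=-4), no activation between.
    hidden = layer(2.0, 1.0, x)
    return layer(3.0, -4.0, hidden)

def one_equivalent_layer(x):
    # 3 * (2x + 1) - 4 simplifies to 6x - 1, which is just another single layer.
    return layer(6.0, -1.0, x)

# The two versions agree on every input we try.
inputs = [-2.0, 0.0, 0.5, 10.0]
stacked = [two_stacked_layers(x) for x in inputs]
single = [one_equivalent_layer(x) for x in inputs]
```

However many such layers you chain together, the product of the weights and the combined biases can always be folded into one layer like this.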

The world isn’t linear. A cat doesn’t look exactly like a dog, just scaled differently. Fraud isn’t just a linear combination of transaction size and location.

Activation functions add non-linearity — they bend and break the straight-line relationships, allowing neural networks to learn complex patterns that no linear function could capture.

The Three Important Ones

Sigmoid (old-fashioned): Squishes everything between 0 and 1. Looks like an S-curve. Was popular, but has “vanishing gradient” problems in deep networks: the curve is nearly flat at both ends, so the gradients passed back through many layers become so small that learning stops in the early layers.

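Tanh: Similar to sigmoid but ranges from -1 to 1. Its output is centered on zero, which often makes training easier than with sigmoid, but it flattens out at both ends in the same way, so it has the same vanishing gradient problem.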
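The vanishing gradient problem can be sketched with a little arithmetic. This is a deliberately simplified picture: it ignores the weights and looks only at the slope of the activation function, which is one of the factors multiplied together as the gradient travels back through each layer. The function names are made up for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_slope(x):
    # Derivative of sigmoid: s * (1 - s). It peaks at 0.25 when x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

def tanh_slope(x):
    # Derivative of tanh: 1 - tanh(x)^2. It peaks at 1 when x = 0 but drops fast.
    return 1.0 - math.tanh(x) ** 2

def best_case_sigmoid_gradient(num_layers):
    # Simplifying assumption: every layer sits at x = 0, where the slope is largest.
    return sigmoid_slope(0.0) ** num_layers

peak_slope = sigmoid_slope(0.0)                    # the largest slope sigmoid can have
ten_layers = best_case_sigmoid_gradient(10)        # 0.25 multiplied by itself 10 times
saturated_tanh = tanh_slope(3.0)                   # tanh slope once the input is large
```

Even in this best case, ten sigmoid layers shrink the signal to less than a millionth of its original size, and a tanh neuron with a large input contributes a slope below 0.01.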

ReLU (current standard): “Rectified Linear Unit.” Extremely simple: if the input is negative, output 0. If positive, output the input unchanged. That’s it.

$$\text{ReLU}(x) = \max(0, x)$$
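In code, the formula is a one-liner. The sketch below also includes the slope, since that is what matters for gradients. The names `relu` and `relu_slope` are just illustrative.

```python
def relu(x):
    # Negative inputs become 0; positive inputs pass through unchanged.
    return max(0.0, x)

def relu_slope(x):
    # The slope is 1 for positive inputs and 0 for negative ones.
    # (At exactly 0 the slope is undefined; treating it as 0 is a common convention.)
    return 1.0 if x > 0 else 0.0

outputs = [relu(x) for x in [-3.0, -0.5, 0.0, 2.5]]

# For an active (positive) neuron, ten slopes multiplied together are still 1.
ten_active_layers = relu_slope(5.0) ** 10
```

Because the slope for active neurons is exactly 1, multiplying it across many layers does not shrink the gradient the way the flat ends of sigmoid and tanh do.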

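ReLU sounds too simple to matter. But the 2012 AlexNet paper reported that a small convolutional network using ReLU reached the same training error on CIFAR-10 about six times faster than an otherwise identical network using tanh. Faster training helped make much deeper networks practical, and this seemingly trivial change soon became the default.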

One thing to remember: Activation functions are what make neural networks actually “neural.” Without them, a 100-layer network would just be a complicated way to fit a single linear model.
