Recurrent Neural Networks — Core Concepts

The sequence processing architecture that dominated NLP from 2013 to 2017 — how RNNs, LSTMs, and GRUs work, and why transformers eventually replaced them.

The Fundamental Problem With Sequential Data

Most neural networks assume inputs are independent: classifying image A has nothing to do with classifying image B. But language, speech, video, and time series data are fundamentally sequential — each element depends on what came before.

A sentence like “The trophy didn’t fit in the suitcase because it was too big” requires understanding that “it” refers to the trophy, not the suitcase. Resolving this requires tracking context across many words.

Recurrent Neural Networks address this by maintaining a hidden state — a compressed summary of the sequence seen so far — that’s updated at each step.

How RNNs Work

At each timestep $t$, an RNN takes two inputs:

The current input $x_t$ (e.g., the current word embedding)
The previous hidden state $h_{t-1}$ (the memory of prior context)

And produces:

The new hidden state $h_t$
An optional output $y_t$ (e.g., a prediction for this timestep)

The update is: $$h_t = \tanh(W_{hh} h_{t-1} + W_{xh} x_t + b)$$ $$y_t = W_{hy} h_t$$

The same weight matrices ($W_{hh}$, $W_{xh}$, $W_{hy}$) are used at every timestep — parameter sharing across time. This is how a single RNN can handle sequences of variable length.
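The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `rnn_forward`, the dimensions, and the random initialization are assumptions made for the example.

```python
import numpy as np

def rnn_forward(xs, h0, W_xh, W_hh, W_hy, b):
    """Run a vanilla RNN over a sequence.

    xs: array of shape (T, input_dim); h0: array of shape (hidden_dim,).
    Returns hidden states (T, hidden_dim) and outputs (T, output_dim).
    """
    h = h0
    hs, ys = [], []
    for x_t in xs:                                # timesteps are processed in order
        h = np.tanh(W_hh @ h + W_xh @ x_t + b)    # h_t from h_{t-1} and x_t
        hs.append(h)
        ys.append(W_hy @ h)                       # y_t = W_hy h_t
    return np.stack(hs), np.stack(ys)

# Hypothetical sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
input_dim, hidden_dim, output_dim = 4, 8, 3
W_xh = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W_hh = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
W_hy = rng.normal(scale=0.1, size=(output_dim, hidden_dim))
b = np.zeros(hidden_dim)

# The same weights handle sequences of different lengths.
for T in (5, 12):
    xs = rng.normal(size=(T, input_dim))
    hs, ys = rnn_forward(xs, np.zeros(hidden_dim), W_xh, W_hh, W_hy, b)
    print(T, hs.shape, ys.shape)
```

Because the loop reuses one set of matrices, nothing in the parameters depends on the sequence length $T$.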

The Vanishing Gradient Problem

Training RNNs requires backpropagation through time (BPTT): unrolling the network over the sequence and backpropagating through every timestep. For a sequence of length $T$, the gradient of the final loss with respect to the initial hidden state is a product of $T$ Jacobians of the hidden-state transition, one per timestep.

When the spectral radius of $W_{hh}$ is less than 1 (common with small random initializations), these repeated multiplications shrink gradients exponentially, and the $\tanh$ derivative, which never exceeds 1, only shrinks them further. This is the vanishing gradient problem: the learning signal from early in the sequence effectively disappears.

When the spectral radius is greater than 1, gradients can explode instead. Gradient clipping is a common fix for explosion, but it does nothing for vanishing.

In practice, vanilla RNNs struggle to learn dependencies spanning more than ~10–20 timesteps. This was a critical limitation for language modeling.
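The shrinking effect can be checked numerically. The sketch below tracks the norm of $\partial h_t / \partial h_0$ for a tanh RNN. It makes several assumptions for the sake of a clean demonstration: the recurrent matrix is a random orthogonal matrix scaled by 0.9 (so every singular value is exactly 0.9), and the input term $W_{xh} x_t + b$ is replaced by random noise. The helper name `jacobian_product_norms` is hypothetical.

```python
import numpy as np

def jacobian_product_norms(W_hh, T, rng):
    """Track the spectral norm of d h_t / d h_0 for a tanh RNN.

    The input contribution at each step is stood in for by random noise.
    """
    n = W_hh.shape[0]
    h = np.zeros(n)
    J = np.eye(n)                                  # d h_0 / d h_0
    norms = []
    for _ in range(T):
        h = np.tanh(W_hh @ h + rng.normal(scale=0.5, size=n))
        # One-step Jacobian: d h_t / d h_{t-1} = diag(1 - h_t^2) W_hh
        J = (np.diag(1.0 - h**2) @ W_hh) @ J
        norms.append(np.linalg.norm(J, 2))
    return norms

rng = np.random.default_rng(0)
n, T = 16, 50
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))       # random orthogonal matrix
W_small = 0.9 * Q                                  # all singular values equal 0.9
norms = jacobian_product_norms(W_small, T, rng)
for t in (1, 10, 25, 50):
    print(t, norms[t - 1])
```

Under these assumptions the norm after $t$ steps is bounded by $0.9^t$, since each factor has norm at most 0.9 and the $\tanh$ derivative is at most 1. After 50 steps almost none of the gradient signal from $h_0$ survives.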

LSTMs: Long Short-Term Memory

Sepp Hochreiter and Jürgen Schmidhuber introduced LSTMs in 1997 specifically to address vanishing gradients. The key innovation: separate the hidden state into a cell state $c_t$ (long-term memory) and a hidden state $h_t$ (short-term, exposed to output layers).

Three multiplicative gates control information flow:

Forget gate: How much of the previous cell state to retain $$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$$

Input gate + candidate values: What new information to write $$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$$ $$\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$$

Cell state update: Combine forgetting and new writing $$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

Output gate: What to expose from the cell state $$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$ $$h_t = o_t \odot \tanh(c_t)$$

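The cell state update is additive: $c_{t-1}$ is scaled elementwise by the forget gate and then added to the new content, rather than being passed through a weight matrix and a squashing nonlinearity at every step. When the forget gate stays close to 1, this creates a nearly direct gradient path across time. It is conceptually similar to residual connections: an additive highway that mitigates vanishing gradients.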
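The gate equations above translate almost line for line into NumPy. The following is a sketch of a single step under assumed conventions: the `params` dictionary layout, the function name `lstm_step`, and the sizes are illustrative choices, not part of any library API.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step.

    params maps each of "f", "i", "c", "o" to a (W, b) pair, where W has
    shape (hidden_dim, hidden_dim + input_dim) and acts on [h_{t-1}, x_t].
    """
    hx = np.concatenate([h_prev, x_t])
    W_f, b_f = params["f"]
    W_i, b_i = params["i"]
    W_c, b_c = params["c"]
    W_o, b_o = params["o"]
    f_t = sigmoid(W_f @ hx + b_f)          # forget gate
    i_t = sigmoid(W_i @ hx + b_i)          # input gate
    c_tilde = np.tanh(W_c @ hx + b_c)      # candidate values
    c_t = f_t * c_prev + i_t * c_tilde     # additive cell state update
    o_t = sigmoid(W_o @ hx + b_o)          # output gate
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t

# Hypothetical sizes and random weights for a short run.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 6
params = {
    k: (rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim)),
        np.zeros(hidden_dim))
    for k in ("f", "i", "c", "o")
}
h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h, c = lstm_step(x_t, h, c, params)
print(h.shape, c.shape)
```

If the forget gate saturates near 1 and the input gate near 0, `c_t` is essentially a copy of `c_prev`, which is the mechanism that lets information persist across many steps.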

GRUs: A Simpler Alternative

Gated Recurrent Units (Cho et al., 2014) merge the cell and hidden states into a single state and use only two gates:

Update gate $z_t$: How much of the old state to keep
Reset gate $r_t$: How much of the old state to use when computing the new candidate

GRUs have about 25% fewer parameters than LSTMs per layer (three weight blocks instead of four for the same input and hidden sizes) while achieving similar performance on many tasks. They’re faster to train and often preferred when compute is limited.
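A single GRU step might look like the sketch below. It assumes the convention in which $z_t$ weights the old state, matching the description of the update gate above; some papers swap the roles of $z_t$ and $1 - z_t$. The `params` layout and the name `gru_step` are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, params):
    """One GRU step.

    params maps "z", "r", "h" to (W, b) pairs; each W has shape
    (hidden_dim, hidden_dim + input_dim).
    """
    W_z, b_z = params["z"]
    W_r, b_r = params["r"]
    W_h, b_h = params["h"]
    hx = np.concatenate([h_prev, x_t])
    z_t = sigmoid(W_z @ hx + b_z)      # update gate: share of the old state kept
    r_t = sigmoid(W_r @ hx + b_r)      # reset gate: share of the old state fed to the candidate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]) + b_h)
    return z_t * h_prev + (1.0 - z_t) * h_tilde

# Hypothetical sizes and random weights for a short run.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 4, 6
params = {
    k: (rng.normal(scale=0.1, size=(hidden_dim, hidden_dim + input_dim)),
        np.zeros(hidden_dim))
    for k in ("z", "r", "h")
}
h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(5, input_dim)):
    h = gru_step(x_t, h, params)
print(h.shape)
```

The final line of `gru_step` is an interpolation between the old state and the candidate, so a single gate does the work that the LSTM splits between its forget and input gates.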

Architecture Variants

Bidirectional RNNs: Run two RNNs on the sequence, one forward and one backward, and concatenate their hidden states at each position. The forward pass gives context from the past; the backward pass gives context from the future. This is critical for tasks where both matter (e.g., NER, where the words that follow a token help determine its label).
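As a rough sketch of the bidirectional idea, the code below runs one tanh RNN left to right and a second one right to left, then concatenates the two states at each position. The helper names, sizes, and random weights are assumptions for illustration only.

```python
import numpy as np

def run_rnn(xs, W_xh, W_hh, b):
    """Return the hidden state at every position for a vanilla tanh RNN."""
    h = np.zeros(W_hh.shape[0])
    hs = []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t + b)
        hs.append(h)
    return np.stack(hs)

def bidirectional_rnn(xs, fwd_params, bwd_params):
    h_fwd = run_rnn(xs, *fwd_params)                 # left to right
    h_bwd = run_rnn(xs[::-1], *bwd_params)[::-1]     # right to left, re-aligned to positions
    return np.concatenate([h_fwd, h_bwd], axis=1)    # shape (T, 2 * hidden_dim)

# Hypothetical sizes; each direction has its own weights.
rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 4, 8

def make_params():
    return (rng.normal(scale=0.3, size=(hidden_dim, input_dim)),
            rng.normal(scale=0.3, size=(hidden_dim, hidden_dim)),
            np.zeros(hidden_dim))

fwd, bwd = make_params(), make_params()
xs = rng.normal(size=(T, input_dim))
out = bidirectional_rnn(xs, fwd, bwd)
print(out.shape)
```

At position 0, the forward half of the output has seen only the first input, while the backward half has seen the entire sequence.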

Deep (stacked) RNNs: Multiple RNN layers stacked, with each layer’s hidden state becoming the input to the next. 2–4 layers was standard for translation tasks.

Encoder-Decoder (Seq2Seq): Two RNNs work together: an encoder reads the input sequence and compresses it into a context vector, and a decoder generates the output sequence conditioned on that vector. This design was widely used for machine translation. Bahdanau et al. (2014) added an attention mechanism to it, which became a direct precursor to transformers.
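The encoder-decoder flow can be sketched with two small vanilla RNNs. Everything here is an illustrative assumption: the function names, the vocabulary size, the choice of token 0 as start-of-sequence and token 1 as end-of-sequence, and greedy decoding. The weights are random and untrained, so the generated tokens are meaningless; only the data flow matters.

```python
import numpy as np

def encode(src_ids, E_src, W_xh, W_hh, b):
    """Compress the source sequence into one context vector (the final hidden state)."""
    h = np.zeros(W_hh.shape[0])
    for tok in src_ids:
        h = np.tanh(W_hh @ h + W_xh @ E_src[tok] + b)
    return h

def greedy_decode(context, E_tgt, W_xh, W_hh, W_hy, b, bos_id, eos_id, max_len):
    """Generate target tokens one at a time, starting from the context vector."""
    h, tok, out = context, bos_id, []
    for _ in range(max_len):
        h = np.tanh(W_hh @ h + W_xh @ E_tgt[tok] + b)
        tok = int(np.argmax(W_hy @ h))     # pick the highest-scoring next token
        if tok == eos_id:
            break
        out.append(tok)
    return out

# Hypothetical sizes and special-token ids.
rng = np.random.default_rng(0)
vocab, embed_dim, hidden_dim = 10, 4, 8
bos_id, eos_id, max_len = 0, 1, 6

def mat(rows, cols):
    return rng.normal(scale=0.3, size=(rows, cols))

E_src, E_tgt = mat(vocab, embed_dim), mat(vocab, embed_dim)
enc = (mat(hidden_dim, embed_dim), mat(hidden_dim, hidden_dim), np.zeros(hidden_dim))
dec = (mat(hidden_dim, embed_dim), mat(hidden_dim, hidden_dim))
W_hy, b_dec = mat(vocab, hidden_dim), np.zeros(hidden_dim)

context = encode([3, 7, 2, 5], E_src, *enc)
out = greedy_decode(context, E_tgt, dec[0], dec[1], W_hy, b_dec, bos_id, eos_id, max_len)
print(context.shape, out)
```

Note that the decoder's only view of the source sentence is `context`, a single vector of length `hidden_dim`, regardless of how long the source is.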

Why Transformers Replaced RNNs

By 2017–2018, RNNs were at their practical limits for several reasons:

Sequential computation: RNNs can’t parallelize across timesteps — you must compute $h_2$ before $h_3$. This makes training slow on long sequences.
Long-range dependencies: Even with LSTMs, dependencies spanning hundreds of tokens are unreliable.
Fixed-size bottleneck: In seq2seq, all information must pass through a single context vector.

The Transformer architecture (Vaswani et al., 2017, “Attention Is All You Need”) addressed all three: computation that parallelizes across positions, direct attention between any two positions, and no single-vector bottleneck. By 2019, BERT and GPT-2 had demonstrated transformers dramatically outperforming LSTM-based models at scale.
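For contrast with the step-by-step RNN loop, here is a minimal sketch of single-head scaled dot-product self-attention, the core operation of the Transformer. It omits multi-head projections, masking, and positional encodings, and the sizes and random weights are assumptions for illustration.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a whole sequence.

    X has shape (T, d_model). Every position attends to every other
    position in one batch of matrix products, with no loop over time.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[1])            # (T, T) pairwise scores
    scores -= scores.max(axis=1, keepdims=True)       # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)     # softmax over each row
    return weights @ V, weights

# Hypothetical sizes and random projections.
rng = np.random.default_rng(0)
T, d_model = 6, 8
X = rng.normal(size=(T, d_model))
W_q, W_k, W_v = (rng.normal(scale=0.3, size=(d_model, d_model)) for _ in range(3))
out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)
```

Entry `weights[i, j]` connects positions `i` and `j` directly, however far apart they are, and the whole `(T, T)` matrix is computed at once rather than one timestep after another.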

Where RNNs Still Appear

RNNs aren’t extinct:

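Streaming inference: When processing data token-by-token in real time (e.g., on-device voice assistants), RNNs carry a fixed-size state, so compute and memory per step stay constant. Transformers must attend over the entire context seen so far (or keep a cache of it), so their per-step cost grows with sequence length.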
State Space Models: Mamba (2023) and similar architectures revisit recurrence with input-dependent state updates, reporting quality competitive with transformers on language modeling while keeping RNN-style inference efficiency
Resource-constrained devices: LSTMs in microcontrollers for IoT sensor processing

One thing to remember: LSTMs eased the vanilla RNN’s memory problem with clever gating, and seq2seq with attention paved the way directly to the transformer. Modern language AI is built on the lessons RNNs taught.
