Recurrent Neural Networks — Deep Dive

BPTT mathematics, LSTM gate analysis, gradient pathology in depth, seq2seq with Bahdanau attention, and the State Space Model revival that challenged transformers.

Backpropagation Through Time (BPTT)

Training an RNN on a sequence of length $T$ requires computing gradients with respect to the shared weight matrices. BPTT unrolls the computation graph across timesteps and applies standard backpropagation.

For the vanilla RNN hidden state $h_t = \tanh(W_{hh}h_{t-1} + W_{xh}x_t)$, the gradient of the loss at the final step with respect to $h_k$ is:

$$\frac{\partial \mathcal{L}_T}{\partial h_k} = \frac{\partial \mathcal{L}_T}{\partial h_T} \prod_{t=k+1}^{T} \frac{\partial h_t}{\partial h_{t-1}}$$

Each factor is:

$$\frac{\partial h_t}{\partial h_{t-1}} = W_{hh}^\top \cdot \text{diag}(\tanh'(W_{hh}h_{t-1} + W_{xh}x_t))$$

Where $\tanh'(z) = 1 - \tanh^2(z) \in (0, 1]$.

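With $T - k$ Jacobian factors in the product and $\tanh' \le 1$, the gradient norm is bounded by $\left\|\frac{\partial \mathcal{L}_T}{\partial h_k}\right\| \le \left\|\frac{\partial \mathcal{L}_T}{\partial h_T}\right\| \cdot \|W_{hh}\|_{spec}^{T-k}$, so it scales at most like $\|W_{hh}\|_{spec}^{T-k}$. For $\|W_{hh}\|_{spec} < 1$ the gradient decays exponentially (vanishing). For $\|W_{hh}\|_{spec} > 1$ it can grow exponentially (exploding), although saturated $\tanh$ units shrink each factor and can still drive the product toward zero.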
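To make the scaling concrete, here is a minimal sketch that uses a one-unit (scalar) RNN, so each Jacobian factor is the single number $w_{hh} \cdot \tanh'(\cdot)$. This is a simplification of the matrix case, and the function name `jacobian_product` and the chosen weights are illustrative assumptions.

```python
import math

def jacobian_product(w_hh, x=0.0, w_xh=1.0, steps=50, h0=0.0):
    """Product of the scalar factors dh_t/dh_{t-1} = w_hh * (1 - h_t^2) over `steps` steps."""
    h, product = h0, 1.0
    for _ in range(steps):
        h = math.tanh(w_hh * h + w_xh * x)
        product *= w_hh * (1.0 - h * h)
    return product

# With zero input and zero initial state, h stays at 0, tanh' = 1,
# and the product is exactly w_hh ** steps.
vanishing = jacobian_product(w_hh=0.5)
exploding = jacobian_product(w_hh=1.5)
# With a nonzero input and a large weight, the unit saturates and
# the small tanh' values shrink every factor.
saturated = jacobian_product(w_hh=3.0, x=0.1)

print(f"w_hh=0.5, x=0.0: {vanishing:.3e}")
print(f"w_hh=1.5, x=0.0: {exploding:.3e}")
print(f"w_hh=3.0, x=0.1: {saturated:.3e}")
```

The third case shows why a large recurrent weight alone does not guarantee explosion: once the unit saturates, the derivative term dominates and the product collapses.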

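Truncated BPTT is the practical solution for long sequences: backpropagate only a fixed number of steps $\tau$ into the past rather than through the full sequence. This caps the cost of the gradient computation, but the gradient then carries no signal from dependencies longer than $\tau$ steps, so the model cannot be trained directly to capture them.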
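The effect of truncation can be sketched on a scalar RNN with a squared-error loss on the final state. The helper names `forward` and `grad_w_hh`, the loss, and the `window` parameter (the truncation length) are assumptions made for illustration; this is not a training loop.

```python
import math

def forward(w_hh, w_xh, xs, h0=0.0):
    """Run a scalar tanh RNN and return all hidden states [h_0, ..., h_T]."""
    hs = [h0]
    for x in xs:
        hs.append(math.tanh(w_hh * hs[-1] + w_xh * x))
    return hs

def grad_w_hh(w_hh, w_xh, xs, target, window=None):
    """dL/dw_hh for L = 0.5 * (h_T - target)^2, backpropagating at most `window` steps."""
    hs = forward(w_hh, w_xh, xs)
    T = len(xs)
    steps = T if window is None else min(window, T)
    delta = hs[T] - target                  # dL/dh_T
    grad = 0.0
    for t in range(T, T - steps, -1):
        local = 1.0 - hs[t] ** 2            # tanh'(pre-activation at step t)
        grad += delta * local * hs[t - 1]   # contribution of w_hh used at step t
        delta *= local * w_hh               # propagate to dL/dh_{t-1}
    return grad

xs = [0.5, -0.3, 0.8, 0.1, -0.6, 0.4]
print("full BPTT      :", grad_w_hh(0.9, 0.7, xs, target=0.2))
print("truncated (k=2):", grad_w_hh(0.9, 0.7, xs, target=0.2, window=2))
```

The truncated gradient simply drops the contributions from steps older than the window. In PyTorch the same effect is commonly obtained by processing the sequence in chunks and calling `detach()` on the hidden state between chunks.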

The LSTM Gradient Highway

The key to LSTM’s vanishing-gradient resistance is the additive cell state update:

$$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$$

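Along the direct cell-state path (treating the gates as constants, that is, ignoring their dependence on earlier states through $h_{t-1}$), the gradient of $c_k$ with respect to $c_j$ (for $j < k$) is:

$$\frac{\partial c_k}{\partial c_j} = \prod_{t=j+1}^{k} \text{diag}(f_t)$$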

This product of forget gates — rather than matrix multiplications — can stay close to 1 when the forget gate learns to output values near 1. When the LSTM learns “remember this”, $f_t \approx 1$ for those positions, and the gradient flows essentially unchanged.

This is analogous to ResNet’s skip connections: replacing multiplicative recurrence with an additive update creates a gradient highway.
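A small numerical sketch of the cell-state path follows. It uses a scalar cell whose gates are driven by fixed logits rather than by $h_{t-1}$, which is a toy assumption that makes the direct-path product exact; `run_cell_state` and the logit values are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def run_cell_state(c0, forget_logits, input_logits, candidates):
    """Apply c_t = f_t * c_{t-1} + i_t * tanh(candidate_t) with gates that ignore h."""
    c = c0
    forget_product = 1.0
    for f_logit, i_logit, cand in zip(forget_logits, input_logits, candidates):
        f, i = sigmoid(f_logit), sigmoid(i_logit)
        c = f * c + i * math.tanh(cand)
        forget_product *= f      # accumulates dc_T/dc_0 along the direct path
    return c, forget_product

steps = 100
_, keep = run_cell_state(1.0, [6.0] * steps, [-6.0] * steps, [0.5] * steps)
_, drop = run_cell_state(1.0, [0.0] * steps, [-6.0] * steps, [0.5] * steps)
print("forget gates near 1 :", keep)
print("forget gates at 0.5 :", drop)
```

With forget gates close to 1, most of the gradient survives 100 steps; with gates at 0.5 it is effectively gone.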

Initialization Matters

The forget gate bias is often initialized to +1 or +2 (pushing initial forget gate activations toward 1). This empirically improves training stability — the model starts in “remember everything” mode and learns to forget selectively.
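A quick calculation shows why the bias matters. The sketch assumes that at initialization the weights contribute roughly zero to the forget gate pre-activation, so the gate is approximately $\sigma(\text{bias})$; `initial_retention` is an illustrative name.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def initial_retention(bias, steps):
    """Direct-path gradient factor after `steps` steps if every forget gate equals sigmoid(bias)."""
    return sigmoid(bias) ** steps

for bias in (0.0, 1.0, 2.0):
    print(f"bias={bias:+.0f}  f={sigmoid(bias):.3f}  f^20={initial_retention(bias, 20):.2e}")
```

In PyTorch's `nn.LSTM`, each bias vector is laid out in input, forget, cell, output gate order, so the forget gate occupies the second quarter of `bias_ih_l0` and `bias_hh_l0`. The two biases are summed, so setting both to 1 gives an effective bias of 2.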

Gated Recurrent Units: Derivation and Comparison

The GRU (Cho et al., 2014) was introduced alongside the encoder-decoder seq2seq architecture. The motivation: simplify LSTMs while retaining gated information control.

Update gate: $z_t = \sigma(W_z [h_{t-1}, x_t])$

Reset gate: $r_t = \sigma(W_r [h_{t-1}, x_t])$

Candidate hidden state: $\tilde{h}_t = \tanh(W [r_t \odot h_{t-1}, x_t])$

Hidden state update: $h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$

The interpolation structure of the update step is notable: when $z_t \approx 0$, $h_t \approx h_{t-1}$ (copy the past); when $z_t \approx 1$, $h_t \approx \tilde{h}_t$ (full update). This directly controls how much the hidden state changes at each step.
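The four GRU equations can be sketched as a single step function in plain Python. Weights are lists of rows acting on the concatenation $[h_{t-1}, x_t]$, with no biases, matching the equations as written here; the helper names and all numeric values are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def gru_step(h_prev, x, W_z, W_r, W):
    """One bias-free GRU step; every weight matrix acts on [h, x]."""
    hx = h_prev + x                                        # concatenation [h_{t-1}, x_t]
    z = [sigmoid(a) for a in matvec(W_z, hx)]              # update gate
    r = [sigmoid(a) for a in matvec(W_r, hx)]              # reset gate
    reset_hx = [ri * hi for ri, hi in zip(r, h_prev)] + x  # [r_t * h_{t-1}, x_t]
    h_tilde = [math.tanh(a) for a in matvec(W, reset_hx)]  # candidate state
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_tilde)]

h_prev, x = [0.3, -0.7], [1.0]
W_r = [[0.1, 0.2, 0.0], [0.0, -0.1, 0.3]]
W = [[0.5, -0.4, 0.9], [0.2, 0.1, -0.6]]
copy_W_z = [[0.0, 0.0, -20.0], [0.0, 0.0, -20.0]]    # drives z toward 0: keep the past
update_W_z = [[0.0, 0.0, 20.0], [0.0, 0.0, 20.0]]    # drives z toward 1: take the candidate

print("z near 0:", gru_step(h_prev, x, copy_W_z, W_r, W))
print("z near 1:", gru_step(h_prev, x, update_W_z, W_r, W))
```

The first call returns a state that is almost identical to `h_prev`, and the second returns almost exactly the candidate state, which is the interpolation behavior described above.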

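Empirically, GRUs and LSTMs perform similarly on most benchmarks. GRUs have three weight blocks per layer instead of four, so they are typically somewhat faster to train, while LSTMs are sometimes reported to do slightly better on very long sequences.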

Seq2Seq and Bahdanau Attention

The encoder-decoder architecture (Sutskever et al., 2014) uses one RNN to encode an input sequence into a fixed-size context vector $c = h_T$ (the final encoder hidden state), then conditions a decoder RNN on $c$ to generate the output sequence.

The bottleneck problem: All information from a potentially long input sentence must be compressed into a single vector. For short sentences this works; for sentences of 20+ words, translation quality degrades sharply.

Bahdanau attention (2014) solved this by allowing the decoder to attend to all encoder hidden states at each decoding step:

$$e_{tj} = a(s_{t-1}, h_j)$$

$$\alpha_{tj} = \frac{\exp(e_{tj})}{\sum_k \exp(e_{tk})}$$

$$c_t = \sum_j \alpha_{tj} h_j$$

Where $s_{t-1}$ is the decoder’s previous hidden state, $h_j$ are encoder hidden states, and $a$ is an alignment model (a small feedforward network).

The context vector $c_t$ at each decoder step is a weighted sum of encoder states: the model “attends” to different input positions for each output token. Visualizing the attention weights $\alpha_{tj}$ reveals learned soft word alignments between the source and target languages.
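The three attention equations can be sketched for a single decoder step as follows. The alignment model is written in the additive form $e_{tj} = v^\top \tanh(W_s s_{t-1} + W_h h_j)$; the parameter names `W_s`, `W_h`, `v` and all numeric values are assumptions chosen for illustration.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def additive_attention(s_prev, encoder_states, W_s, W_h, v):
    """Return (alpha, context) for one decoder step with additive scoring."""
    s_proj = matvec(W_s, s_prev)
    scores = []
    for h_j in encoder_states:
        h_proj = matvec(W_h, h_j)
        hidden = [math.tanh(a + b) for a, b in zip(s_proj, h_proj)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))   # e_tj
    # Softmax over source positions (subtract the max for numerical stability)
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    # Context vector: attention-weighted sum of encoder states
    dim = len(encoder_states[0])
    context = [sum(a * h[d] for a, h in zip(alpha, encoder_states)) for d in range(dim)]
    return alpha, context

encoder_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
alpha, context = additive_attention(
    s_prev=[0.2, -0.1],
    encoder_states=encoder_states,
    W_s=[[0.3, -0.2], [0.1, 0.4]],
    W_h=[[0.5, 0.1], [-0.3, 0.2]],
    v=[1.0, -1.0],
)
print("attention weights:", alpha)
print("context vector   :", context)
```

The weights form a probability distribution over source positions, and a new distribution is computed for every decoder step, which is what removes the single-vector bottleneck.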

This was the direct conceptual predecessor to the attention mechanism in transformers. Vaswani et al. (2017) asked: what if you replaced the RNN entirely and used only attention?

Multi-Layer and Bidirectional Extensions

For a stacked $L$-layer RNN:

$$h_t^{(l)} = f(h_t^{(l-1)}, h_{t-1}^{(l)})$$

Layer $l$’s hidden state takes its input from the layer below (current timestep) and from itself (previous timestep). Google’s original production NMT system used an 8-layer LSTM encoder with residual connections between layers, a depth that required careful initialization.
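The stacked recurrence can be sketched with scalar layers. Here a scalar tanh cell stands in for the generic $f$, and the first layer reads the raw input $x_t$; `rnn_cell`, `stacked_rnn`, and the weights are illustrative assumptions.

```python
import math

def rnn_cell(inp, h_prev, w_in, w_rec):
    """Scalar tanh cell standing in for the generic f."""
    return math.tanh(w_in * inp + w_rec * h_prev)

def stacked_rnn(xs, layer_weights):
    """Run a multi-layer scalar RNN; layer_weights is a list of (w_in, w_rec) pairs."""
    h = [0.0] * len(layer_weights)     # h[l] holds layer l's previous hidden state
    top_outputs = []
    for x in xs:
        below = x                       # the first layer reads the raw input
        for l, (w_in, w_rec) in enumerate(layer_weights):
            h[l] = rnn_cell(below, h[l], w_in, w_rec)
            below = h[l]                # the next layer reads this layer at the same timestep
        top_outputs.append(below)
    return top_outputs

print(stacked_rnn([1.0, -0.5, 0.25], [(0.8, 0.5), (1.2, -0.3), (0.6, 0.9)]))
```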

Bidirectional RNNs concatenate forward and backward hidden states:

$$\overrightarrow{h}_t = f(x_t, \overrightarrow{h}_{t-1})$$

$$\overleftarrow{h}_t = f(x_t, \overleftarrow{h}_{t+1})$$

$$h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$$
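A minimal sketch of the bidirectional pass, again with a scalar tanh cell standing in for $f$; the function names and weights are assumptions for illustration.

```python
import math

def run_direction(xs, w_in, w_rec):
    """Scalar tanh RNN over xs; returns the hidden state at every position."""
    h, states = 0.0, []
    for x in xs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def bidirectional_rnn(xs, fwd_weights, bwd_weights):
    """Return [(forward_h_t, backward_h_t)] for each position t."""
    forward = run_direction(xs, *fwd_weights)
    # Backward pass: read the sequence right to left, then restore the original order
    backward = run_direction(xs[::-1], *bwd_weights)[::-1]
    return list(zip(forward, backward))

for t, pair in enumerate(bidirectional_rnn([0.5, -1.0, 2.0], (0.7, 0.4), (0.9, -0.2))):
    print(t, pair)
```

Note that at the last position the backward state has seen only the final input, while at the first position it has seen the whole sequence. This matters when choosing which states to use as a sequence summary.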

Bidirectionality doubles the size of the per-timestep representation, since each $h_t$ concatenates two hidden states. In ELMo (Peters et al., 2018), which builds contextualized word embeddings from LSTMs, bidirectional LSTMs produced representations where the word “bank” in “river bank” and “bank account” had measurably different vectors. This was a significant step toward BERT-style pretraining.

State Space Models: The RNN Revival

By 2022–2023, researchers questioned whether transformers were fundamentally necessary. State Space Models (SSMs) offer RNN-like sequence processing with parallelizable training.

The core SSM recurrence is linear:

$$h_t = \mathbf{A}h_{t-1} + \mathbf{B}x_t$$

$$y_t = \mathbf{C}h_t$$

With structured $\mathbf{A}$ matrices (e.g., diagonal or HiPPO — High-order Polynomial Projection Operators), these models can capture long-range dependencies that vanilla RNNs can’t.

The key insight: linear SSMs have a convolution equivalent. The same computation can be done as a sequential recurrence (for inference — O(1) state per step) or as a global convolution (for training — parallelizable). This gives the best of both worlds.
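The equivalence can be checked numerically. The sketch below assumes a diagonal $\mathbf{A}$, a scalar input and output, and a zero initial state, so the kernel is $K_k = \mathbf{C}\mathbf{A}^k\mathbf{B}$; the function names and parameter values are illustrative.

```python
def ssm_recurrent(a_diag, b, c, xs):
    """Sequential form: h_t = A h_{t-1} + B x_t, y_t = C h_t, with diagonal A."""
    h = [0.0] * len(a_diag)
    ys = []
    for x in xs:
        h = [a * hi + bi * x for a, hi, bi in zip(a_diag, h, b)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys

def ssm_kernel(a_diag, b, c, length):
    """Convolution kernel K_k = C A^k B for k = 0 .. length - 1."""
    return [sum(ci * (a ** k) * bi for a, bi, ci in zip(a_diag, b, c)) for k in range(length)]

def ssm_convolutional(a_diag, b, c, xs):
    """Convolutional form: y_t = sum_{k=0}^{t} K_k x_{t-k}."""
    kernel = ssm_kernel(a_diag, b, c, len(xs))
    return [sum(kernel[k] * xs[t - k] for k in range(t + 1)) for t in range(len(xs))]

a_diag, b, c = [0.9, 0.5], [1.0, 0.5], [0.3, -0.2]
xs = [1.0, 0.0, 2.0, -1.0, 0.5]
print("recurrent    :", ssm_recurrent(a_diag, b, c, xs))
print("convolutional:", ssm_convolutional(a_diag, b, c, xs))
```

Each output of the convolutional form depends only on the inputs and a precomputed kernel, so all timesteps can be computed independently. The direct sum here costs $O(T^2)$; practical implementations typically evaluate the convolution with an FFT.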

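Mamba (Gu & Dao, 2023) extended this with selective state spaces: the $\mathbf{B}$ and $\mathbf{C}$ matrices and the step size become input-dependent, giving the model attention-like selectivity without quadratic complexity. Because the parameters now vary per timestep, a fixed convolution kernel no longer applies, and Mamba trains with a hardware-aware parallel scan instead. The authors report language-modeling quality matching transformers at similar scales, with about 5x higher inference throughput and compute that scales linearly in sequence length.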

By 2024, hybrid models (some layers attention, some layers SSM) began appearing, and the question of whether transformers or SSMs are “correct” for language remained genuinely open.

Practical Implementation Notes

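A minimal PyTorch implementation of a bidirectional LSTM classifier:

import torch
import torch.nn as nn

class LSTMModel(nn.Module):
    def __init__(self, input_size, hidden_size, num_layers, output_size):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size, hidden_size, num_layers,
            batch_first=True, bidirectional=True,
            # Inter-layer dropout only applies when there is more than one layer
            dropout=0.2 if num_layers > 1 else 0.0,
        )
        self.fc = nn.Linear(hidden_size * 2, output_size)  # *2 for bidirectional

    def forward(self, x):
        # x: (batch, seq_len, input_size)
        out, (h_n, c_n) = self.lstm(x)
        # h_n: (num_layers * 2, batch, hidden_size). The last two entries are the
        # top layer's final forward and backward states. Using out[:, -1, :] would
        # pair the forward summary with a backward state that has seen one token.
        final = torch.cat([h_n[-2], h_n[-1]], dim=1)  # (batch, hidden_size * 2)
        return self.fc(final)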

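For very long sequences (>1000 tokens), gradient clipping is essential. Call it after the backward pass and before the optimizer step:

# Between loss.backward() and optimizer.step()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)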
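The rule behind norm-based clipping can be sketched in plain Python. This is an illustration of the global-norm idea rather than PyTorch's own code; the function name `clip_by_global_norm` and the small `eps` guard are assumptions.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale all gradients by one shared factor so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    scale = min(1.0, max_norm / (total_norm + eps))   # never scale gradients up
    clipped = [[g * scale for g in tensor] for tensor in grads]
    return clipped, total_norm

grads = [[3.0], [4.0]]                                 # two "parameters", global norm 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
print("norm before:", norm_before)
print("clipped    :", clipped)
```

Because every gradient is multiplied by the same factor, the update direction is preserved and only its length is capped. Gradients that are already within the limit pass through unchanged.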

One thing to remember: The mathematical innovations from RNN research — gating mechanisms, additive state updates, and attention over hidden states — directly became the building blocks of transformer architecture, making RNNs not obsolete but foundational.
