Transformer Architecture — Deep Dive

From the math behind scaled dot-product attention to KV caching, Flash Attention, and why the quadratic complexity wall might kill Transformers — or why it might not matter.

The Architecture, Layer by Layer

Let’s build a Transformer from scratch — conceptually, then mathematically.

Input Pipeline

Raw text doesn’t go in directly. It gets tokenized (typically via BPE or WordPiece), converted to integer IDs, then mapped to dense embedding vectors. For a model with d_model = 512, each token becomes a 512-dimensional float vector.

To this, positional encoding is added:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Each position gets a unique frequency-based signature. The elegance: because the sinusoidal patterns have a systematic structure, the model can in principle generalize to sequence lengths not seen during training, although in practice extrapolation far beyond the training length tends to be unreliable. Newer architectures (GPT-NeoX, LLaMA) use RoPE (Rotary Position Embedding), which encodes relative positions directly into the attention computation by rotating query and key vectors instead of adding absolute positions up front. This has tended to work better in practice, particularly when extending context length.
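The sinusoidal formula above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's implementation; the function name and the even-d_model restriction are assumptions made here for brevity.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    assert d_model % 2 == 0, "this sketch assumes an even d_model"
    pos = np.arange(seq_len)[:, None]            # (seq_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]    # (1, d_model/2): the "2i" values
    angles = pos / np.power(10000.0, two_i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sin
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cos
    return pe

pe = sinusoidal_positional_encoding(seq_len=128, d_model=512)
# pe is added elementwise to the (seq_len, d_model) token embeddings.
```

Position 0 comes out as alternating 0 and 1 (sin(0) and cos(0)), and every value stays in [-1, 1], so the encoding does not swamp the embeddings it is added to.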

Scaled Dot-Product Attention — The Math

Given input matrix X of shape (seq_len × d_model), three weight matrices project it:

Q = X · W_Q    # shape: (seq_len × d_k)
K = X · W_K    # shape: (seq_len × d_k)  
V = X · W_V    # shape: (seq_len × d_v)

Attention is then:

Attention(Q, K, V) = softmax(QK^T / √d_k) · V

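The √d_k scaling prevents the dot products from growing too large in magnitude, which would push softmax into saturation and kill gradients. If the components of a query and a key are independent with zero mean and unit variance, their dot product has variance d_k, so for d_k = 64 the unscaled scores have a standard deviation of about 8. Scores that large make softmax extremely peaked, and its gradients become near-zero almost everywhere. Dividing by √d_k brings the standard deviation back to about 1.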
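A quick numerical check of the scaling argument. It assumes query and key components are independent standard normal draws, which is an idealization chosen for this sketch rather than a property of any trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 64

# Assumption: zero-mean, unit-variance, independent components.
q = rng.standard_normal((10_000, d_k))
k = rng.standard_normal((10_000, d_k))

raw = np.sum(q * k, axis=1)      # 10,000 unscaled dot products
scaled = raw / np.sqrt(d_k)      # the same scores after the 1/sqrt(d_k) scaling

raw_std = raw.std()              # should land near sqrt(64) = 8
scaled_std = scaled.std()        # should land near 1
print(f"unscaled std ~ {raw_std:.2f}, scaled std ~ {scaled_std:.2f}")
```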

The output is a weighted average of value vectors — each output position is a blend of all value vectors, weighted by how much the query at that position matched each key.
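Here is a minimal single-head sketch of the formula in NumPy. The shapes and random weights are placeholders chosen for illustration; `attention_weights` is a helper name introduced here so the softmax rows can be inspected.

```python
import numpy as np

def attention_weights(Q, K):
    """Row-wise softmax of the scaled scores; each row sums to 1."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Each output row is a weighted average of the rows of V.
    return attention_weights(Q, K) @ V

rng = np.random.default_rng(0)
seq_len, d_model, d_k, d_v = 5, 16, 8, 8
X = rng.standard_normal((seq_len, d_model))
W_Q = rng.standard_normal((d_model, d_k))
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_v))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
weights = attention_weights(Q, K)             # (seq_len, seq_len)
out = scaled_dot_product_attention(Q, K, V)   # (seq_len, d_v)
```

Because every row of `weights` is non-negative and sums to 1, each output coordinate is bounded by the smallest and largest value in the corresponding column of `V`.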

Multi-Head Attention

Instead of one attention function over full-dimensional queries, split into h heads:

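import numpy as np

def scaled_dot_product_attention(q, k, v):
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v

def multi_head_attention(X, W_Q, W_K, W_V, W_O):
    # W_Q, W_K, W_V: lists of h per-head projections, each (d_model × d_k)
    # W_O: output projection, (h·d_k × d_model)
    outputs = []
    for W_Q_i, W_K_i, W_V_i in zip(W_Q, W_K, W_V):
        q_i = X @ W_Q_i   # (seq × d_k)
        k_i = X @ W_K_i
        v_i = X @ W_V_i
        attn_i = scaled_dot_product_attention(q_i, k_i, v_i)
        outputs.append(attn_i)
    concatenated = np.concatenate(outputs, axis=-1)   # (seq × d_model)
    return concatenated @ W_O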

Each head operates in a lower-dimensional subspace (d_k = d_model/h, so 64 for d_model=512, h=8). The heads can learn different attention patterns: interpretability research (including work from Anthropic, Google, and MIT) has reported heads that appear to specialize, with some tracking subject-verb agreement, some handling coreference, and some focusing on positional neighbors.

Feed-Forward Sublayer

Each Transformer block applies attention then a position-wise FFN:

FFN(x) = max(0, xW₁ + b₁)W₂ + b₂

Typically d_ff = 4 × d_model. The expansion-contraction pattern is not arbitrary: the wider hidden layer gives the model capacity to store factual associations. Recent work suggests the FFN layers act as “key-value memory” for factual recall, while attention handles relational structure.

Modern variants replace ReLU with SwiGLU (used in LLaMA, PaLM):

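SwiGLU(x) = (Swish(xW₁) ⊗ xW₂) · W₃

Where ⊗ is elementwise product and Swish(z) = z · σ(z), with σ the sigmoid. (With a plain sigmoid gate, σ(xW₁) ⊗ xW₂, this would be the original GLU rather than SwiGLU.) GLU-style variants such as SwiGLU and GeGLU have been reported to outperform plain ReLU/GELU feed-forward layers at comparable parameter counts, which is why these models adopted them.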
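Both feed-forward variants fit in a few lines. This is a sketch under stated assumptions: the SwiGLU gate uses Swish, z · sigmoid(z), as in the published definition; biases are dropped in the SwiGLU version; and `init` is a hypothetical helper for small random weights.

```python
import numpy as np

def ffn_relu(x, W1, b1, W2, b2):
    """Classic position-wise FFN: expand, ReLU, contract."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def swish(z):
    return z / (1.0 + np.exp(-z))   # identical to z * sigmoid(z)

def ffn_swiglu(x, W1, W2, W3):
    """Gated FFN: a Swish-activated branch multiplies a linear branch elementwise."""
    return (swish(x @ W1) * (x @ W2)) @ W3

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
d_ff = 4 * d_model
x = rng.standard_normal((seq_len, d_model))

def init(*shape):
    # Hypothetical helper: small random weights, for illustration only.
    return rng.standard_normal(shape) * 0.1

y_relu = ffn_relu(x, init(d_model, d_ff), np.zeros(d_ff),
                  init(d_ff, d_model), np.zeros(d_model))
y_swiglu = ffn_swiglu(x, init(d_model, d_ff), init(d_model, d_ff),
                      init(d_ff, d_model))
```

Note that the gated version carries three weight matrices instead of two, so at the same d_ff it has more parameters than the ReLU version.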

Residual Connections + Layer Norm

Each sublayer wraps in:

x = LayerNorm(x + Sublayer(x))

Residual connections (borrowed from ResNet) allow gradients to flow through deep stacks without vanishing. Layer normalization stabilizes training — without it, 24+ layer networks are nearly untrainable.

Pre-norm vs. post-norm matters more than it sounds. The original paper used post-norm (normalize after adding the residual, as in the formula above). Most modern large models use pre-norm, x = x + Sublayer(LayerNorm(x)), which normalizes the sublayer input and leaves the residual path untouched. Pre-norm is more stable at large scale and is much less dependent on learning-rate warmup.
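The difference between the two orderings is easiest to see side by side. In this sketch the learned gain and bias of LayerNorm are omitted, and `sublayer` is a hypothetical stand-in for attention or the FFN.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and unit variance (learned gain/bias omitted).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_norm_block(x, sublayer):
    # Original Transformer: add the residual, then normalize the sum.
    return layer_norm(x + sublayer(x))

def pre_norm_block(x, sublayer):
    # Pre-norm: normalize the sublayer input; the residual path is left untouched.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16)) * 0.1

def sublayer(h):
    return np.tanh(h @ W)   # hypothetical stand-in for attention or FFN

x = rng.standard_normal((4, 16))
y_post = post_norm_block(x, sublayer)
y_pre = pre_norm_block(x, sublayer)
```

In the pre-norm version the input `x` reaches the output through a plain addition, so stacking many blocks keeps an unnormalized identity path from the first layer to the last.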

The Quadratic Problem

Self-attention has O(n²) complexity in sequence length. For a sequence of 1,000 tokens, you’re computing a 1,000×1,000 attention matrix (1M values). For 10,000 tokens: 100M values. For 100,000 tokens: 10B values.
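The arithmetic is worth running once. The sketch below counts entries in one attention score matrix and converts to bytes at fp16; it covers a single head in a single layer, so real totals multiply by heads and layers if the matrix is materialized.

```python
def attention_matrix_bytes(seq_len: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to store one seq_len x seq_len score matrix (one head, one layer)."""
    return seq_len * seq_len * bytes_per_value

for n in (1_000, 10_000, 100_000):
    entries = n * n
    gb = attention_matrix_bytes(n) / 1e9
    print(f"n={n:>7,}  entries={entries:>15,}  fp16 size={gb:8.3f} GB")
```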

This is why early GPT models had 512 or 1,024 token context windows. It wasn’t a design choice — it was a memory constraint.

Several approaches have attacked this:

Sparse Attention (Longformer, BigBird): Not every token attends to every other token. Each token attends to a local window plus a few global tokens, which reduces cost to O(n·w) for window size w, linear in n. Works well for tasks where long-range dependencies are sparse.

Linear Attention (Performers, cosine attention): Approximate the softmax kernel with a feature map φ, so that softmax(QK^T)V is replaced by φ(Q)(φ(K)^T V). Changing the order of matrix multiplication this way reduces the cost from O(n²d) to O(nd²), which is much faster for long sequences when d < n.

Flash Attention (Dao et al., 2022): Doesn’t reduce the O(n²) compute, but dramatically reduces memory traffic and footprint. By tiling the computation to fit within on-chip GPU SRAM (roughly an order of magnitude higher bandwidth than HBM), Flash Attention computes exact attention without ever materializing the full n×n matrix in HBM, so the extra memory grows linearly rather than quadratically with sequence length. Reported results: 2-4x speedup and 5-20x memory reduction, depending on sequence length. Flash Attention 2 and 3 pushed this further. It is now the default in most production LLM stacks.

State Space Models and linear RNNs (Mamba, RWKV): A different approach entirely. These are recurrent architectures (Mamba is a selective state space model, RWKV a linear-attention-style RNN) that process sequences in O(n) with no quadratic bottleneck. Reported to be competitive with Transformers on many tasks at up to roughly the 7B scale. Not yet definitively better at scale, but worth watching.
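Two of the ideas above are compact enough to sketch in NumPy: the reordering trick behind linear attention, and the blockwise "online softmax" that lets Flash Attention avoid building the full score matrix. Both are illustrations of the math only. The elu(x)+1 feature map is an assumption borrowed from Katharopoulos et al. (2020), not the random-feature scheme Performers use, and it defines a different kernel rather than approximating softmax. The real Flash Attention also tiles the queries and runs as a fused GPU kernel, which this code does not attempt.

```python
import numpy as np

def exact_attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ V

def tiled_attention(Q, K, V, block=4):
    """Exact attention, visiting keys/values one block at a time (online softmax)."""
    n, d_k = Q.shape
    out = np.zeros((n, V.shape[-1]))
    row_max = np.full((n, 1), -np.inf)   # running max score per query
    row_sum = np.zeros((n, 1))           # running softmax denominator per query
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        s = Q @ Kb.T / np.sqrt(d_k)                        # only (n, block) scores exist
        new_max = np.maximum(row_max, s.max(axis=-1, keepdims=True))
        rescale = np.exp(row_max - new_max)                # re-base earlier partial sums
        p = np.exp(s - new_max)
        row_sum = row_sum * rescale + p.sum(axis=-1, keepdims=True)
        out = out * rescale + p @ Vb
        row_max = new_max
    return out / row_sum

def feature_map(x):
    # Assumed feature map: elu(x) + 1, which is strictly positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def kernel_attention_quadratic(Q, K, V):
    A = feature_map(Q) @ feature_map(K).T                  # (n, n) matrix, O(n^2 d)
    return (A @ V) / A.sum(axis=-1, keepdims=True)

def kernel_attention_linear(Q, K, V):
    fq, fk = feature_map(Q), feature_map(K)
    kv = fk.T @ V                                          # (d_k, d_v), no n x n matrix
    z = fk.sum(axis=0)                                     # (d_k,) normalizer
    return (fq @ kv) / (fq @ z)[:, None]

rng = np.random.default_rng(0)
Q = rng.standard_normal((10, 8))
K = rng.standard_normal((10, 8))
V = rng.standard_normal((10, 4))
tiled_matches = np.allclose(tiled_attention(Q, K, V, block=3), exact_attention(Q, K, V))
linear_matches = np.allclose(kernel_attention_linear(Q, K, V),
                             kernel_attention_quadratic(Q, K, V))
```

The tiled version returns the same numbers as the exact one, which is the point: Flash Attention changes where the intermediate values live, not what is computed. The linear version matches its own quadratic form, but not softmax attention.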

KV Cache: Why Inference Is Manageable

During autoregressive generation (generating one token at a time), the model processes each new token but doesn’t need to recompute keys and values for all previous tokens. The KV cache stores K and V matrices from previous steps:

# At step t:
# Reuse K[0..t-1] and V[0..t-1] from cache
# Only compute new K[t] and V[t] for the new token
# Run attention: Q[t] attends over all cached K/V
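The steps above can be sketched as a runnable single-head loop. For simplicity the token embeddings `X` are given up front rather than produced by sampling, and the cache is a Python list that is re-stacked each step; a real implementation would preallocate a tensor per layer.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def decode_with_kv_cache(X, W_Q, W_K, W_V):
    """Single-head causal attention, one token at a time, reusing cached K/V."""
    d_k = W_Q.shape[1]
    K_cache, V_cache, outputs = [], [], []
    for x_t in X:                        # x_t: embedding of the token at step t
        K_cache.append(x_t @ W_K)        # compute K[t] once and keep it
        V_cache.append(x_t @ W_V)        # compute V[t] once and keep it
        q_t = x_t @ W_Q                  # only the new token's query is needed
        scores = np.stack(K_cache) @ q_t / np.sqrt(d_k)     # shape (t+1,)
        outputs.append(softmax(scores) @ np.stack(V_cache))
    return np.stack(outputs)

def full_causal_attention(X, W_Q, W_K, W_V):
    """Reference: recompute everything at once with a causal mask."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 16))
W_Q, W_K, W_V = (rng.standard_normal((16, 8)) for _ in range(3))
cached = decode_with_kv_cache(X, W_Q, W_K, W_V)
reference = full_causal_attention(X, W_Q, W_K, W_V)
```

The cached loop and the masked full computation give the same outputs; the cache only removes redundant work.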

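Without the KV cache, every generation step would recompute keys, values, and attention for the whole prefix, so step t would cost O(t²·d) in attention alone. With it, each new token costs O(n·d), linear in context length. The tradeoff: the KV cache for a long context window gets large. For LLaMA-2-70B at fp16 with a 4,096-token context, the cache is about 1.3 GB per request (2 × 80 layers × 8 KV heads × 128 dims × 4,096 tokens × 2 bytes). Without grouped-query attention, i.e. with all 64 heads keeping their own K/V, it would be about 10.7 GB.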
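Cache size is simple to estimate once the model shape is known. The sketch below assumes the published LLaMA-2-70B configuration (80 layers, head dimension 128, 64 query heads, 8 KV heads under GQA); treat those numbers as assumptions of this example, since the result is very sensitive to the KV head count.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Factor 2: one K vector and one V vector per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Assumed LLaMA-2-70B shape; fp16 means 2 bytes per value.
gqa = kv_cache_bytes(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=4096)
mha = kv_cache_bytes(n_layers=80, n_kv_heads=64, head_dim=128, seq_len=4096)

print(f"GQA (8 KV heads):  {gqa / 1e9:.2f} GB per request")   # 1.34 GB
print(f"MHA (64 KV heads): {mha / 1e9:.2f} GB per request")   # 10.74 GB
```

The 8x gap between the two lines is exactly the ratio of KV heads, which is the whole appeal of sharing K/V across query heads.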

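This is why serving long-context LLMs at scale is expensive, and why shrinking the KV cache is an active research area: sharing K/V heads across query heads (GQA, Grouped Query Attention), compressing K/V into a low-rank latent (MLA, Multi-head Latent Attention, in DeepSeek), and quantizing the cached values.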

Decoder-Only vs Encoder-Decoder vs Encoder-Only

The original Transformer was encoder-decoder for machine translation. Three main variants emerged:

Encoder-only (BERT, RoBERTa):

Bidirectional attention — every token sees every other token
Trained with masked language modeling (predict masked tokens)
Best for: classification, NER, question answering (reading comprehension tasks)
Not generative

Decoder-only (GPT series, LLaMA, Claude, Gemini):

Causal (unidirectional) attention — each token only sees previous tokens
Trained with next-token prediction (standard language modeling)
Best for: open-ended generation, coding, reasoning chains
The dominant architecture since GPT-2

Encoder-decoder (T5, BART, original Transformer):

Encoder processes input with full attention; decoder generates output
Trained with sequence-to-sequence objectives
Best for: translation, summarization, structured output tasks
Still used in specialized pipelines
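Mechanically, the difference between bidirectional and causal attention is a single mask applied to the scores before softmax. A minimal sketch, with `masked_softmax` as a helper name introduced here and random scores standing in for QKᵀ/√d_k:

```python
import numpy as np

def masked_softmax(scores, causal):
    """Row-wise softmax; with causal=True, position i cannot see positions j > i."""
    if causal:
        n = scores.shape[0]
        future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True above the diagonal
        scores = np.where(future, -np.inf, scores)
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 4))   # placeholder for scaled query-key scores

bidirectional = masked_softmax(scores, causal=False)   # encoder-style: dense rows
causal = masked_softmax(scores, causal=True)           # decoder-style: lower triangular
```

In the causal case the first token can only attend to itself, so its weight row is a one-hot vector; an encoder-decoder model uses the dense form in the encoder and the triangular form in the decoder's self-attention.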

The decoder-only dominance is somewhat surprising: many assumed bidirectional context (as in an encoder) was essential for reasoning. In practice, large-scale next-token prediction on decoder-only models turned out to learn rich contextual representations without explicit bidirectional attention, and the architecture scaled better.

Emergent Abilities: The Phenomenon Nobody Predicted

Scaling experiments at OpenAI (the original scaling laws paper by Kaplan et al., 2020) established a predictable power-law relationship between loss and model size, data, and compute. What wasn’t predicted: emergent abilities.

Certain capabilities appeared abruptly at scale: not gradual improvement, but near-zero performance below a threshold and sharply higher performance above it. Examples:

Multi-step arithmetic: ~175B parameters (GPT-3 scale)
Chain-of-thought reasoning: appeared with few-shot prompting + scale
Code generation from docstrings: not useful at 7B, surprisingly capable at 70B

There’s debate about whether emergence is real (genuinely discontinuous) or an artifact of using the wrong metrics (smooth improvement in log-probability masked by discrete evaluation metrics). Either way, the practical observation stands: Transformer capabilities at the frontier have repeatedly exceeded what anyone projected.

Architectural Variants Worth Knowing

Model	Key Architectural Differences
GPT-2/3	Decoder-only, learned positional embeddings
LLaMA 2	RoPE, SwiGLU, GQA (in 70B), no bias terms
Mistral 7B	Sliding window attention, GQA
Gemma	Multi-Query Attention (in the 2B model), GeGLU activation
DeepSeek V2	MLA (Multi-head Latent Attention), MoE routing
GPT-4 (rumored)	Mixture-of-Experts (MoE) with 16 expert routing

Mixture-of-Experts (MoE) deserves special mention: instead of activating all FFN parameters for every token, a router selects a small subset of expert FFN layers per token (for example 2 of 8, or 16 of 64). Total parameter count can then grow very large while compute per forward pass scales only with the active parameters. DeepSeek V3 (671B total parameters, ~37B active per token) demonstrated that MoE can match frontier models at significantly lower inference cost.
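A top-k router is short enough to sketch. The version below assumes one common design: pick the k highest-scoring experts per token and softmax only over those k scores. Load-balancing losses, capacity limits, and shared experts are all left out, and `make_expert` is a hypothetical helper that builds a small ReLU FFN.

```python
import numpy as np

def moe_layer(x, router_W, experts, top_k=2):
    """x: (n_tokens, d_model). experts: callables mapping (m, d_model) -> (m, d_model)."""
    logits = x @ router_W                                    # (n_tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]            # top_k expert ids per token
    top_logits = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(top_logits - top_logits.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)               # softmax over chosen experts
    out = np.zeros_like(x)
    for e, expert in enumerate(experts):
        token_idx, slot = np.nonzero(top == e)               # tokens routed to expert e
        if token_idx.size:                                   # unused experts do no work
            out[token_idx] += gates[token_idx, slot][:, None] * expert(x[token_idx])
    return out, top

def make_expert(rng, d_model, d_ff):
    # Hypothetical helper: one small ReLU FFN with its own weights.
    W1 = rng.standard_normal((d_model, d_ff)) * 0.1
    W2 = rng.standard_normal((d_ff, d_model)) * 0.1
    return lambda h: np.maximum(0.0, h @ W1) @ W2

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 8, 32, 8
experts = [make_expert(rng, d_model, d_ff) for _ in range(n_experts)]
router_W = rng.standard_normal((d_model, n_experts))
x = rng.standard_normal((5, d_model))

out, chosen = moe_layer(x, router_W, experts, top_k=2)
```

With 2 of 8 experts active, each token touches a quarter of the expert parameters, which is how total parameter count and per-token compute come apart.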

The Limits of Transformers

A few places where standard Transformers genuinely struggle:

True compositionality. Transformers learn to apply rules they’ve seen in training data. Tasks requiring systematic application of rules to novel combinations — the kind of thing formal grammars do naturally — require scale to approximate.

Length generalization. Models trained on sequences up to 4,096 tokens often degrade sharply beyond that length even when the architecture supports it. Positional encoding schemes designed to generalize to unseen lengths (RoPE with appropriate scaling, ALiBi) partially address this.

Causal reasoning with strict counterfactuals. “If the moon were twice as heavy, would tides be stronger?” Transformers produce plausible-sounding answers, but whether they’re actually reasoning causally or pattern-matching to similar text is unclear.

Algorithmic computation. Transformers are poor at exact multi-step computation (e.g., 47-digit arithmetic, executing a sorting algorithm). Chain-of-thought prompting helps by externalizing intermediate steps into the context — essentially using the sequence as working memory.

One Thing to Remember

The Transformer’s attention mechanism is O(n²) in sequence length, which is the root of many of the engineering tradeoffs in modern LLMs: context window size, inference cost, KV cache memory, and why Flash Attention and sparse or linear attention variants matter. Much else in the frontier model wars is, in some sense, an optimization on top of this one equation: softmax(QKᵀ / √d_k) · V.
