Transformer Architecture — Deep Dive

Beyond the Intuition

Most explanations of transformers stop at “words pay attention to each other.” That’s true, and it’s useful. But if you want to actually understand what’s going on — what shapes you’re pushing through what operations, what the real bottlenecks are, why certain design choices were made — you need to go deeper.

This is that deeper explanation.

Tokens, Embeddings, and the Input Pipeline

Before anything attention-related happens, text has to become numbers.

Text gets split into tokens, which are not always whole words. GPT-4 uses a byte-pair encoding (BPE) tokenizer. “transformer” might be one token. “unbelievable” might be three. As a rule of thumb for English text, a word averages roughly 1.3 tokens and a token covers about 4 characters.

Each token maps to an embedding: a vector in a high-dimensional space. For GPT-3, that’s 12,288 dimensions. These embeddings are learned during training — similar concepts end up geometrically close to each other.

The positional encoding gets added to this embedding before it enters the transformer layers. In the original paper, the encoding for position pos used a sine for each even dimension 2i and a cosine for each odd dimension 2i+1:

PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
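
As a minimal sketch, the two formulas can be written out directly in plain Python. The function name and the tiny d_model=8 are illustrative assumptions, not anything from the original paper's code.

```python
import math

def sinusoidal_position_encoding(pos, d_model):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    encoding = []
    for dim in range(d_model):
        i = dim // 2  # dimensions 2i and 2i+1 share one frequency
        angle = pos / (10000 ** (2 * i / d_model))
        # Even dimensions use sine, odd dimensions use cosine.
        encoding.append(math.sin(angle) if dim % 2 == 0 else math.cos(angle))
    return encoding

pe0 = sinusoidal_position_encoding(pos=0, d_model=8)
pe5 = sinusoidal_position_encoding(pos=5, d_model=8)
```

At position 0 every sine term is 0 and every cosine term is 1. Later positions rotate each pair at its own frequency, fastest in the first dimensions.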

Modern models often use rotary positional embeddings (RoPE, used in LLaMA, Mistral) or ALiBi, which handle longer sequences more gracefully. The original sinusoidal approach works but doesn’t extrapolate well to sequences longer than those seen during training.

Scaled Dot-Product Attention: The Actual Math

Each token’s embedding gets linearly projected into three vectors:

  • Query (Q): what this token is looking for
  • Key (K): what this token is offering
  • Value (V): what this token will contribute if selected

For a sequence of length n with model dimension d_model, the attention operation for a single head with dimension d_k is:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

Let’s unpack this:

  1. QK^T: a dot product between every query and every key. This produces an n × n matrix of raw attention scores. A high score means the learned query projection of one token aligns with the learned key projection of another, which is often (but not always) a sign that the two tokens are related.

  2. / sqrt(d_k): scaling keeps the dot products from growing with dimension. If the components of q and k have roughly unit variance, their dot product has variance d_k, so unscaled scores get large in high dimensions. Large scores saturate the softmax, which puts almost all weight on one position and yields near-zero gradients.

  3. softmax(...): converts each row of scores into a probability distribution. Each row sums to 1.

  4. * V: a weighted sum of value vectors. Each output position gets a combination of all values, weighted by how much attention each one received.
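
The four steps above can be sketched in plain Python with lists of lists. This is a toy, unoptimized version for a single head; the example Q, K, V values are made up for illustration.

```python
import math

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Q, K: n x d_k; V: n x d_v (lists of lists). Returns (output, weights)."""
    scale = math.sqrt(len(K[0]))
    # Steps 1 and 2: scaled dot products between every query and every key (n x n).
    scores = [[sum(a * b for a, b in zip(q, k)) / scale for k in K] for q in Q]
    # Step 3: softmax over each row.
    weights = [softmax(row) for row in scores]
    # Step 4: weighted sum of value vectors.
    d_v = len(V[0])
    output = [[sum(w * v[j] for w, v in zip(w_row, V)) for j in range(d_v)]
              for w_row in weights]
    return output, weights

Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, w = attention(Q, K, V)
```

Here the first query matches the first and third keys equally, so those two positions get equal weight and the second gets less.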

That n × n matrix is why attention is O(n²) in sequence length. For n=1000 tokens, that’s a million weights. For n=100,000 tokens (context windows are getting long), it’s 10 billion — per head, per layer.
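
To make the quadratic growth concrete, here is the arithmetic as a small sketch. The 2 bytes per entry assumes the scores are stored in fp16 and that the full matrix is actually materialized, which optimized kernels avoid.

```python
def attention_matrix_entries(n):
    # One score per (query, key) pair.
    return n * n

def attention_matrix_bytes(n, bytes_per_entry=2):
    # Assumption: fp16 storage, full matrix held in memory.
    return attention_matrix_entries(n) * bytes_per_entry

small = attention_matrix_entries(1_000)      # one million scores
large = attention_matrix_entries(100_000)    # ten billion scores
large_bytes = attention_matrix_bytes(100_000)
```

Under those assumptions the 100,000-token case is about 20 GB for one head in one layer, which is why nobody stores it naively.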

Multi-Head Attention: Why One Isn’t Enough

The original paper ran attention h times in parallel with different learned projections:

MultiHead(Q, K, V) = Concat(head_1, ..., head_h) * W^O

where head_i = Attention(Q * W_i^Q, K * W_i^K, V * W_i^V)
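
A rough sketch of the formula in plain Python, for self-attention (so Q, K and V are all the same input X). The random weights, the tiny sizes, and the helper names are assumptions for illustration; real implementations batch all heads into one tensor operation.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    scale = math.sqrt(len(K[0]))
    K_T = [list(col) for col in zip(*K)]
    weights = [softmax([s / scale for s in row]) for row in matmul(Q, K_T)]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_O):
    """X: n x d_model. heads: list of (W_Q, W_K, W_V), each d_model x d_k.
    W_O: (h * d_k) x d_model."""
    head_outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
                    for W_Q, W_K, W_V in heads]
    # Concat: join the per-head outputs along the feature axis, n x (h * d_k).
    concat = [sum((head[t] for head in head_outputs), []) for t in range(len(X))]
    return matmul(concat, W_O)

rng = random.Random(0)
n, d_model, h = 4, 8, 2
d_k = d_model // h
X = random_matrix(n, d_model, rng)
heads = [tuple(random_matrix(d_model, d_k, rng) for _ in range(3)) for _ in range(h)]
W_O = random_matrix(h * d_k, d_model, rng)
Y = multi_head_attention(X, heads, W_O)
```

Splitting d_model across heads (d_k = d_model / h) keeps the total cost close to that of one full-width head, while giving each head its own projections.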

GPT-3 uses 96 attention heads in its 96-layer model. Each head learns to specialize. Interpretability research (particularly from Anthropic and EleutherAI) has found that some heads reliably track syntax, others co-reference, others implement something like named entity recognition. Nobody explicitly taught them to do this — it emerges from training on prediction.

The “multiple lenses” metaphor is reasonably accurate: different heads do develop different specializations. Ablation and pruning studies show a more uneven picture, though: many individual heads can be removed with little measurable effect, while removing certain specialized heads degrades specific capabilities and leaves others intact.

The Feed-Forward Layer: What It’s Actually Doing

After attention, each position independently runs through a two-layer MLP:

FFN(x) = max(0, xW_1 + b_1) * W_2 + b_2

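The intermediate dimension is typically 4× the model dimension. For GPT-3 with d_model=12288, that’s 49,152 hidden units per layer, applied independently at every position. The formula above uses ReLU, as in the original paper; GPT-style models typically swap in GELU.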
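
Here is a minimal sketch of the FFN formula for a single position, using ReLU as in the formula above. The hand-picked weights and the toy sizes (d_model=2, d_ff=8, keeping the 4× ratio) are assumptions chosen so the result is easy to check by hand.

```python
def feed_forward(x, W1, b1, W2, b2):
    """x: d_model vector. W1: d_model x d_ff. W2: d_ff x d_model."""
    # First layer plus ReLU: max(0, xW_1 + b_1)
    hidden = [max(0.0, sum(x_i * W1[i][j] for i, x_i in enumerate(x)) + b1[j])
              for j in range(len(b1))]
    # Second layer: project back down to d_model.
    return [sum(h_j * W2[j][k] for j, h_j in enumerate(hidden)) + b2[k]
            for k in range(len(b2))]

W1 = [[1.0, -1.0, 0.0, 0.0, 1.0, -1.0, 0.0, 0.0],
      [0.0, 0.0, 1.0, -1.0, 0.0, 0.0, 1.0, -1.0]]
b1 = [0.0] * 8
W2 = [[1.0, 0.5] for _ in range(8)]
b2 = [0.0, 1.0]

y = feed_forward([1.0, -2.0], W1, b1, W2, b2)
gpt3_d_ff = 4 * 12288
```

Only four of the eight hidden units fire for this input; the rest are zeroed by the ReLU. That sparsity is part of what makes the key-value-memory reading plausible.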

Recent interpretability work (Geva et al., 2021; followed up by others) suggests the feed-forward layers function as key-value memories — specific neurons activate for specific input patterns and contribute specific stored “facts” to the output. One neuron might activate strongly for “the first president of” and push the output toward “George Washington.”

This is mechanistically different from attention, which is about relationships between positions. Attention is routing; FFN is fact recall. Both are needed.

Decoder-Only vs Encoder-Decoder vs Encoder-Only

Three major variants emerged from the original architecture:

Encoder-Only (BERT, RoBERTa): Full bidirectional attention — every token sees every other. Great for understanding tasks (classification, NER, semantic search). Can’t generate text. Training objective: masked language modeling (predict randomly masked tokens).

Decoder-Only (GPT series, LLaMA, Mistral, Claude): Causal/autoregressive attention, where each token attends only to itself and earlier positions. Generates text token by token. Training objective: next-token prediction. The dominant architecture for language generation as of 2025.

Encoder-Decoder (original “Attention Is All You Need,” T5, mT5): Encoder builds a representation; decoder attends to it via cross-attention while generating. Naturally suited for translation and seq2seq tasks. More complex to train and deploy.
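
Mechanically, the difference between bidirectional and causal attention comes down to a mask applied before the softmax. The sketch below uses all-zero scores so the effect of the mask is easy to see; the function names are illustrative.

```python
import math

def bidirectional_mask(n):
    # Encoder-style: every position may attend to every other.
    return [[True] * n for _ in range(n)]

def causal_mask(n):
    # Decoder-style: position i may attend to positions j <= i only.
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax; positions where mask is False get zero weight."""
    out = []
    for row, mask_row in zip(scores, mask):
        mx = max(s for s, keep in zip(row, mask_row) if keep)
        exps = [math.exp(s - mx) if keep else 0.0 for s, keep in zip(row, mask_row)]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.0] * 4 for _ in range(4)]
full = masked_softmax(scores, bidirectional_mask(4))
causal = masked_softmax(scores, causal_mask(4))
```

With the causal mask, the first token can only attend to itself, the second splits its weight over two positions, and so on. With the full mask, every row spreads weight over all four.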

The shift toward decoder-only dominance happened around 2019-2020, driven by the scaling results from GPT-2 and then GPT-3. Next-token prediction on internet-scale data turned out to be a surprisingly powerful general pre-training objective.

KV Cache: The Inference Optimization You Should Know

During inference, autoregressive generation is slow: to generate token n, you need to run attention over all n-1 previous tokens.

But here’s the thing: you already computed the Keys and Values for those previous tokens. They don’t change. The KV cache stores them so you only compute Q, K, V for the new token — then attend to cached K and V for all previous positions.
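
A toy single-head decoding loop shows the idea. The `project` function is a hypothetical stand-in for the learned Q/K/V projections (it just scales the input), and the `KVCache` class is an illustrative name, not a real library API.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    scale = math.sqrt(len(q))
    weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
    return [sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))]

def project(x, factor):
    # Hypothetical stand-in for a learned linear projection.
    return [factor * x_i for x_i in x]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def decode_step(x_new, cache):
    # Only the new token is projected; older K and V come from the cache.
    q, k, v = project(x_new, 1.0), project(x_new, 0.5), project(x_new, 2.0)
    cache.append(k, v)
    return attend(q, cache.keys, cache.values)

def decode_without_cache(xs):
    # Recomputes K and V for the whole prefix at every step.
    keys = [project(x, 0.5) for x in xs]
    values = [project(x, 2.0) for x in xs]
    return attend(project(xs[-1], 1.0), keys, values)

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cache = KVCache()
cached_outputs = [decode_step(x, cache) for x in tokens]
uncached_last = decode_without_cache(tokens)
```

Both paths produce the same output for the last token. The cached version does one projection per step instead of one per prefix token.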

This is why long contexts are expensive: the KV cache grows linearly with sequence length. A 128K context window, as offered by GPT-4 Turbo, requires enormous memory just to hold the cache. At fp16, a single 100K-token KV cache for a 70B parameter model with standard multi-head attention can exceed 100GB; grouped-query attention shrinks that by the ratio of query heads to key/value heads.
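
The cache size is simple arithmetic. In the sketch below the dimensions are assumptions modeled on a LLaMA-2-70B-style configuration: 80 layers, 64 attention heads, head dimension 128, and 8 key/value heads when grouped-query attention is used.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # Factor of 2: one key vector and one value vector per token, head, and layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens

# Assumed dimensions (see lead-in); fp16 means 2 bytes per value.
full_mha = kv_cache_bytes(n_tokens=100_000, n_layers=80, n_kv_heads=64, head_dim=128)
with_gqa = kv_cache_bytes(n_tokens=100_000, n_layers=80, n_kv_heads=8, head_dim=128)

full_mha_gb = full_mha / 1e9
with_gqa_gb = with_gqa / 1e9
```

Under these assumptions, full multi-head attention needs roughly 262 GB for one 100K-token sequence, while 8 key/value heads bring that down to roughly 33 GB.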

This is an active engineering problem. Multi-query attention (MQA) and grouped query attention (GQA, used in LLaMA 3) share key/value heads across query heads, trading a little model quality for memory savings; quantized KV caches trade some precision instead.

The Quadratic Problem and Alternatives

O(n²) attention is the transformer’s Achilles heel. Industry workarounds:

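  • Sliding window attention (Mistral, Longformer): attend only to a local window of tokens, optionally plus a few global tokens (as in Longformer). Complexity is linear in sequence length, but a single layer can’t connect tokens farther apart than the window; longer-range information has to propagate across stacked layers.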
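  • Flash Attention (Dao et al., 2022): doesn’t reduce the O(n²) compute, but computes attention in tiles so the full n × n matrix never has to be written to GPU main memory. That cuts memory traffic and brings attention’s memory footprint down to linear in n. 2-4× faster in practice. Flash Attention 2 and 3 pushed this further.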
  • Linear attention variants: approximate the softmax with a kernel function to get O(n) complexity. So far, these haven’t matched standard attention quality on real benchmarks.
  • Mixture of Experts (MoE): doesn’t fix attention complexity, but scales model capacity without scaling all parameters on every token. GPT-4 is widely believed to use MoE; Mixtral 8x7B is the most prominent open implementation.
  • State Space Models (Mamba, S4): a competing architecture that’s linear in sequence length. Hasn’t displaced transformers yet but is genuinely competitive on certain tasks.
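
To put a number on the sliding-window item above, this sketch counts how many (query, key) pairs get scored under full causal attention versus a causal sliding window. The window size of 4,096 is an assumed value for illustration.

```python
def causal_pairs(n):
    # Query i attends to positions 0..i, so i + 1 keys each.
    return n * (n + 1) // 2

def sliding_window_pairs(n, window):
    # Query i attends to itself and at most window - 1 earlier tokens.
    return sum(min(i + 1, window) for i in range(n))

n = 100_000
window = 4_096  # assumed window size
full = causal_pairs(n)
windowed = sliding_window_pairs(n, window)
savings = full / windowed
```

The windowed count grows roughly as n × window rather than n², so the gap widens as the sequence gets longer.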

Training: What Actually Happens

The loss function is cross-entropy on next-token prediction:

L = -sum(log P(token_t | token_1, ..., token_{t-1}))
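
As a small sketch, here is that loss for a two-step sequence over a three-token vocabulary. The probabilities are made up; in a real model they come from a softmax over the output logits.

```python
import math

def next_token_loss(predicted_probs, target_ids):
    """predicted_probs[t]: the model's distribution over the vocabulary at step t.
    target_ids[t]: index of the token that actually came next."""
    return -sum(math.log(probs[target])
                for probs, target in zip(predicted_probs, target_ids))

probs = [[0.5, 0.25, 0.25],   # step 1: the true next token (index 0) got 0.5
         [0.1, 0.8, 0.1]]     # step 2: the true next token (index 1) got 0.8
targets = [0, 1]
loss = next_token_loss(probs, targets)
```

The loss is zero only if the model puts probability 1 on every correct token, and it grows quickly as the correct token's probability approaches zero.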

This is summed over the entire training corpus, which runs from hundreds of billions to trillions of tokens for modern models (in practice it’s averaged over mini-batches). Optimization is via Adam or variants (AdamW with weight decay is standard).

Gradient checkpointing is typically necessary: storing every activation for backprop through a 96-layer model at large batch sizes would exceed the memory of any single GPU. Instead, only a subset of activations is kept, and the rest are recomputed during the backward pass, trading extra compute for memory.

Mixed precision training (fp16 or bf16 for forward/backward, fp32 for optimizer state) became standard around 2019. bf16 (bfloat16) is preferred over fp16 for large models because it has the same exponent range as fp32, making it less susceptible to overflow during training.
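
The range difference can be checked with the standard library alone. In this sketch, bf16 is simulated by keeping the top 16 bits of the fp32 encoding, which truncates rather than rounds to nearest; that is a simplifying assumption, and real hardware rounding may differ in the last bit.

```python
import struct

def to_bf16(x):
    """Simulate bfloat16 by zeroing the low 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]

def fits_in_fp16(x):
    """True if x can be packed as an IEEE half-precision float."""
    try:
        struct.pack(">e", x)
        return True
    except (OverflowError, struct.error):
        return False

ok_small = fits_in_fp16(60000.0)   # below fp16's maximum of 65504
ok_large = fits_in_fp16(70000.0)   # above it
bf16_large = to_bf16(70000.0)      # coarse, but still finite
bf16_huge = to_bf16(1e38)          # far outside fp16's range
```

bf16 keeps fp32's 8 exponent bits and gives up mantissa bits instead, so large values survive with reduced precision rather than overflowing.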

What the Scaling Laws Actually Said

Kaplan et al. (2020) and Hoffmann et al. (2022, “Chinchilla”) established empirical power laws: model performance on language tasks scales predictably with compute, dataset size, and parameter count.

Chinchilla specifically showed that most models were undertrained: a smaller model trained on more tokens can match a larger model trained on fewer. Its rule of thumb works out to roughly 20 training tokens per parameter, or about 140B tokens for a 7B model. LLaMA went well past even that, training its 7B model on 1T tokens, because a small model trained beyond the compute-optimal point is cheaper to run at inference time.
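
The numbers here are quick to reproduce. The sketch assumes the commonly quoted approximation of about 20 training tokens per parameter; the exact ratio depends on the compute budget.

```python
TOKENS_PER_PARAM = 20  # assumed rule-of-thumb ratio, an approximation

def compute_optimal_tokens(n_params, tokens_per_param=TOKENS_PER_PARAM):
    return n_params * tokens_per_param

params_7b = 7_000_000_000
optimal = compute_optimal_tokens(params_7b)
llama_tokens = 1_000_000_000_000
overtraining_ratio = llama_tokens / optimal
```

Under that assumption, 1T tokens is about seven times the roughly 140B tokens the rule of thumb gives for a 7B model.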

This had real industry consequences — companies retooled training runs, dataset collection became as important as architecture, and the emphasis shifted from “biggest model” to “compute-optimal model.”

One thing to remember: The transformer’s O(n²) attention cost isn’t just a footnote — it’s the central constraint shaping every architectural decision in modern AI. Almost every major research direction in 2024-2025 (long context, faster inference, cheaper training) traces back to someone trying to work around it. When you understand that constraint, a lot of the AI landscape suddenly makes sense.
