Large Language Models — Deep Dive

The transformer math, training dynamics, and weird failure modes that textbooks skip — what's actually happening when an LLM generates a token.

The Mechanics, Warts and All

Most explanations of LLMs stop at “predicts the next word.” That’s accurate but useless for engineering — or for understanding why these models fail in the ways they do. This is the version that doesn’t skip the math.

Tokenization: The Step Before Everything

Text goes in. But “text” doesn’t directly feed into a transformer — tokens do.

Modern LLMs use Byte Pair Encoding (BPE) or similar subword tokenization. The vocabulary (typically 50,000–200,000 tokens; GPT-2 uses 50,257, Llama 3 about 128,000) is built by iteratively merging the most frequent adjacent pair of symbols in the training corpus, starting from bytes or characters. The result: common words are one token, rare words fragment into pieces. The splits below are illustrative; exact boundaries depend on the tokenizer.

"embedding" → ["embed", "ding"]        # 2 tokens
"cat"        → ["cat"]                  # 1 token
"猫"          → ["猫"]                  # 1 token if it is in the vocabulary; CJK text often costs more tokens per character
"xylophone"  → ["xylo", "ph", "one"]   # 3 tokens

This has practical consequences. GPT-4 struggles with character-counting tasks (“how many r’s in ‘strawberry’?”) partly because it never directly sees characters; it sees tokens. “strawberry” might be a single token or a few multi-letter pieces, so the individual letters aren’t directly accessible.

Tokens also explain why LLMs are slow on long outputs: each generated token requires its own forward pass, and the passes run sequentially. At roughly 0.75 English words per token, 1,000 words ≈ 1,333 tokens ≈ 1,333 sequential forward passes at inference.
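To make the merge procedure from this section concrete, here is a minimal sketch of BPE training on a toy corpus. It works on characters rather than bytes and uses a made-up three-word corpus; the helper names `most_frequent_pair` and `merge_pair` are illustrative, not taken from any tokenizer library.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    # Ties are broken by first occurrence (dict insertion order).
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

# Toy corpus (made up): word -> frequency, each word split into characters.
corpus = {"low": 5, "lower": 2, "lowest": 3}
words = {tuple(w): f for w, f in corpus.items()}

merges = []
for _ in range(2):          # two merge steps; real vocabularies use tens of thousands
    pair = most_frequent_pair(words)
    merges.append(pair)
    words = merge_pair(words, pair)

print(merges)   # the learned merge rules, in order
print(words)    # the corpus re-segmented with those rules
```

After two merges the frequent word "low" has collapsed into a single symbol, while the rarer suffixes in "lower" and "lowest" are still split into characters. That is the same pattern as the examples above, at toy scale.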

The Transformer Architecture

The original transformer (a seq2seq model) has two components, an encoder and a decoder, but modern LLMs typically use decoder-only architectures (GPT-style). Here’s the structure of a decoder-only model, drawn with the original post-LayerNorm ordering:

Input tokens → Embedding matrix (token + positional) →
  [N × Transformer layers]:
    Multi-head Self-Attention → Add & LayerNorm →
    Feed-Forward Network (2-layer MLP) → Add & LayerNorm →
Output logits → Softmax → Token probabilities

GPT-3 (175B) has 96 layers with 96 attention heads per layer. Layer and head counts for GPT-4-class models have not been published, but each layer in a large model typically has dozens of attention heads.

Self-Attention, Precisely

For a sequence of tokens, each token is projected into three vectors: Query (Q), Key (K), and Value (V) via learned weight matrices.

Attention scores between token i and token j:

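score(i, j) = softmax_j( Q_i · K_j / √d_k )

Where d_k is the per-head key dimension (e.g., 64) and the softmax normalizes over j, so the scores for each token i sum to 1. The √d_k scaling prevents dot products from getting so large they push softmax into saturation, where gradients vanish.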

The output for token i is a weighted sum of all Value vectors:

output_i = Σ_j score(i,j) · V_j

In plain terms: each token gets to “look at” every other token and decide how much to pull from each one. This is what enables capturing long-range dependencies — the whole sequence is visible simultaneously, not processed step by step.
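The two formulas above fit in a few lines of NumPy. This is a minimal single-head sketch with no masking and no batching; the sizes and the random weight matrices are placeholders, not values from any real model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over a sequence X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # (seq_len, seq_len) raw scores
    weights = softmax(scores, axis=-1)    # row i: how much token i pulls from each j
    return weights @ V, weights           # output_i = sum_j weights[i, j] * V_j

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8           # toy sizes, chosen for illustration
X = rng.normal(size=(seq_len, d_model))   # stand-in token embeddings
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape)               # one output vector per input token
print(weights.sum(axis=-1))    # each row of attention weights sums to 1
```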

Multi-head means this happens in parallel across many subspaces (typically 64-96 heads in large models). Different heads learn to capture different kinds of relationships: syntactic agreement, coreference, positional proximity, semantic similarity. No one programs this — it emerges from training.

Causal masking is applied in decoder-only language models, during both training and generation: token i can only attend to positions ≤ i. This enforces the autoregressive constraint (you can’t use future tokens to predict the present).
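One common way to implement the mask is to set scores for future positions to negative infinity before the softmax, so they receive exactly zero weight. A small sketch, using all-zero scores so the effect of the mask is easy to see:

```python
import numpy as np

def causal_attention_weights(scores):
    """Softmax over each row, with positions j > i masked out."""
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=-1, keepdims=True)  # diagonal is never masked, so max is finite
    e = np.exp(masked)                                    # exp(-inf) == 0 for masked positions
    return e / e.sum(axis=-1, keepdims=True)

scores = np.zeros((4, 4))       # equal raw scores everywhere
weights = causal_attention_weights(scores)
print(weights)
```

With equal raw scores, token 0 puts all its weight on itself, token 1 splits evenly across positions 0 and 1, and so on. The upper triangle stays at zero.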

Training Dynamics

Pretraining optimizes cross-entropy loss over next-token prediction:

L = -Σ_t log P(token_t | token_1, ..., token_{t-1})
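As a sanity check on the loss formula, here is a minimal NumPy sketch that computes it from raw logits. The function name `next_token_loss` and the toy sizes are illustrative. Training code usually averages over tokens rather than summing, which only changes the scale.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Summed cross-entropy: -sum_t log P(target_t | context up to t)."""
    # Log-softmax, computed stably by subtracting the row max first.
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    # Pick out the log-probability assigned to the correct next token at each position.
    return -log_probs[np.arange(len(targets)), targets].sum()

vocab_size = 5
logits = np.zeros((3, vocab_size))    # a model that is uniform over the vocabulary
targets = np.array([1, 4, 2])         # the actual next tokens (toy values)
loss = next_token_loss(logits, targets)
print(loss, 3 * np.log(vocab_size))   # a uniform guesser pays log(vocab_size) per token
```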

Across a corpus of ~15 trillion tokens (Meta’s Llama 3 scale), with a batch size of several million tokens per step, this loss is minimized with the AdamW optimizer. Learning rate schedules typically warm up linearly over the first few hundred to few thousand steps, then cosine-decay to near zero or a small fraction of the peak.
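The warmup-then-cosine schedule is easy to write down. A minimal sketch, where the peak learning rate, warmup length, and total step count are illustrative values rather than settings from any particular training run:

```python
import math

def lr_at(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Illustrative settings (assumptions, not from a real run).
PEAK, WARMUP, TOTAL = 3e-4, 2_000, 100_000
for step in (0, 1_999, 51_000, 100_000):
    print(step, lr_at(step, PEAK, WARMUP, TOTAL))
```

The learning rate climbs to the peak at the end of warmup, passes through half the peak at the midpoint of the decay, and ends at `min_lr`.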

Frontier models have parameter counts in the hundreds of billions or more (GPT-4 is estimated at ~1.76 trillion in a Mixture-of-Experts variant, though OpenAI hasn’t confirmed). At bf16 precision (2 bytes per parameter), 1 trillion parameters = ~2TB of weights just to store. Serving this requires model parallelism across many GPUs; NVIDIA A100/H100 clusters of 1,000–10,000 GPUs are typical for training runs.
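The storage arithmetic is worth doing explicitly. This sketch counts weight memory only and assumes 80 GB of memory per GPU (the larger A100/H100 configuration); activations, KV-cache, and optimizer state would all add to the real requirement.

```python
import math

BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "int8": 1, "int4": 0.5}

def weight_memory_tb(n_params, dtype):
    """Terabytes needed just to store the weights."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e12

def min_gpus_for_weights(n_params, dtype, gpu_memory_gb=80):
    """Lower bound on GPU count: weights only, perfectly packed (an idealization)."""
    total_gb = n_params * BYTES_PER_PARAM[dtype] / 1e9
    return math.ceil(total_gb / gpu_memory_gb)

n = 1e12  # 1 trillion parameters
print(weight_memory_tb(n, "bf16"))       # terabytes of weights at bf16
print(min_gpus_for_weights(n, "bf16"))   # GPUs needed just to hold them
print(min_gpus_for_weights(n, "int4"))   # same model after 4-bit quantization
```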

Training Instability

At scale, training large models is fragile. Loss spikes (sudden upward jumps in loss) occur unpredictably — sometimes caused by anomalous batches in the data, sometimes by gradient explosions. The standard mitigations:

Gradient clipping: cap gradient norm before applying updates
Loss spike recovery: roll back to a checkpoint from ~100 steps before the spike and skip or resample the offending data
Architecture choices: Pre-LayerNorm (as opposed to Post-LayerNorm) greatly stabilizes deep transformers
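Of the mitigations listed, gradient clipping is the simplest to show. A minimal sketch of clipping by global norm, with gradients represented as a plain list of NumPy arrays (a simplification of how frameworks store them):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    # Scale is 1.0 when the norm is already within bounds; the epsilon avoids division by zero.
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]   # combined norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, norm_after)
```

Because every tensor is scaled by the same factor, the direction of the update is preserved; only its length is capped.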

This is why training 100B+ parameter models requires significant engineering work beyond just picking hyperparameters.

Post-Training: RLHF in Detail

Raw pretraining produces a model that completes text. RLHF turns it into something usable. The pipeline:

1. Supervised Fine-Tuning (SFT): The model is fine-tuned on ~10,000–100,000 human-written prompt/response pairs. Training proceeds like pretraining but on this narrow distribution. This anchors the model to instruction-following behavior.

2. Reward Model Training: Human raters compare two model outputs for the same prompt and pick the better one. These comparisons train a separate model (same architecture, different final layer) to predict which response humans prefer. This reward model will score outputs during RL.

3. PPO (Proximal Policy Optimization): The policy starts as a copy of the SFT model. It generates responses. The reward model scores them. PPO updates the policy to maximize expected reward while penalizing large divergence from the SFT model (via a KL-divergence penalty):

objective = E[r(x,y)] - β · KL(π_RL || π_SFT)

The KL term matters: without it, the policy collapses to reward hacking — generating responses that trick the reward model without being genuinely better.
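Steps 2 and 3 each reduce to a short formula. The sketch below assumes the reward model is trained with the standard pairwise (Bradley-Terry) loss, which the text above does not specify, and estimates the KL term per sample as the log-probability gap between the policy and the SFT model on a response the policy generated. Function names and numbers are illustrative.

```python
import math

def pairwise_preference_loss(r_chosen, r_rejected):
    """-log sigmoid(r_chosen - r_rejected), computed stably as softplus(-diff)."""
    diff = r_chosen - r_rejected
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

def kl_penalized_reward(reward, logp_policy, logp_sft, beta):
    """Per-sample estimate of r(x, y) - beta * KL(pi_RL || pi_SFT).

    For a response y sampled from the policy, log pi_RL(y|x) - log pi_SFT(y|x)
    is a single-sample estimate of the KL divergence.
    """
    return reward - beta * (logp_policy - logp_sft)

# Reward model: the loss is small when the human-preferred response scores higher.
print(pairwise_preference_loss(2.0, 0.0))   # ranking agrees with the rater
print(pairwise_preference_loss(0.0, 2.0))   # ranking disagrees with the rater

# RL step: the same raw reward is worth less if the policy drifted far from SFT.
print(kl_penalized_reward(1.0, logp_policy=-10.0, logp_sft=-10.5, beta=0.1))
print(kl_penalized_reward(1.0, logp_policy=-10.0, logp_sft=-30.0, beta=0.1))
```

In the last line the policy assigns far more probability to its response than the SFT model does, so the penalty outweighs the reward. That is the pressure that discourages reward hacking.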

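2024 shift: Many labs moved from PPO to Direct Preference Optimization (DPO; Rafailov et al., 2023), which skips the separate reward model and RL loop and trains the policy directly on preference pairs with a classification-style loss. It is simpler to implement and avoids RL instability, with quality that is often comparable to PPO, though results vary by task.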
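For a single preference pair, the DPO loss from the original paper is a logistic loss on how much more the policy favors the chosen response than a frozen reference model does. A minimal sketch, taking sequence log-probabilities as plain floats; the default `beta` and the example values are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log sigmoid(beta * [(policy - ref) on chosen - (policy - ref) on rejected])."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: no preference expressed yet.
print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
# Policy has shifted probability toward the chosen response: lower loss.
print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The reference log-probabilities play the role of the KL anchor in PPO: the loss only rewards moving relative to the reference, scaled by `beta`.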

The Scaling Laws

Kaplan et al. (2020) showed that loss scales as a power law with parameters, data, and compute:

L ∝ N^{-0.076}   (params)
L ∝ D^{-0.095}   (data tokens)
L ∝ C^{-0.050}   (compute FLOPs)
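Exponents this small are easy to misread, so it helps to plug in numbers. A short sketch using the exponents above and treating each power law in isolation (the other resources are assumed not to be the bottleneck):

```python
def loss_ratio(scale_factor, exponent):
    """Multiplicative change in loss when a resource grows by scale_factor, given L ∝ X^(-exponent)."""
    return scale_factor ** (-exponent)

def scale_needed(target_ratio, exponent):
    """How much a resource must grow to multiply the loss by target_ratio."""
    return target_ratio ** (-1.0 / exponent)

EXPONENTS = {"params": 0.076, "data": 0.095, "compute": 0.050}

for name, alpha in EXPONENTS.items():
    print(name, "10x ->", round(loss_ratio(10, alpha), 3), "of the original loss")

print("params needed to halve loss:", round(scale_needed(0.5, EXPONENTS["params"])), "x")
```

A 10x increase in parameters removes roughly 16% of the loss, and halving the loss through parameters alone would take on the order of 9,000x more of them. This is why scaling is expensive.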

Hoffmann et al. (2022), the “Chinchilla” paper, refined this: for a given compute budget, optimal training uses ~20 tokens per parameter. Prior models (including GPT-3) were undertrained. Chinchilla-70B, trained on 1.4 trillion tokens, outperformed Gopher, a 280B-parameter model trained on roughly 300 billion tokens.

This reframing changed how labs think about model sizing. Llama 3’s 8B model was trained on 15 trillion tokens — far more than the Chinchilla-optimal amount — specifically to build a smaller, cheaper-to-run model that’s still very capable (inference-optimal vs. training-optimal).
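The 20-tokens-per-parameter rule of thumb makes the gap easy to quantify. A quick sketch using only the figures quoted in this section:

```python
TOKENS_PER_PARAM = 20   # Chinchilla rule of thumb quoted above

def chinchilla_optimal_tokens(n_params):
    return TOKENS_PER_PARAM * n_params

def overtraining_factor(n_params, tokens_trained):
    """How many times the compute-optimal token count a model was trained on."""
    return tokens_trained / chinchilla_optimal_tokens(n_params)

chinchilla_70b = chinchilla_optimal_tokens(70e9)
llama3_8b = overtraining_factor(8e9, 15e12)

print(f"Chinchilla-70B optimal tokens: {chinchilla_70b:.2e}")
print(f"Llama 3 8B trained on {llama3_8b:.1f}x its compute-optimal token count")
```

The 70B figure matches the 1.4 trillion tokens Chinchilla was actually trained on, while Llama 3 8B sits almost two orders of magnitude past the compute-optimal point.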

Why LLMs Fail (and How)

Hallucination at the mechanistic level: The model has no mechanism to distinguish “I encoded this fact in training” from “this completion looks plausible given the context.” Both produce the same kind of confident output. Factuality is not a first-class objective in standard pretraining.

Sycophancy: RLHF-trained models learn that humans rate responses higher when they agree with the human’s stated view. This introduces a systematic bias where the model will often capitulate to pushback even when it was originally correct. Documented in several papers from Anthropic and OpenAI.

In-context length degradation: Even within the context window, retrieval degrades for information in the middle of long contexts (“lost in the middle,” Liu et al. 2023). The attention mechanism can in principle reach the whole window, but in practice models use information near the start and end most reliably; the cause is not fully settled. Production RAG systems often chunk documents and place the most relevant pieces at the edges of context.

Tokenization artifacts: Models underperform on tasks that require character-level reasoning (spelling, anagrams, simple arithmetic in unusual notation) because the token boundary doesn’t align with the task structure.
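The edge-placement heuristic mentioned under in-context length degradation can be sketched in a few lines. This is one possible ordering, assuming the chunks arrive already ranked from most to least relevant; the function name `order_for_edges` is hypothetical and not from any RAG library.

```python
def order_for_edges(chunks_by_relevance):
    """Interleave ranked chunks so the best land at the start and end of the context."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate: rank 1 goes to the front, rank 2 to the back, rank 3 to the front, ...
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]

ranked = ["A", "B", "C", "D", "E"]   # "A" is the most relevant chunk
ordered = order_for_edges(ranked)
print(ordered)
```

The two strongest chunks end up first and last, and the weakest chunk sits in the middle, where a retrieval miss costs the least.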

Inference Optimization in Production

Running a 70B model at scale is an infrastructure problem:

KV-cache: During autoregressive generation, Key/Value matrices for prior tokens are cached — each new token only needs to compute attention against cached KVs, not recompute everything.
Speculative decoding: A smaller “draft” model generates k tokens ahead. The large model verifies them in a single parallel pass, accepting correct ones and regenerating from the first wrong one. Net result: 2-3x speedup with identical output distribution.
Quantization: bf16 → int8 → int4 representations reduce memory and increase throughput with acceptable quality loss. AWQ (Activation-aware Weight Quantization) and GPTQ are common approaches.
Continuous batching: Rather than waiting for all requests in a batch to finish, the server admits new requests as soon as earlier sequences complete. Dramatically improves GPU utilization in production.
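The KV-cache item in the list above is the easiest of these to verify by hand. The sketch below runs single-head attention over a growing sequence twice, once appending each new token's key and value to a cache and once recomputing them for the whole prefix, and checks that the outputs match. Weights and "token embeddings" are random placeholders.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    """Attention output for one query vector against all available keys/values."""
    scores = K @ q / np.sqrt(q.shape[-1])
    return softmax(scores) @ V

rng = np.random.default_rng(0)
d = 8
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
tokens = rng.normal(size=(5, d))      # stand-in embeddings for 5 generated tokens

# With a cache: project each new token once and append its K/V rows.
K_cache, V_cache, cached_outputs = [], [], []
for x in tokens:
    K_cache.append(x @ W_k)
    V_cache.append(x @ W_v)
    cached_outputs.append(attend(x @ W_q, np.array(K_cache), np.array(V_cache)))

# Without a cache: recompute K and V for the whole prefix at every step.
recomputed_outputs = []
for t in range(1, len(tokens) + 1):
    prefix = tokens[:t]
    K, V = prefix @ W_k, prefix @ W_v
    recomputed_outputs.append(attend(prefix[-1] @ W_q, K, V))

print(np.allclose(cached_outputs, recomputed_outputs))
```

The outputs are the same, but the cached version does one K/V projection per step instead of one per prefix token. The price is memory that grows with sequence length.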

One Thing to Remember

The transformer’s attention mechanism is simple enough to implement in ~50 lines of NumPy, but the emergent behavior at scale is genuinely not well understood. When a 100B+ parameter model suddenly gains the ability to do multi-step reasoning that a 10B model can’t do, nobody fully knows why. That’s the part that keeps researchers up at night.

