Natural Language Processing — Deep Dive

Tokenization strategies, attention mechanisms, RLHF, and the actual math behind why NLP went from failing at grammar to passing the bar exam — with the implementation details that textbooks skip.

A Short History of Failure

In 1966, MIT’s Joseph Weizenbaum built ELIZA — a program that could hold a conversation by pattern-matching against scripted responses. Users found it unnervingly convincing. Weizenbaum was disturbed by this and spent the rest of his career warning about the dangers of anthropomorphizing machines.

ELIZA had no understanding whatsoever. It was a fancy regex.

The field spent the next 30 years trying to build actual understanding through rule-based systems: formal grammars, parse trees, semantic networks, ontologies. SHRDLU (1970) could discuss blocks in a simulated world with remarkable fluency. Then you’d move it outside its narrow domain and it would collapse. The rules couldn’t scale.

Statistical NLP (1990s–2010s) shifted the paradigm: instead of encoding knowledge, harvest correlations from data. Naïve Bayes spam filters, n-gram language models, and Hidden Markov Models for speech recognition all emerged from this approach. They worked better, but they were brittle in different ways: sensitive to training domain, weak on rare words and constructions that the counts barely covered, and hard to diagnose when they failed.

The deep learning revolution arrived around 2012 with image recognition. NLP took a few more years to follow, but when it did — around 2017–2019 — the jump was enormous.

Tokenization: More Subtle Than It Sounds

Before any model sees text, that text must become numbers. The tokenization strategy determines vocabulary size, sequence length, and how the model copes with words it has never seen.

Word-level tokenization splits on whitespace. Simple, but the vocabulary explodes (“run”, “runs”, “running”, “ran” are all separate tokens), and out-of-vocabulary (OOV) words can only be mapped to a generic unknown token, which throws their content away.

Character-level tokenization never has OOV problems but creates very long sequences and requires the model to learn spelling from scratch — expensive.

Byte Pair Encoding (BPE), used by GPT models, starts with characters and iteratively merges the most frequent adjacent pairs. After enough merges, common words become single tokens and rare words split into subword units. “Unbelievable” might tokenize as [“un”, “believ”, “able”]. The byte-level variant used from GPT-2 onward starts from the 256 possible byte values rather than characters, so text in any language can be encoded without an unknown token, although languages that are rare in the tokenizer's training corpus typically need more tokens per word.
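The merge loop described above can be sketched in a few lines. This is a minimal character-level illustration, not the tokenizer any GPT model ships with: the toy corpus, the function names, and the alphabetical tie-breaking rule are all assumptions made for the example.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn an ordered list of merge rules from a {word: frequency} mapping."""
    # Every word starts as a tuple of single characters.
    vocab = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Most frequent pair wins; ties are broken alphabetically (an arbitrary choice).
        best = min(pair_counts, key=lambda p: (-pair_counts[p], p))
        merges.append(best)
        vocab = {merge_pair(symbols, best): freq for symbols, freq in vocab.items()}
    return merges


def bpe_tokenize(word, merges):
    """Tokenize a new word by replaying the merges in the order they were learned."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: word -> frequency.
corpus = {"low": 5, "lower": 2, "lowest": 3, "newest": 6, "widest": 3}
merges = learn_bpe_merges(corpus, num_merges=6)

print(bpe_tokenize("lowest", merges))   # → ['low', 'est']
print(bpe_tokenize("slowest", merges))  # → ['s', 'low', 'est']
```

Note that “slowest” never appeared in the corpus, yet it still splits into known pieces plus a single leftover character. That is the property that removes the OOV problem.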

WordPiece — BERT’s approach — works similarly but merges pairs that maximize training corpus likelihood rather than raw frequency.

The practical implication: a 1000-word English paragraph becomes roughly 1300–1500 tokens in GPT-4’s tokenizer. Code, with its unusual character distributions, tokenizes less efficiently — sometimes 2–3× worse than natural English prose. This is why coding tasks eat context windows faster.

Word Embeddings: The Geometric Trick

The fundamental trick underpinning modern NLP is representing words as dense vectors in high-dimensional space such that semantic similarity maps to geometric proximity.

Word2Vec (Google, 2013) was the breakthrough. It trains a shallow neural network on a simple task: predict a word from its context (CBOW) or predict context from a word (Skip-gram). The interesting part is that the task is thrown away — what remains are the weight matrices, which encode something like meaning.

The famous example: vec("King") - vec("Man") + vec("Woman") ≈ vec("Queen"). This isn’t programmed — it emerges from distributional statistics across a large corpus.
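The arithmetic is easy to try on toy data. The sketch below uses hand-made three-dimensional vectors, invented purely for illustration; real Word2Vec vectors are learned from a corpus and have hundreds of dimensions. The `analogy` helper is a hypothetical name, and excluding the three input words from the candidates follows the usual convention for this kind of query.

```python
import math

# Made-up toy vectors; the dimensions loosely stand for (royalty, maleness, femaleness).
toy_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vectors):
    """Return the word whose vector is closest to vec(a) - vec(b) + vec(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in {a, b, c}]
    return max(candidates, key=lambda w: cosine(target, vectors[w]))


print(analogy("king", "man", "woman", toy_vectors))  # → queen
```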

GloVe (Stanford, 2014) approached this differently: instead of a predictive model, it factorizes a global word co-occurrence matrix. The resulting embeddings captured different aspects of meaning and were competitive with Word2Vec on most benchmarks.

The problem with static embeddings: “Bank” has one vector regardless of context. The word means different things in “river bank” vs “blood bank” vs “investment bank.” You need contextual embeddings.

Contextualized Representations: ELMo and Then BERT

ELMo (2018) stacked bidirectional LSTMs and used hidden states from multiple layers as the embedding — different layers captured different levels of abstraction (syntax at lower layers, semantics higher up). For each word, you got a representation that depended on its context. Benchmark improvements were significant.

Then BERT (Google, October 2018) arrived and made ELMo look quaint.

BERT (Bidirectional Encoder Representations from Transformers) trains on two tasks simultaneously:

Masked Language Modeling: randomly mask 15% of tokens, train the model to predict them from context in both directions. Unlike GPT (which was left-to-right autoregressive), BERT sees the full sentence before predicting the masked word.
Next Sentence Prediction: given two sentences, predict whether sentence B follows sentence A in the original text. This teaches discourse understanding.
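The masking step of the first task can be sketched as follows. This is a simplified illustration, not BERT's actual data pipeline: it masks a fixed share of positions with a seeded random generator, whereas the BERT paper also leaves some selected tokens unchanged or swaps them for random tokens. The function name and the `[MASK]` placeholder string are assumptions for the example.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels).

    labels[i] is the original token where position i was masked, else None,
    so the training loss is only computed on masked positions.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    # Mask roughly mask_prob of the positions, but always at least one.
    num_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = set(rng.sample(range(len(tokens)), num_to_mask))
    masked = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return masked, labels


tokens = "the cat sat on the mat because it was tired".split()
masked, labels = mask_tokens(tokens)
print(masked)
print(labels)
```

The model sees `masked` as input and is scored only where `labels` is not `None`, which is what lets it use context on both sides of each gap.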

Training data: BooksCorpus (800M words) + English Wikipedia (2.5B words). Training time for BERT-Large on 64 TPU chips: 4 days. When released, BERT set state-of-the-art on 11 NLP benchmarks simultaneously, in some cases by large margins.

Fine-tuning BERT for a downstream task is straightforward: take the pretrained model, add a task-specific output layer, fine-tune on labeled data for a few epochs. A 2019 Google paper showed you could fine-tune a reasonable sentiment classifier on BERT with as few as 100–1000 labeled examples — far less than training from scratch.

The Transformer Architecture in Detail

The original transformer (Attention Is All You Need, Vaswani et al., 2017) introduced multi-head self-attention as the primary computational mechanism, replacing recurrence entirely.

Self-attention computes, for each token, a weighted sum of the representations of all tokens in the sequence (its own included), where the weights express how “relevant” each token is to the one being updated.

For each head, three projections are learned:

Q (Query): what am I looking for?
K (Key): what do I offer to others?
V (Value): what information do I actually contain?

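In matrix form, with one row of Q, K, and V per token, the attention output is computed as follows (row i of the softmax holds token i's attention weights over all tokens j):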

Attention(Q, K, V) = softmax(QKᵀ / √dₖ) · V

The √dₖ scaling factor prevents the dot products from growing so large that softmax gradients vanish.
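For a single head, the formula can be written out in plain Python. This is a minimal sketch for reading alongside the equation, assuming small lists of floats rather than the batched tensors a real implementation would use; the function names and the example matrices are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention.

    Q and K are lists of d_k-dimensional vectors, V is a list of d_v-dimensional
    vectors, one per token. Returns (outputs, weights).
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # One score per key: dot(q, k) / sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, V)) for c in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Two tokens whose queries match their own keys.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
outputs, weights = attention(Q, K, V)
print(weights)
print(outputs)
```

Each row of `weights` sums to 1, and here each token puts most of its weight on itself because its query lines up with its own key.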

Multi-head attention runs this process in parallel with different learned projections, then concatenates results. Each head learns to attend to different relationships — one head might capture syntactic dependencies, another semantic similarity, another positional structure.

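Positional encoding: unlike RNNs, transformers have no inherent notion of sequence order. Position is injected via sinusoidal encodings added to the input embeddings. This was a deliberate choice: the encoding is a fixed function defined for every position, and the authors hypothesized that it would let the model extrapolate to sequence lengths not seen in training. In practice that extrapolation is weak, which is one reason later models moved to learned, relative, or rotary position schemes.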
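The sinusoidal scheme from the original paper uses sine on even dimensions and cosine on odd ones, with wavelengths that grow geometrically across the vector. A minimal sketch, assuming an even `d_model` and a hypothetical function name:

```python
import math


def sinusoidal_encoding(position, d_model):
    """Positional encoding vector for one position (d_model assumed even).

    PE(pos, 2k)   = sin(pos / 10000^(2k / d_model))
    PE(pos, 2k+1) = cos(pos / 10000^(2k / d_model))
    """
    encoding = []
    for even_index in range(0, d_model, 2):
        angle = position / (10000 ** (even_index / d_model))
        encoding.append(math.sin(angle))
        encoding.append(math.cos(angle))
    return encoding


print(sinusoidal_encoding(0, 4))  # → [0.0, 1.0, 0.0, 1.0]
print(sinusoidal_encoding(1, 4))
```

Because the function needs no lookup table, it produces a vector for any position, including ones longer than anything seen in training. Whether the model makes good use of those vectors is a separate question.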

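The full transformer block, in the pre-norm arrangement that GPT-2 and most later models use:

Input → Layer Norm → Multi-Head Self-Attention → Residual → Layer Norm → FFN → Residual → Output

The original paper used post-norm instead, applying Layer Norm after each residual addition rather than before each sublayer.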

The feed-forward network (FFN) is a simple two-layer MLP applied position-wise. It’s often described as the “memory” of the model, where factual associations are stored, while attention handles routing.

Scaling Laws and What They Revealed

In 2020, OpenAI published “Scaling Laws for Neural Language Models.” The finding was blunt: loss decreases predictably as a power law of model size, dataset size, and compute budget. No ceiling in sight.

This implied that you could predict, with reasonable accuracy, how good a model would be before training it — just from parameter count and data. More parameters → better model, reliably.

DeepMind pushed back in 2022 with the Chinchilla paper. The argument: OpenAI and Google had been training models that were too large for their compute budgets. For a given compute budget, you should train a smaller model on more tokens. Chinchilla (70B params, 1.4T tokens) outperformed Gopher (280B params, 300B tokens) on the same compute budget. The optimal token-to-parameter ratio is roughly 20:1.
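The numbers are easy to check with back-of-the-envelope arithmetic. The sketch below assumes the widely used approximation that training costs about 6 FLOPs per parameter per token; that constant and the helper names are assumptions for illustration, not figures quoted from either paper.

```python
import math

TOKENS_PER_PARAM = 20  # the rough rule of thumb quoted above


def training_flops(params, tokens):
    """Approximate training compute: about 6 FLOPs per parameter per token."""
    return 6 * params * tokens


def compute_optimal_split(flops_budget, tokens_per_param=TOKENS_PER_PARAM):
    """Given a FLOP budget, return (params, tokens) at the chosen token-to-parameter ratio.

    Solves 6 * N * (ratio * N) = budget for N.
    """
    params = math.sqrt(flops_budget / (6 * tokens_per_param))
    return params, tokens_per_param * params


gopher = training_flops(280e9, 300e9)      # about 5.0e23 FLOPs
chinchilla = training_flops(70e9, 1.4e12)  # about 5.9e23 FLOPs
print(f"Gopher:     {gopher:.2e} FLOPs")
print(f"Chinchilla: {chinchilla:.2e} FLOPs")

params, tokens = compute_optimal_split(chinchilla)
print(f"20:1 split of that budget: {params:.2e} params, {tokens:.2e} tokens")
```

Under this approximation the two training runs land in the same ballpark, and splitting Chinchilla's budget at 20:1 gives back roughly 70B parameters and 1.4T tokens.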

Llama 3 and later models push this further: smaller models trained on far more tokens than the 20:1 ratio suggests, because a small model that has seen more data is cheaper to serve and can still outperform larger predecessors.

RLHF: Why ChatGPT Feels Different from GPT-3

GPT-3 was powerful but chaotic. Ask it to write a helpful email and it might generate a plausible email — or a different perspective on your request — or a short story about email — because it’s a next-token predictor, not a helpful assistant.

Reinforcement Learning from Human Feedback (RLHF) is the technique that closed most of this gap. It runs in three phases.

Phase 1 — Supervised Fine-Tuning (SFT): collect examples of good model behavior (human-written responses to prompts), fine-tune the base model on these. The model learns the style of helpfulness.

Phase 2 — Reward Model Training: present human raters with pairs of model outputs for the same prompt, ask which is better. Use these preferences to train a separate “reward model” that can score any response.

Phase 3, PPO: use Proximal Policy Optimization (a reinforcement learning algorithm) to fine-tune the language model to maximize scores from the reward model, while a KL-divergence penalty keeps it close to the SFT model (to limit reward hacking).
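The two quantities at the heart of phases 2 and 3 are small enough to write down. The sketch below shows a pairwise preference loss of the kind commonly used for reward models, and the reward signal that the RL step then maximizes. It is an illustration under stated assumptions, not a PPO implementation: the function names, the `beta` value, and the example log-probabilities are hypothetical.

```python
import math


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Reward-model loss for one comparison: -log(sigmoid(r_chosen - r_rejected)).

    Written as softplus(-margin) in a form that does not overflow for large margins.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def kl_penalized_reward(reward_score, policy_logprobs, sft_logprobs, beta=0.1):
    """Reward-model score minus beta times a sampled estimate of KL(policy || SFT).

    The two lists hold per-token log-probabilities of the same sampled response
    under the model being trained and under the frozen SFT model.
    """
    if len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("log-prob lists must cover the same tokens")
    kl_estimate = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return reward_score - beta * kl_estimate


# The loss is small when the reward model already ranks the preferred response higher.
print(pairwise_preference_loss(2.0, 0.0))
print(pairwise_preference_loss(0.0, 2.0))

# A response the policy likes far more than the SFT model did gets its reward reduced.
print(kl_penalized_reward(1.0, [-0.1, -0.2], [-1.0, -1.2]))
```

The penalty term is what stops the policy from drifting toward strange text that happens to score well with the reward model.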

The result is a model that’s strongly biased toward responses humans rate as helpful, harmless, and honest — not just plausible continuations of text.

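InstructGPT (2022) demonstrated this with a remarkable result: human raters preferred outputs from a 1.3B parameter RLHF-tuned model over outputs from the raw 175B GPT-3, despite the 100× gap in parameter count. At equal size, the 175B InstructGPT model was preferred over 175B GPT-3 85% of the time, and 71% of the time over a few-shot-prompted GPT-3. On this measure, alignment methods were worth ~100× in parameter count.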

Current Hard Problems

Hallucination remains unsolved at a fundamental level. Language models generate the most likely continuation of a sequence — they don’t have a separate “truth verification” step. The model has no way to know it’s confabulating. Retrieval-augmented generation (RAG) mitigates this by grounding responses in retrieved documents, but doesn’t eliminate it.

Long-context reasoning is still unreliable. Models with 128K+ context windows can technically see hundreds of pages, but performance on tasks requiring reasoning over the full context degrades — the “lost in the middle” problem. Attention over long sequences becomes noisy.

Compositional generalization — can the model correctly handle a novel combination of concepts it’s seen separately? Humans do this effortlessly. Models often fail at systematic recombination.

Multilingual parity is nowhere close. English, Chinese, and Spanish are well-represented in training data. Twi, Mongolian, and dozens of other languages are functionally undertrained. The model can speak them, but performance is significantly degraded.

Implementation Notes

If you’re building an NLP application today:

# Hugging Face is the de facto standard library
from transformers import pipeline

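# Sentiment analysis (with no model argument, the pipeline falls back to a
# DistilBERT model fine-tuned on SST-2; pin a model explicitly in production)
classifier = pipeline("sentiment-analysis")
result = classifier("The service was outstanding but the food arrived cold.")
# → e.g. [{'label': 'POSITIVE', 'score': 0.73}]
#   (illustrative; the exact score depends on the model version)
# Note: aspect-level nuance is lost here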

# Named entity recognition
ner = pipeline("ner", aggregation_strategy="simple")
entities = ner("Elon Musk founded SpaceX in 2002 in El Segundo, California.")
# → [{'entity_group': 'PER', 'word': 'Elon Musk', ...},
#    {'entity_group': 'ORG', 'word': 'SpaceX', ...}, ...]

For production applications: sentence-transformers for embeddings (faster than full BERT), pgvector or Pinecone for vector search, and a carefully chosen chunking strategy for RAG. Most of the engineering effort in a real NLP system is in data cleaning and prompt engineering, not model selection.
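As a starting point for the chunking step, here is a minimal sketch of fixed-size chunks with overlap. It assumes whitespace-separated words as the unit and made-up default sizes; real systems often chunk by tokens, sentences, or document structure instead, and the function name is hypothetical.

```python
def chunk_words(text, chunk_size=200, overlap=50):
    """Split text into chunks of up to chunk_size words.

    Consecutive chunks share `overlap` words, so a sentence that straddles a
    boundary still appears whole in at least one chunk.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
# → ['w0 w1 w2 w3', 'w3 w4 w5 w6', 'w6 w7 w8 w9']
```

Chunk size and overlap are worth tuning against your own retrieval quality, since chunks that are too small lose context and chunks that are too large dilute the match.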

One Thing to Remember

The transformer’s self-attention mechanism — looking at all tokens in relation to all other tokens simultaneously — is what broke the scaling ceiling that had constrained NLP for decades. Everything since 2017 is a consequence of that one architectural choice.
