Transformer Architecture — Core Concepts

The Paper That Changed Everything

In June 2017, eight researchers, most of them at Google, published a paper titled “Attention Is All You Need.” Most papers like it get cited a few hundred times and drift into obscurity.

This one has been cited over 100,000 times. Its architecture underpins GPT, BERT, Gemini, Claude, and most other major language models you’ve heard of. If there’s a single document you could point to as the origin of the modern AI boom, this is it.

So what did it actually say?

The Problem With Reading Left to Right

Before transformers, AI handled language with something called a recurrent neural network (RNN). The name tells you everything: it was recurrent, meaning it processed words in sequence, one at a time, feeding the output of each step into the next.

This sounds logical — that’s how you’d read a sentence. But it created a brutal problem: long-range dependencies.

Try this sentence: “The keys to the cabinet that the owner of the house kept near the door were missing.”

By the time an RNN reached “were missing,” it had already half-forgotten “keys.” The subject of the sentence — the thing the verb referred to — was buried under too many steps of processing. Like trying to remember a phone number while doing mental math.
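To make the forgetting concrete, here is a deliberately tiny sketch. It is not a real RNN: the single `decay` weight is a made-up assumption, and real networks learn their weights and use nonlinearities. It only shows how a signal that must pass through every intermediate step can shrink along the way.

```python
def final_state(inputs, decay=0.5):
    """Toy linear recurrence: h_t = decay * h_(t-1) + x_t.

    `decay` is a hypothetical fixed weight chosen for illustration.
    """
    h = 0.0
    for x in inputs:  # strictly one step at a time: step t needs step t-1
        h = decay * h + x
    return h


def influence_of_first_input(gap, decay=0.5):
    """How much the final state changes if the first input changes by 1,
    when `gap` further steps come after it."""
    with_signal = final_state([1.0] + [0.0] * gap, decay)
    without_signal = final_state([0.0] + [0.0] * gap, decay)
    return with_signal - without_signal


for gap in (1, 5, 15):
    print(gap, influence_of_first_input(gap))
```

The trace left by the first input halves with every extra step, so after fifteen steps it is close to nothing. In this toy, that is what happens to “keys” by the time “were missing” arrives.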

Transformers threw out the left-to-right rule entirely.

What Attention Actually Does

The core mechanism is called self-attention. Here’s the intuition without the math:

Every word in a sentence gets to “look at” every word in it, itself included, and decide how relevant each one is. The output for each word is a weighted mix of information from all of them, where the weights reflect relevance.

For “The animal didn’t cross the street because it was too tired”:

  • “it” pays high attention to “animal”
  • “it” pays low attention to “street”
  • The model resolves the ambiguity by pulling context from across the sentence

This happens in parallel for all words simultaneously — which is also why transformers train faster than RNNs on modern GPUs (which are built to do exactly this kind of parallel math).
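For readers who want to see the weighted mix spelled out, here is a minimal sketch of scaled dot-product attention in plain Python. It makes two loud assumptions: the 2-number vectors for each token are invented for illustration, and the same vectors are reused as queries, keys, and values, whereas a real transformer first passes them through three learned projections.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every key to this query, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output = weighted mix of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical vectors; a trained model would learn these.
tokens = ["animal", "street", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

outputs, weights = self_attention(vectors, vectors, vectors)
it_weights = dict(zip(tokens, weights[2]))
print(it_weights)
```

Because the made-up vector for “it” points in nearly the same direction as the one for “animal”, “it” gives “animal” more weight than “street”. The loop over queries is written out for readability; in practice each row is independent of the others, which is what lets a GPU compute them all at once as matrix multiplications.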

The Building Blocks

A transformer has two main components:

Encoder — reads the input and builds a rich understanding of it. Used when you need to understand text (classification, translation input, summarization input). BERT from Google is encoder-only.

Decoder — generates output one token at a time, attending to both the input and what it’s already generated. GPT models are decoder-only — they predict the next word, over and over.

The original architecture was an encoder-decoder, built for translation. Most modern language models are decoder-only, because generating text is fundamentally a next-token prediction problem.
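The “predict the next token, over and over” loop can be sketched in a few lines. Everything model-like here is a stand-in: `NEXT_TOKEN` is a hypothetical lookup table, and `predict_next` only looks at the last token, whereas a real decoder attends over everything generated so far.

```python
# Hypothetical lookup table standing in for a trained model.
NEXT_TOKEN = {
    "<start>": "the",
    "the": "cat",
    "cat": "sat",
    "sat": "<end>",
}


def predict_next(tokens):
    """Stand-in for a decoder's next-token prediction."""
    return NEXT_TOKEN.get(tokens[-1], "<end>")


def generate(prompt, max_new_tokens=10):
    """Append one predicted token at a time until <end> or the length limit."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        nxt = predict_next(tokens)
        if nxt == "<end>":
            break
        tokens.append(nxt)  # the new token becomes input for the next step
    return tokens


print(generate(["<start>"]))  # ['<start>', 'the', 'cat', 'sat']
```

The part that carries over to real systems is the shape of the loop: each output is fed back in as input, so generation is sequential even though the processing inside each step is parallel.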

What’s Inside Each Layer

Component | What it does
Self-Attention | Words “look at” each other to gather context
Multi-Head Attention | Run attention multiple times with different “lenses”
Feed-Forward Network | Process each position’s output independently
Layer Normalization | Stabilize training
Positional Encoding | Tell the model where each word sits in the sequence (in the original design, added once at the input rather than in every layer)

The largest GPT-3 model stacks 96 of these layers on top of each other. GPT-4’s layer count has not been made public.
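Two of these components are small enough to sketch directly. The code below is a simplified illustration, not any model’s actual implementation: the weights `W1` and `W2` are hypothetical, bias terms and the learned scale and shift in layer normalization are left out, and the ordering (add the input back, then normalize) follows the original paper’s layout.

```python
import math


def layer_norm(vector, eps=1e-5):
    """Rescale one position's vector to zero mean and roughly unit variance."""
    mean = sum(vector) / len(vector)
    var = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(var + eps) for x in vector]


def feed_forward(vector, w1, w2):
    """Two linear maps with a ReLU in between, applied to a single position."""
    hidden = [max(0.0, sum(x * w for x, w in zip(vector, row))) for row in w1]
    return [sum(h * w for h, w in zip(hidden, row)) for row in w2]


def ffn_sublayer(positions, w1, w2):
    """Feed-forward, residual add, then layer norm, one position at a time."""
    out = []
    for x in positions:
        y = feed_forward(x, w1, w2)
        out.append(layer_norm([a + b for a, b in zip(x, y)]))
    return out


# Hypothetical weights mapping 2 -> 3 -> 2 dimensions.
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
W2 = [[0.5, 0.0, 0.5], [0.0, 0.5, -0.5]]
sequence = [[1.0, 2.0], [3.0, -1.0]]

result = ffn_sublayer(sequence, W1, W2)
print(result)
```

Note that `ffn_sublayer` never lets one position see another. Mixing information across positions is the attention step’s job; the feed-forward step then processes each position on its own.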

The Positional Encoding Trick

One weird implication of reading everything at once: the model has no idea what order the words are in. “Dog bites man” and “Man bites dog” look identical to pure attention.

The fix is positional encoding — adding a special signal to each word’s representation that encodes its position in the sequence. In the original paper this was done with sine and cosine waves of different frequencies. It sounds arcane, but it works surprisingly well.
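Here is a minimal sketch of the sine-and-cosine scheme, using the formula from the original paper: even-numbered slots get a sine, odd-numbered slots get a cosine, and the wavelength grows with the slot index. The 4-number word embeddings are invented for illustration.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding for one position: sin on even slots, cos on odd."""
    encoding = []
    for i in range(d_model):
        pair = i // 2  # slots 2k and 2k+1 share one frequency
        angle = position / (10000 ** (2 * pair / d_model))
        encoding.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return encoding


def add_positions(embeddings):
    """Add each position's encoding to the embedding sitting at that position."""
    d_model = len(embeddings[0])
    return [
        [e + p for e, p in zip(vec, positional_encoding(pos, d_model))]
        for pos, vec in enumerate(embeddings)
    ]


# Hypothetical embeddings; real ones are learned and much longer.
dog = [1.0, 0.0, 0.0, 0.0]
bites = [0.0, 1.0, 0.0, 0.0]
man = [0.0, 0.0, 1.0, 0.0]

dog_bites_man = add_positions([dog, bites, man])
man_bites_dog = add_positions([man, bites, dog])
print(dog_bites_man[0])  # "dog" at position 0
print(man_bites_dog[2])  # "dog" at position 2
```

The same word now produces a different vector depending on where it sits, so the two sentences no longer look alike to the attention step.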

Common Misconception: “Transformers Understand Language”

They don’t — at least not the way you do. A transformer learns statistical patterns: which tokens tend to follow which other tokens, in which contexts. It never “reads for meaning” in any human sense.

This is worth remembering when AI confidently says wrong things. It isn’t lying — it’s pattern-matching to something plausible that turns out to be false. See: AI Hallucinations.

Why This Scaled So Well

The timing mattered. Attention mechanisms were known before 2017 — but transformers paired them with something the world had just made cheap: massive parallel compute via GPUs.

Every word attending to every other word is expensive — it scales quadratically with sequence length. But for the sentence lengths researchers cared about in 2017, GPUs could handle it. And as GPU clusters got bigger, models got bigger. The transformer architecture turned out to scale almost predictably: more compute → better models.
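The quadratic cost is easy to check with arithmetic. This sketch counts only the pairwise scores for a single attention head in a single layer and ignores everything else a model computes; the sequence lengths are arbitrary examples.

```python
def attention_score_count(sequence_length):
    """Number of (query, key) pairs one attention head scores in one layer."""
    return sequence_length * sequence_length


for n in (512, 1024, 2048):
    print(n, attention_score_count(n))
```

Doubling the sequence length from 512 to 1,024 tokens quadruples the number of scores, from 262,144 to 1,048,576. That is the price of letting every word attend to every other word.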

No other architecture did that as cleanly.

One thing to remember: Transformers don’t read language the way you do — they read it all at once and figure out which parts matter to which other parts. That parallelism is why they trained faster, scaled further, and eventually became the foundation for every AI product that’s made you go “wait, that actually works.”

See Also

  • AI Hallucinations: ChatGPT sometimes makes up facts with total confidence. Here's the weird reason why, and why it's not as simple as 'the AI lied.'
  • Artificial Intelligence: What is AI really? Think of it as a dog that learned tricks: impressive, but it doesn't know why it's doing them.
  • Bias-Variance Tradeoff: The fundamental tension in machine learning between being wrong in the same way vs. being wrong in different ways, and why the simplest model isn't always best.
  • Deep Learning: Why your phone can spot your face in a messy photo album, and why that trick comes from practice, not magic.
  • Embeddings: How do computers know that 'dog' and 'puppy' mean almost the same thing? They don't read definitions. They turn words into secret map coordinates, and nearby coordinates mean nearby meanings.