Transformer Architecture — Core Concepts

The architecture behind nearly every major AI language model, explained without PhD-level math. Why 'attention' matters, what encoders and decoders do, and why 2017 was the year AI grew up.

The Paper That Changed Everything

In June 2017, eight Google researchers published a paper titled “Attention Is All You Need.” Most papers like it get cited a few hundred times and drift into obscurity.

This one has been cited over 100,000 times. It launched GPT, BERT, Gemini, Claude, and every other major AI system you’ve heard of. If there’s a single document you could point to as the origin of the modern AI boom, this is it.

So what did it actually say?

The Problem With Reading Left to Right

Before transformers, AI handled language with something called a recurrent neural network (RNN). The name tells you everything: it was recurrent, meaning it processed words in sequence, one at a time, feeding the output of each step into the next.

This sounds logical — that’s how you’d read a sentence. But it created a brutal problem: long-range dependencies.

Try this sentence: “The keys to the cabinet that the owner of the house kept near the door were missing.”

By the time an RNN reached “were missing,” it had already half-forgotten “keys.” The subject of the sentence — the thing the verb referred to — was buried under too many steps of processing. Like trying to remember a phone number while doing mental math.
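The forgetting effect can be illustrated with a toy model. This is a minimal sketch, not a real trained RNN: it uses a single hidden number, and the weights `w_h` and `w_x`, the filler value, and the function names are all made up for illustration. It measures how much the final state changes when only the first input changes.

```python
import math

def final_state(inputs, w_h=0.5, w_x=1.0):
    """Run a one-unit toy RNN over the inputs and return its last hidden state."""
    h = 0.0
    for x in inputs:
        # Each step blends the previous state with the new input,
        # so older information is squeezed a little more every time.
        h = math.tanh(w_h * h + w_x * x)
    return h

def first_word_influence(length, filler=0.1):
    """How much the final state moves when only the first input changes."""
    with_word = final_state([1.0] + [filler] * (length - 1))
    without_word = final_state([0.0] + [filler] * (length - 1))
    return abs(with_word - without_word)

# Assumed sequence lengths, chosen only to show the trend.
for n in (2, 10, 20):
    print(n, first_word_influence(n))
```

In this toy setup the first input's influence shrinks at every step, which is the same pressure that makes "keys" hard to recall by the time "were missing" arrives.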

Transformers threw out the left-to-right rule entirely.

What Attention Actually Does

The core mechanism is called self-attention. Here’s the intuition without the math:

Every word in a sentence gets to “look at” every word in it, itself included, and decide how relevant each one is. The output for each word is a weighted mix of information from all of those words, where the weights reflect relevance.

For “The animal didn’t cross the street because it was too tired”:

“it” pays high attention to “animal”
“it” pays low attention to “street”
The model resolves the ambiguity by pulling context from across the sentence

This happens in parallel for all words simultaneously — which is also why transformers train faster than RNNs on modern GPUs (which are built to do exactly this kind of parallel math).
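For readers who want to see the weighted mix concretely, here is a minimal sketch of scaled dot-product self-attention in plain Python. It makes two loud simplifications: each vector acts as its own query, key, and value (real models use learned projections for each role), and the 2-d vectors are hand-picked, hypothetical values chosen so that "it" points in nearly the same direction as "animal". Real implementations also do this as one matrix multiplication rather than a loop.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Return (outputs, weights) for a list of equal-length token vectors.

    Simplification: each vector serves as its own query, key, and value.
    """
    d = len(vectors[0])
    outputs, weights = [], []
    for q in vectors:
        # Relevance score of every token (including q itself) to q.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        row = softmax(scores)
        weights.append(row)
        # Output = weighted mix of all token vectors.
        outputs.append([sum(w * v[j] for w, v in zip(row, vectors)) for j in range(d)])
    return outputs, weights

# Hypothetical vectors, not taken from any trained model.
tokens = ["animal", "street", "it"]
vectors = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
outputs, weights = self_attention(vectors)
it_weights = dict(zip(tokens, weights[2]))
print(it_weights)
```

With these made-up vectors, "it" gives more weight to "animal" than to "street", mirroring the example sentence.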

The Building Blocks

A transformer has two main components:

Encoder — reads the input and builds a rich understanding of it. Used when you need to understand text (classification, translation input, summarization input). BERT from Google is encoder-only.

Decoder — generates output one token at a time, attending to both the input and what it’s already generated. GPT models are decoder-only — they predict the next word, over and over.

The original architecture was an encoder-decoder (built for translation). Modern language models are mostly decoder-only, because generating text is fundamentally a next-token prediction problem.
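One detail makes next-token prediction work in a decoder: each position may only attend to itself and earlier positions, never to tokens that have not been generated yet. A minimal sketch of that causal masking, assuming a square matrix of raw attention scores (the function name and the all-zero example scores are illustrative only):

```python
import math

def causal_attention_weights(scores):
    """Softmax each row of a square score matrix, hiding future positions."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Positions after i are set to -inf, so they get exactly zero weight.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(masked)
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Assumed example: three tokens with equal raw scores everywhere.
uniform = [[0.0] * 3 for _ in range(3)]
weights = causal_attention_weights(uniform)
for row in weights:
    print(row)
```

The first token can only see itself, the second splits its attention over two tokens, and so on down the sequence.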

What’s Inside Each Layer

Component	What it does
Self-Attention	Words “look at” each other to gather context
Multi-Head Attention	Run attention multiple times with different “lenses”
Feed-Forward Network	Process each position’s output independently
Layer Normalization	Stabilize training
Positional Encoding	Tell the model where each word sits in the sequence (in the original design, added once at the input rather than inside every layer)

The largest GPT-3 model stacks 96 of these layers on top of each other. OpenAI has not published GPT-4's layer count.
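Of the components in the table, layer normalization is the easiest to show in a few lines. The sketch below is a simplified version: it rescales one position's vector to roughly zero mean and unit variance, and it leaves out the learned scale and shift parameters that real implementations include. The `eps` value is an assumed small constant to avoid dividing by zero.

```python
import math

def layer_norm(vector, eps=1e-5):
    """Rescale a vector to roughly zero mean and unit variance."""
    mean = sum(vector) / len(vector)
    variance = sum((x - mean) ** 2 for x in vector) / len(vector)
    return [(x - mean) / math.sqrt(variance + eps) for x in vector]

normalized = layer_norm([1.0, 2.0, 3.0, 4.0])
scaled_up = layer_norm([10.0, 20.0, 30.0, 40.0])
print(normalized)
print(scaled_up)
```

Both inputs come out almost identical after normalization. Keeping values in a predictable range across dozens of stacked layers is the sense in which this step stabilizes training.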

The Positional Encoding Trick

One weird implication of reading everything at once: the model has no idea what order the words are in. “Dog bites man” and “Man bites dog” look identical to pure attention.

The fix is positional encoding — adding a special signal to each word’s representation that encodes its position in the sequence. In the original paper this was done with sine and cosine waves of different frequencies. It sounds arcane, but it works surprisingly well.
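The sine-and-cosine scheme is short enough to write out. This sketch follows the formula from the original paper: even dimensions use sine, odd dimensions use cosine, and the wavelength grows with the dimension index. The toy word vectors and the `add_position` helper are hypothetical, there only to show that the same word gets a different representation at a different position.

```python
import math

def positional_encoding(position, d_model):
    """Sinusoidal position signal for one position, as a list of d_model floats."""
    encoding = []
    for i in range(0, d_model, 2):
        # Lower dimensions oscillate quickly, higher ones slowly.
        angle = position / (10000 ** (i / d_model))
        encoding.append(math.sin(angle))
        if i + 1 < d_model:
            encoding.append(math.cos(angle))
    return encoding

def add_position(word_vectors, d_model):
    """Add each position's signal to the word vector sitting at that position."""
    return [
        [x + p for x, p in zip(vec, positional_encoding(pos, d_model))]
        for pos, vec in enumerate(word_vectors)
    ]

# Made-up 4-d word vectors for illustration.
dog, bites, man = [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]
dog_first = add_position([dog, bites, man], 4)[0]  # "Dog bites man"
dog_last = add_position([man, bites, dog], 4)[2]   # "Man bites dog"
print(dog_first)
print(dog_last)
```

Because "dog" now carries a different signal in each sentence, attention can tell the two word orders apart.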

Common Misconception: “Transformers Understand Language”

They don’t — at least not the way you do. A transformer learns statistical patterns: which tokens tend to follow which other tokens, in which contexts. It never “reads for meaning” in any human sense.

This is worth remembering when AI confidently says wrong things. It isn’t lying — it’s pattern-matching to something plausible that turns out to be false. See: AI Hallucinations.

Why This Scaled So Well

The timing mattered. Attention mechanisms were known before 2017 — but transformers paired them with something the world had just made cheap: massive parallel compute via GPUs.

Every word attending to every other word is expensive — it scales quadratically with sequence length. But for the sentence lengths researchers cared about in 2017, GPUs could handle it. And as GPU clusters got bigger, models got bigger. The transformer architecture turned out to scale almost predictably: more compute → better models.
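The quadratic cost is easy to check with arithmetic. A small sketch, counting one attention score per pair of tokens in a single attention head (the sequence lengths are arbitrary examples, and the function name is illustrative):

```python
def attention_pairs(sequence_length):
    """Number of token-to-token scores one attention head computes."""
    return sequence_length * sequence_length

# Assumed example lengths, each double the previous one.
for n in (512, 1024, 2048):
    print(n, attention_pairs(n))
```

Doubling the sequence length quadruples the number of scores, which is why long inputs get expensive quickly.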

No other architecture did that as cleanly.

One thing to remember: Transformers don’t read language the way you do — they read it all at once and figure out which parts matter to which other parts. That parallelism is why they trained faster, scaled further, and eventually became the foundation for every AI product that’s made you go “wait, that actually works.”
