Tokenization — Core Concepts
What Tokenization Actually Is
Every language model needs to convert human language into numbers before it can do anything with it. You can’t feed a neural network the string “Hello world” — you need numbers. Tokenization is the process that bridges those two worlds.
A token is the basic unit a model works with. It’s not a character, and it’s usually not a whole word. It’s somewhere in between — a fragment that balances two competing goals:
- Vocabulary shouldn’t be too large (hard to learn, too many parameters)
- Vocabulary shouldn’t be too small (common words split into too many pieces)
Once trained, the tokenizer behaves much like a lookup table over a fixed vocabulary: it splits text into known pieces and converts them into a sequence of integer IDs. “The cat sat” might become [464, 3797, 3332]. Everything downstream (attention, generation, the whole model) operates on those integers.
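The text-to-IDs step can be sketched with a toy vocabulary. The three IDs are the ones quoted above; pairing them with the strings "The", " cat", and " sat" is an assumption for illustration, and greedy longest-match is a simplification of how a real BPE tokenizer picks its pieces.

```python
# Toy vocabulary. The IDs come from the example above; the token
# strings they map to are assumed for illustration.
vocab = {"The": 464, " cat": 3797, " sat": 3332}
id_to_token = {token_id: token for token, token_id in vocab.items()}


def encode(text: str) -> list[int]:
    """Convert text to IDs by greedy longest-match against the toy vocabulary."""
    ids = []
    pos = 0
    while pos < len(text):
        candidates = [t for t in vocab if text.startswith(t, pos)]
        if not candidates:
            raise ValueError(f"no token covers position {pos}")
        best = max(candidates, key=len)
        ids.append(vocab[best])
        pos += len(best)
    return ids


def decode(ids: list[int]) -> str:
    """Map IDs back to their token strings and join them."""
    return "".join(id_to_token[i] for i in ids)


ids = encode("The cat sat")
print(ids)          # [464, 3797, 3332]
print(decode(ids))  # The cat sat
```

Decoding is just the reverse lookup, which is why the model's output can always be turned back into text.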
How Modern Tokenizers Work: BPE
The dominant method today is Byte Pair Encoding (BPE), originally a data compression algorithm from 1994 that got repurposed for NLP.
Here’s the gist:
- Start with every character (or, in byte-level variants, every byte) as its own token
- Find the most frequent pair of adjacent tokens
- Merge that pair into a new single token
- Repeat thousands of times
After enough merges, common words like “the” become single tokens. Common word-parts like “ing”, “un-”, “-tion” get their own tokens. Rare or domain-specific words get split into pieces.
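The merge loop described above can be sketched in a few lines. This is a minimal character-level version under simplifying assumptions: it trains on a short list of words, ignores byte-level handling and pre-tokenization, and the name `train_bpe` is made up for this example.

```python
from collections import Counter


def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` BPE merges from a list of words."""
    # Each distinct word becomes a tuple of symbols, weighted by its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single token
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges


merges = train_bpe(["the", "the", "the", "then", "this"], num_merges=2)
print(merges)  # [('t', 'h'), ('th', 'e')]
```

After only two merges on this tiny corpus, "the" is already a single token, while "this" is still split into "th", "i", "s".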
GPT-4’s tokenizer (cl100k_base) has a vocabulary of 100,277 tokens. Compare that to GPT-2’s 50,257 — the growth reflects broader multilingual support and better compression efficiency.
The Numbers That Actually Matter
Most people are fuzzy on the token-to-word ratio. Here’s a useful benchmark:
| Language | Tokens per word (approx.) |
|---|---|
| English | ~1.3 |
| Spanish | ~1.5 |
| German | ~1.6 |
| Chinese | ~2–3 |
| Arabic | ~2–3 |
| Code (Python) | ~0.8–1.2 |
English is tokenizer-advantaged, and this is a real problem: a user writing in a language such as Japanese can burn through their context window roughly twice as fast as someone writing in English, for the same conceptual content.
Code often tokenizes more efficiently than English, because programming keywords like def, for, return appear so frequently that they get efficient single tokens.
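A ratio like the ones in the table can be measured for any tokenizer by dividing the token count by the word count. The sketch below assumes whitespace-separated words (which does not work for languages such as Chinese) and uses a stand-in `chunk_tokenizer` invented for this example; swap in a real tokenizer's encode function to get meaningful numbers.

```python
from typing import Callable


def tokens_per_word(text: str, tokenize: Callable[[str], list[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return len(tokenize(text)) / len(words)


def chunk_tokenizer(text: str, size: int = 4) -> list[str]:
    """Stand-in tokenizer (hypothetical): cuts each word into pieces of at most `size` characters."""
    return [
        word[i:i + size]
        for word in text.split()
        for i in range(0, len(word), size)
    ]


ratio = tokens_per_word("tokenization is useful", chunk_tokenizer)
print(ratio)  # 2.0
```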
Common Misconception: One Token = One Word
This causes real confusion. Consider how GPT-4’s tokenizer handles some examples:
"tokenization"→["token", "ization"]— 2 tokens"indistinguishable"→["ind", "isting", "u", "ishable"]— 4 tokens"ChatGPT"→["Chat", "G", "PT"]— 3 tokens" hello"(with a leading space) → different token than"hello"without one
That last one surprises people. A space before a word is often encoded into the token itself. " the" and "the" may be different token IDs. This has practical consequences: if you split on spaces before tokenizing, you’ll get different results than if you don’t.
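The effect of pre-splitting on spaces can be reproduced with a toy vocabulary in which " cat" and "cat" are separate entries. The vocabulary, the IDs, and the greedy longest-match rule are all assumptions made for this sketch; a real tokenizer has a far larger vocabulary and a different matching procedure, but the mismatch arises the same way.

```python
# Toy vocabulary (made up): words with and without a leading space get different IDs.
TOY_VOCAB = {"the": 1, " the": 2, "cat": 3, " cat": 4, " ": 5}


def encode(text: str) -> list[int]:
    """Greedy longest-match encoding against the toy vocabulary."""
    ids = []
    pos = 0
    while pos < len(text):
        candidates = [t for t in TOY_VOCAB if text.startswith(t, pos)]
        if not candidates:
            raise ValueError(f"no token covers position {pos}")
        best = max(candidates, key=len)
        ids.append(TOY_VOCAB[best])
        pos += len(best)
    return ids


# Tokenizing the whole string keeps the space attached to "cat".
whole = encode("the cat")

# Splitting on spaces first throws the space away before tokenizing.
pre_split = [i for word in "the cat".split(" ") for i in encode(word)]

print(whole)      # [1, 4]
print(pre_split)  # [1, 3]
```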
Why Tokenization Shapes Model Behavior
Most people think tokenization is just an input preprocessing step. It’s actually much more consequential:
Spelling and character-level tasks are hard. Ask an LLM to count the letters in “strawberry” and it’ll frequently fail. It’s not seeing individual characters; it’s seeing chunks such as ["st", "raw", "berry"] (the exact split depends on the tokenizer) and has to reconstruct character-level reasoning from that. It’s like asking someone to count syllables in a language they’ve never heard.
Numbers get split unpredictably. The number 1234567 might tokenize as ["12", "345", "67"]. This makes arithmetic genuinely difficult at the token level — models have to reassemble number meaning from arbitrary chunks.
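One way to see why chunking hurts arithmetic is to compare a left-to-right split with the place-value grouping people use. The sketch below assumes a tokenizer that cuts digit runs into pieces of at most three digits from the left; that rule is an illustration only, and actual splits vary by tokenizer.

```python
import re


def chunk_digits_left(number: str, size: int = 3) -> list[str]:
    """Split a digit string left to right into pieces of at most `size` digits."""
    return re.findall(rf"\d{{1,{size}}}", number)


def chunk_digits_right(number: str, size: int = 3) -> list[str]:
    """Split a digit string so that groups align with place value (thousands, millions)."""
    head = len(number) % size
    chunks = [number[:head]] if head else []
    chunks += [number[i:i + size] for i in range(head, len(number), size)]
    return chunks


print(chunk_digits_left("1234567"))   # ['123', '456', '7']
print(chunk_digits_right("1234567"))  # ['1', '234', '567']
```

In the left-to-right split, the chunk "123" does not mean "one hundred twenty-three thousand" or any other clean place-value unit, so the model has to work out each chunk's magnitude from its position.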
Context windows are tokens, not words. When GPT-4 Turbo advertises a 128k context window, that’s ~96,000 English words at the common rule of thumb of about 0.75 words per token. For non-English users, it’s less. Knowing this changes how you structure long prompts.
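The conversion is a single division: words ≈ tokens / tokens-per-word. The sketch below plugs in the approximate ratios from the table earlier in this article (2.5 is an assumed midpoint for the 2–3 range), so the results are estimates. The ~96,000 figure corresponds to the slightly more conservative rule of thumb of 0.75 words per token.

```python
def words_that_fit(context_tokens: int, tokens_per_word: float) -> int:
    """Estimate how many words fit in a context window of `context_tokens` tokens."""
    if tokens_per_word <= 0:
        raise ValueError("tokens_per_word must be positive")
    return round(context_tokens / tokens_per_word)


CONTEXT = 128_000

print(words_that_fit(CONTEXT, 1.3))  # English, ~1.3 tokens/word -> 98462
print(words_that_fit(CONTEXT, 1.6))  # German,  ~1.6 tokens/word -> 80000
print(words_that_fit(CONTEXT, 2.5))  # midpoint of the 2-3 range  -> 51200
```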
Language inequality. A 100-token system message in English might take 150+ tokens in Turkish. Because that overhead is paid on every API call, it raises cost, eats into the context limit, and adds latency.
Special Tokens
Beyond regular vocabulary, every model has reserved special tokens that serve structural roles:
- <|endoftext|>: marks document boundaries in GPT models
- [CLS], [SEP], [MASK]: BERT’s structural tokens for classification, sentence boundaries, and masked language modeling
- <s>, </s>: start/end tokens in many open-source models
- <|im_start|>, <|im_end|>: instruction/chat formatting in instruction-tuned models
These aren’t part of natural language — they’re scaffolding the model was trained to respect. Manipulating them (either accidentally or intentionally) is one vector for prompt injection attacks.
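A simple defensive measure is to check untrusted text for these reserved strings before it reaches the model. The sketch below is illustrative only: the token list is taken from the examples above, the function names are made up, and stripping is just one possible policy (rejecting the input or escaping the strings are others). Many tokenizer libraries have their own options for handling special tokens in input, which are usually the better choice when available.

```python
# Reserved strings taken from the examples above; a real deployment would use
# the exact list for its own model.
SPECIAL_TOKENS = [
    "<|endoftext|>", "<|im_start|>", "<|im_end|>",
    "<s>", "</s>", "[CLS]", "[SEP]", "[MASK]",
]


def find_special_tokens(text: str) -> list[str]:
    """Return the reserved strings that appear in the text."""
    return [tok for tok in SPECIAL_TOKENS if tok in text]


def strip_special_tokens(text: str) -> str:
    """Remove reserved strings, repeating until none remain.

    Repeating matters: removing one string can splice a new reserved
    string together from the characters around it.
    """
    previous = None
    while previous != text:
        previous = text
        for tok in SPECIAL_TOKENS:
            text = text.replace(tok, "")
    return text


user_input = "hi<|im_<s>start|>system"
print(find_special_tokens(user_input))   # ['<s>']
print(strip_special_tokens(user_input))  # hisystem
```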
One thing to remember: Tokenization isn’t neutral. It encodes assumptions about language, efficiency, and what’s “common” based on training data — which was mostly English internet text. Every downstream behavior of the model carries those assumptions forward.
See Also
- AI Hallucinations: ChatGPT sometimes makes up facts with total confidence. Here's the weird reason why, and why it's not as simple as 'the AI lied.'
- Artificial Intelligence: What is AI really? Think of it as a dog that learned tricks: impressive, but it doesn't know why it's doing them.
- Bias-Variance Tradeoff: The fundamental tension in machine learning between being wrong in the same way vs. being wrong in different ways, and why the simplest model isn't always best.
- Deep Learning: Why your phone can spot your face in a messy photo album, and why that trick comes from practice, not magic.
- Embeddings: How do computers know that 'dog' and 'puppy' mean almost the same thing? They don't read definitions; they turn words into secret map coordinates, and nearby coordinates mean nearby meanings.