Tokenization — Core Concepts

How AI models split text into tokens, why BPE changed everything, and the hidden ways tokenization shapes what an AI can and can't do.

What Tokenization Actually Is

Every language model needs to convert human language into numbers before it can do anything with it. You can’t feed a neural network the string “Hello world” — you need numbers. Tokenization is the process that bridges those two worlds.

A token is the basic unit a model works with. It’s not a character, and it’s usually not a whole word. It’s somewhere in between — a fragment that balances two competing goals:

Vocabulary shouldn’t be too large (hard to learn, too many parameters)
Vocabulary shouldn’t be too small (common words split into too many pieces)

The tokenizer is a lookup table. It converts text into a sequence of integer IDs. “The cat sat” might become [464, 3797, 3332]. Everything downstream — attention, generation, the whole model — operates on those integers.

How Modern Tokenizers Work: BPE

The dominant method today is Byte Pair Encoding (BPE), originally a data compression algorithm from 1994 that got repurposed for NLP.

Here’s the gist:

Start with every character as its own token
Count the most frequent pair of adjacent tokens
Merge that pair into a new single token
Repeat thousands of times

After enough merges, common words like “the” become single tokens. Common word-parts like “ing”, “un-”, “-tion” get their own tokens. Rare or domain-specific words get split into pieces.

GPT-4’s tokenizer (cl100k_base) has a vocabulary of 100,277 tokens. Compare that to GPT-2’s 50,257 — the growth reflects broader multilingual support and better compression efficiency.

The Numbers That Actually Matter

Most people are fuzzy on the token-to-word ratio. Here’s a useful benchmark:

Language	Tokens per word (approx.)
English	~1.3
Spanish	~1.5
German	~1.6
Chinese	~2–3
Arabic	~2–3
Code (Python)	~0.8–1.2

English is tokenizer-advantaged. This is a real problem — a user writing in Japanese burns through their context window 2x faster than an equivalent English speaker, for the same conceptual content.

Code is often better than English, because programming keywords like def, for, return appear so frequently they get efficient single tokens.

Common Misconception: One Token = One Word

This causes real confusion. Consider how GPT-4’s tokenizer handles some examples:

"tokenization" → ["token", "ization"] — 2 tokens
"indistinguishable" → ["ind", "isting", "u", "ishable"] — 4 tokens
"ChatGPT" → ["Chat", "G", "PT"] — 3 tokens
" hello" (with a leading space) → different token than "hello" without one

That last one surprises people. A space before a word is often encoded into the token itself. " the" and "the" may be different token IDs. This has practical consequences: if you split on spaces before tokenizing, you’ll get different results than if you don’t.

Why Tokenization Shapes Model Behavior

Most people think tokenization is just an input preprocessing step. It’s actually much more consequential:

Spelling and character-level tasks are hard. Ask an LLM to count the letters in “strawberry” and it’ll frequently fail. It’s not seeing individual characters — it’s seeing ["st", "raw", "berry"] and has to reconstruct character-level reasoning from that. It’s like asking someone to count syllables in a language they’ve never heard.

Numbers get split unpredictably. The number 1234567 might tokenize as ["12", "345", "67"]. This makes arithmetic genuinely difficult at the token level — models have to reassemble number meaning from arbitrary chunks.

Context windows are tokens, not words. When GPT-4 advertises a 128k context window, that’s ~96,000 words. For non-English users, it’s less. Knowing this changes how you structure long prompts.

Language inequality. A 100-token system message in English might take 150+ tokens in Turkish. This affects pricing, context limits, and latency — equally for every API call.

Special Tokens

Beyond regular vocabulary, every model has reserved special tokens that serve structural roles:

<|endoftext|> — marks document boundaries in GPT models
[CLS], [SEP], [MASK] — BERT’s structural tokens for classification, sentence boundaries, and masked language modeling
<s>, </s> — start/end tokens in many open-source models
<|im_start|>, <|im_end|> — instruction/chat formatting in instruction-tuned models

These aren’t part of natural language — they’re scaffolding the model was trained to respect. Manipulating them (either accidentally or intentionally) is one vector for prompt injection attacks.

One thing to remember: Tokenization isn’t neutral. It encodes assumptions about language, efficiency, and what’s “common” based on training data — which was mostly English internet text. Every downstream behavior of the model carries those assumptions forward.

tokenizationBPELLMsNLPvocabulary