Tokenization — Deep Dive
The Tokenization Landscape
Most practitioners work with tokenization as a black box. You call tokenizer.encode(), you get integers, you move on. But the choices made inside that box have surprising downstream effects — on training stability, multilingual performance, adversarial robustness, and even what kinds of reasoning the model can do.
Four names dominate subword tokenization: three algorithms (BPE, WordPiece, Unigram LM) and one library, SentencePiece, that packages two of them. They look similar on the surface but have meaningfully different properties.
Algorithm 1: Byte Pair Encoding (BPE)
Used by: GPT-2, GPT-3, GPT-4, RoBERTa, BART, Llama, Mistral
Original paper: Sennrich et al., 2015 (“Neural Machine Translation of Rare Words with Subword Units”)
The training procedure:
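# Pseudocode for BPE training
vocab = set of all characters in corpus + special end-of-word token
merges = []                                  # ordered list of learned merges
while len(vocab) < target_vocab_size:
    pairs = count_adjacent_pairs(corpus)     # maps (left, right) -> frequency
    best = max(pairs, key=pairs.get)         # most frequent adjacent pair
    merged_token = best[0] + best[1]
    vocab.add(merged_token)
    merges.append(best)                      # merge order matters at encode time
    corpus = replace_all(corpus, best, merged_token)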
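Key property: BPE is deterministic and greedy. Given a trained vocabulary and its ordered merge list, encoding splits the input into base symbols and replays the merges in the order they were learned, so the earliest-learned merge always wins. This is not a left-to-right longest-match scan (that is how WordPiece encodes). The same input always yields the same tokens, which keeps inference fast and encoding stable.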
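As a minimal sketch of the training loop and of encoding, assuming a toy corpus of already-split words and a `</w>` end-of-word marker. The names `train_bpe`, `merge_pair`, and `encode` are illustrative and not taken from any library; production tokenizers use much faster data structures.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (toy corpus)."""
    # Each distinct word becomes a tuple of characters plus an end-of-word marker.
    corpus = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)  # most frequent pair; first seen wins ties
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges

def encode(word, merges):
    """Encode one word by replaying the merges in the order they were learned."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(words, num_merges=10)
print(merges[:3])
print(encode("lowest", merges))
```

Note that `encode` never searches for the longest match. It only applies merges the training run discovered, in priority order, which is why two BPE tokenizers with the same vocabulary but different merge orders can segment the same word differently.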
Byte-level BPE (used by GPT-2 onwards) operates on UTF-8 bytes rather than Unicode characters. This guarantees complete vocabulary coverage: any input, in any language, can be represented, so there are no UNK tokens. The tradeoff is that rare characters become multiple tokens (a Chinese character is 3 bytes in UTF-8, so an uncommon one may cost 2–3 tokens), but the base vocabulary stays fixed at 256 byte values plus learned merges.
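To make the byte-level starting point concrete, the short sketch below prints the UTF-8 byte values a byte-level tokenizer would begin from before any merges are applied. The helper name `utf8_bytes` is illustrative.

```python
def utf8_bytes(text):
    """Return the base byte values (0-255) that byte-level BPE starts from."""
    return list(text.encode("utf-8"))

for ch in ["a", "é", "中"]:
    print(repr(ch), len(utf8_bytes(ch)), "byte(s):", utf8_bytes(ch))
```

Every value is below 256, which is why no input can fall outside the base vocabulary. How many tokens a character finally costs depends on which byte sequences the trained merges happen to cover.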
Algorithm 2: WordPiece
Used by: BERT, DistilBERT, ELECTRA
WordPiece starts from the same character-level foundation but uses a different merge criterion. Instead of picking the most frequent pair, it picks the merge that most increases the likelihood of the training data, which is commonly described with the score:
score(A, B) = freq(AB) / (freq(A) × freq(B))
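This penalizes merging tokens that are common on their own. Because WordPiece pre-splits on whitespace, candidates are always pieces of the same word: if “un” and “##able” are both very frequent, WordPiece will not necessarily merge them even if “unable” appears often, because each piece also occurs in many other words and the individual tokens already explain the data well.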
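A small worked example of the score, using made-up counts purely for illustration. It compares a pair of individually common pieces with a pair of pieces that almost always occur together.

```python
def wordpiece_score(freq_pair, freq_a, freq_b):
    """score(A, B) = freq(AB) / (freq(A) * freq(B))"""
    return freq_pair / (freq_a * freq_b)

# Hypothetical counts: the pair is frequent, but so is each piece on its own.
common_pieces = wordpiece_score(freq_pair=500, freq_a=10_000, freq_b=8_000)

# Hypothetical counts: the pair is rarer, but the pieces rarely appear apart.
sticky_pieces = wordpiece_score(freq_pair=50, freq_a=60, freq_b=55)

print(f"common pieces: {common_pieces:.2e}")
print(f"sticky pieces: {sticky_pieces:.2e}")
```

Plain BPE would merge the first pair first because its raw count (500) is higher. Under the WordPiece score the second pair wins, because dividing by the individual frequencies rewards pieces that are informative about each other.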
Practical difference: WordPiece tends to produce more linguistically coherent subwords, while BPE is more mechanically frequency-driven. In practice the outputs look similar for English but diverge more for morphologically complex languages.
WordPiece also uses a ## prefix convention to mark continuation subwords. "playing" → ["play", "##ing"]. This makes it explicit which tokens are word-internal vs. word-initial, giving the model extra structural information.
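At encoding time, BERT-style WordPiece uses greedy longest-match-first lookup within each word. The sketch below assumes a tiny hand-written vocabulary and an `[UNK]` fallback; the function name `wordpiece_encode` and the vocabulary are illustrative.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

vocab = {"play", "##ing", "##ed", "##s", "pl", "##ay"}
print(wordpiece_encode("playing", vocab))
print(wordpiece_encode("playful", vocab))
```

With this vocabulary, "playing" yields `["play", "##ing"]`, while "playful" falls back to `["[UNK]"]` because no piece covers "ful". That fallback is the coverage gap byte-level approaches avoid.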
Algorithm 3: Unigram Language Model
Used by: T5, ALBERT, mBART, XLNet (via SentencePiece)
Unigram LM works in the opposite direction from BPE. Instead of starting small and merging, it starts with a large candidate vocabulary and prunes it.
- Initialize with all substrings up to some length
- For each candidate token, compute how much the overall training corpus log-likelihood would drop if you removed it
- Remove the X% of tokens with smallest impact
- Repeat until target vocab size
This produces a probabilistic tokenizer. For any input string, multiple tokenizations are valid, and the tokenizer picks the most probable one. During training, you can sample from this distribution — a regularization technique called subword regularization that makes models more robust to tokenization variation.
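Picking the most probable tokenization is a shortest-path problem that can be solved with Viterbi-style dynamic programming. The sketch below assumes a tiny vocabulary with made-up token probabilities; `viterbi_segment` is an illustrative name, not a library function.

```python
import math

def viterbi_segment(text, log_probs):
    """Return (tokens, log_prob) of the most probable segmentation of `text`."""
    n = len(text)
    # best[i] = (best log-probability of text[:i], start index of its last token)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in log_probs and best[start][0] > -math.inf:
                score = best[start][0] + log_probs[piece]
                if score > best[end][0]:
                    best[end] = (score, start)
    if best[n][0] == -math.inf:
        raise ValueError("text cannot be segmented with this vocabulary")
    # Walk the back-pointers from the end of the string to recover the tokens.
    tokens, end = [], n
    while end > 0:
        start = best[end][1]
        tokens.append(text[start:end])
        end = start
    return tokens[::-1], best[n][0]

# Hypothetical unigram probabilities, for illustration only.
probs = {"play": 0.05, "ing": 0.04, "pl": 0.01, "aying": 0.002, "playing": 0.0005}
log_probs = {token: math.log(p) for token, p in probs.items()}

tokens, score = viterbi_segment("playing", log_probs)
print(tokens, round(math.exp(score), 6))
```

Here the two-token split wins even though "playing" is itself in the vocabulary, because 0.05 × 0.04 = 0.002 is larger than 0.0005.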
The stochastic property is a genuine advantage. A model trained with subword regularization sees “playing” as both ["play", "ing"] and ["pl", "aying"] across training examples, making it less brittle to unusual tokenizations at inference time.
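A toy version of sampling for subword regularization: enumerate every valid segmentation and draw one in proportion to its probability. This brute-force enumeration only works for short strings and is an assumption of this sketch; real implementations sample far more efficiently. The `alpha` parameter is a smoothing exponent, and the probabilities are invented for illustration.

```python
import math
import random

def all_segmentations(text, vocab):
    """Enumerate every way to split `text` into tokens from `vocab`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in all_segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results

def sample_segmentation(text, probs, rng, alpha=1.0):
    """Draw one segmentation with probability proportional to P(segmentation) ** alpha."""
    candidates = all_segmentations(text, probs)
    weights = [math.prod(probs[t] for t in seg) ** alpha for seg in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical unigram probabilities, for illustration only.
probs = {"play": 0.05, "ing": 0.04, "pl": 0.01, "aying": 0.002}
rng = random.Random(0)
for _ in range(5):
    print(sample_segmentation("playing", probs, rng, alpha=0.5))
```

Lower values of `alpha` flatten the distribution so that unlikely segmentations are seen more often during training; `alpha=1.0` samples from the model's own distribution.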
Algorithm 4: SentencePiece
SentencePiece is not an algorithm itself but a library (Kudo & Richardson, 2018) that implements both BPE and Unigram LM. Its distinguishing feature is that it treats text as a raw Unicode stream, with no pre-tokenization step.
Standard BPE assumes you’ve already split on whitespace. SentencePiece doesn’t. It processes the character stream directly, which means:
- Whitespace is encoded as a token (▁ prefix, U+2581)
- No language-specific preprocessing needed
- Works the same way for Arabic, Chinese, Thai, and other scripts, including those written without spaces between words
This is why SentencePiece dominates multilingual models. Llama 2’s tokenizer uses SentencePiece BPE with a 32k vocabulary. Google’s T5 and mT5 use SentencePiece Unigram.
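The whitespace handling described above can be imitated in a few lines of plain Python. This is a sketch of the convention only, not the SentencePiece API: it assumes a marker is also prepended to the start of the text, and the helper names and the example pieces are illustrative.

```python
META = "\u2581"  # "▁", the visible stand-in for a space

def mark_whitespace(text):
    """Prefix the text with the marker and turn every space into the marker."""
    return META + text.replace(" ", META)

def detokenize(pieces):
    """Invert the convention: join pieces, restore spaces, drop the leading one."""
    text = "".join(pieces).replace(META, " ")
    return text[1:] if text.startswith(" ") else text

marked = mark_whitespace("Hello world")
print(marked)

# A hypothetical segmentation of the marked string into subword pieces.
pieces = [META + "Hello", META + "wor", "ld"]
print(detokenize(pieces))
```

Because spaces survive as ordinary symbols, decoding is a plain string join with no language-specific rules for where spaces go.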
What Tiktoken Does Differently
OpenAI’s tiktoken (used for GPT-3.5+, GPT-4, text-embedding-ada-002) is byte-level BPE. Like GPT-2’s original tokenizer, it splits text with a regex before BPE runs, so merges never cross the split boundaries; what cl100k_base changed is the pattern itself.
# From tiktoken's cl100k_base pattern
PAT = r"""'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+"""
This regex splits text into runs of letters, runs of digits (at most 3 per chunk), punctuation, and whitespace, with special handling for English contractions. The effect: digit runs are always cut, left to right, into chunks of at most 3 digits. "12345" becomes ["123", "45"], never ["1", "2", "3", "4", "5"] and never ["12345"].
The 3-digit cap is often explained as a way to make arithmetic more tractable, since it limits how many different token shapes the same number can take. Note that the chunks are cut left to right, so they do not line up with thousands separators: "12345" is split as 123|45, whereas a human groups it as 12|345. It doesn’t fully solve numeric reasoning, but it’s better than the chaos of arbitrary splits.
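The digit branch of the pattern above can be isolated with the standard library. This is a simplified stand-in: `\d{1,3}` in Python's `re` approximates the `\p{N}{1,3}` branch, and the full pattern needs the third-party `regex` package because `re` does not support `\p{...}` classes.

```python
import re

# Simplified stand-in for the \p{N}{1,3} branch of the cl100k_base pattern.
DIGIT_CHUNKS = re.compile(r"\d{1,3}")

def chunk_digits(number_text):
    """Split a run of digits left to right into chunks of at most 3 digits."""
    return DIGIT_CHUNKS.findall(number_text)

for n in ["12345", "1234567", "42"]:
    print(n, "->", chunk_digits(n))
```

For "1234567" the chunks are 123, 456, 7, while the conventional grouping is 1,234,567. The same quantity therefore gets different chunk boundaries depending on how many digits it has.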
The Context Window Math
Engineers routinely need to estimate token counts without running the full tokenizer. Rule of thumb calibration for cl100k_base (GPT-4):
English prose: ~4 chars/token (or ~0.75 words/token)
Python code: ~3.5 chars/token
JSON: ~3 chars/token
HTML: ~2.5 chars/token
Repeated tokens: much worse
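The extreme case is whitespace. With the older GPT-2 vocabulary, a run of spaces cost roughly one token per space, so 100 trailing spaces wasted close to 100 tokens. cl100k_base includes merged tokens for whitespace runs, so the same padding is far cheaper there; measure with the actual tokenizer rather than assuming. Padding still matters for prompt injection attacks that try to push important instructions outside the model’s context window.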
For embedding APIs, this is purely a cost question. For generative APIs, it affects both cost and quality: in long sequences, information in the middle of the context is used less reliably than information near the beginning or end (the “lost in the middle” phenomenon documented by Liu et al., 2023).
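The rules of thumb above can be wrapped in a rough estimator for budgeting. The ratios are the ones listed in this section; the function names, the `kind` labels, and the limits in the example are illustrative assumptions. Treat the result as an estimate and use the real tokenizer when the count matters.

```python
import math

# Characters per token, taken from the rule-of-thumb list for cl100k_base.
CHARS_PER_TOKEN = {"prose": 4.0, "python": 3.5, "json": 3.0, "html": 2.5}

def estimate_tokens(text, kind="prose"):
    """Rough token count from character length; rounds up to stay conservative."""
    return math.ceil(len(text) / CHARS_PER_TOKEN[kind])

def fits_in_context(text, kind, context_limit, reserve_for_output=0):
    """True if the estimate leaves at least `reserve_for_output` tokens free."""
    return estimate_tokens(text, kind) + reserve_for_output <= context_limit

payload = '{"user": "alice", "items": [1, 2, 3]}' * 100
print(estimate_tokens(payload, "json"))
print(fits_in_context(payload, "json", context_limit=8192, reserve_for_output=1024))
```

Rounding up and reserving output tokens both err on the safe side, which is usually the right bias for an estimate that cannot see the actual vocabulary.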
Tokenization as an Attack Surface
The boundary between tokenizer and model creates exploitable seams.
Homoglyph attacks: The Unicode character а (Cyrillic a, U+0430) looks identical to a (Latin a, U+0061) but tokenizes differently. Prompt injections can exploit this to bypass pattern-matching defenses that operate on decoded text.
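The difference between the two characters is easy to confirm with the standard library. The sketch below also shows one possible mitigation, flagging words that mix scripts; the helper names and the blocked word are illustrative, and the script check is a crude heuristic based on Unicode character names.

```python
import unicodedata

LATIN_A = "a"          # U+0061
CYRILLIC_A = "\u0430"  # U+0430, visually identical in most fonts

def describe(ch):
    """Code point, Unicode name, and UTF-8 bytes of a single character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch), list(ch.encode("utf-8"))

def scripts_used(text):
    """First word of each letter's Unicode name, e.g. {'LATIN', 'CYRILLIC'}."""
    return {unicodedata.name(ch).split()[0] for ch in text if ch.isalpha()}

print(describe(LATIN_A))
print(describe(CYRILLIC_A))

spoofed = "p" + CYRILLIC_A + "ssword"
print("password" in spoofed)   # a naive substring filter misses it
print(scripts_used(spoofed))   # mixed scripts inside one word are suspicious
```

Because the byte sequences differ, a byte-level tokenizer produces different tokens for the two spellings even though they render the same on screen.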
Token boundary manipulation: Some safety filters match on specific token IDs or exact surface strings. But the model reasons on whatever token sequence it receives. A small perturbation, such as an inserted space or invisible character, can change how a word is segmented ("bomb" → ["bo", "mb"]) so that it no longer triggers the filter while still being “understood” by the model.
Vocabulary exploitation: Some token IDs map to unusual strings that destabilize the model’s behavior. The “SolidGoldMagikarp” phenomenon (discovered by Rumbelow & Watkins, 2023) showed that certain tokens in the GPT-2/3 vocabulary were rarely or never seen during model training: they were frequent enough in the data used to build the tokenizer to earn a vocabulary slot, but that data was apparently filtered out of the training corpus. Prompting the model to repeat these tokens produced erratic behavior: gibberish, topic switching, or refusals.
The Future: Are We Moving Beyond Discrete Tokens?
Tokenization is a convenient hack. It compresses input efficiently, but it introduces a string of problems:
- Inconsistent behavior for character-level tasks
- Language inequality in vocabulary efficiency
- Spelling errors that are actually tokenization artifacts
- Numeric reasoning challenges
Three active directions:
MEGABYTE (Yu et al., 2023, Meta): Operates directly on bytes, with no tokenizer at all. The byte sequence is cut into fixed-size patches; a large global model handles long-range context across patches, and a small local model predicts the bytes within each patch. The paper reports results competitive with subword-based models on long sequences. Not yet deployed at scale.
RWKV and linear attention approaches: These architectures reduce the cost of very long context, which changes the economics of byte-level modeling. If processing 10x more tokens is cheap enough, tokenization’s compression benefit matters less.
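A back-of-the-envelope calculation shows why the attention mechanism decides whether byte-level modeling is affordable. The sketch assumes only that full self-attention cost grows with the square of sequence length and that linear-attention cost grows in proportion to it; constants and all other model costs are ignored, and `relative_cost` is an illustrative name.

```python
def relative_cost(length_ratio):
    """Cost multipliers when the sequence becomes `length_ratio` times longer."""
    return {
        "linear": length_ratio,          # cost proportional to sequence length
        "quadratic": length_ratio ** 2,  # cost of a full attention score matrix
    }

# English prose at ~4 characters per token: bytes give a ~4x longer sequence.
print(relative_cost(4))
# The 10x case mentioned above.
print(relative_cost(10))
```

At a 10x longer sequence, quadratic attention costs about 100x more while a linear-cost architecture pays about 10x, which is the gap these approaches aim to close.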
Learned tokenizers during training: A few papers (Ziegler et al., 2022) have explored learning the tokenization jointly with the model, rather than training a separate BPE offline. The tokenizer specializes to the downstream task. This hasn’t reached production yet.
For now, BPE and SentencePiece remain dominant. But the field consensus is shifting: the current generation of tokenizers is an engineering compromise, not an optimal solution. Whatever replaces them will probably look very different.
One thing to remember: A tokenizer trained on 2020 internet data will never optimally serve a 2026 model fine-tuned for chemistry papers or legal contracts. The vocabulary mismatch is a genuine source of degradation that doesn’t show up in standard benchmarks — but domain experts notice it immediately.