A token is usually not a word
Everyone's mental model of tokenisation is the same at first: “the model reads words”. And for a surprisingly long time, that's good enough. But it isn't quite right. Modern language models read sub-word pieces — fragments somewhere between characters and words, carefully chosen so that common sequences collapse into a single unit while rare or novel ones fall back to smaller pieces.
Why? Two reasons, both practical:
- Open vocabulary. With whole-word tokens, any word the model has never seen becomes [UNK]. Sub-word tokens decompose gracefully: “unfortunateness” becomes something like un + fortun + ate + ness, and each of those pieces is already in the model's vocabulary.
- Density. Frequent words and fragments like “the” or “ing” get a single token. The model's context window then covers more meaning per slot than character-level tokenization would.
Byte-pair encoding, in one paragraph
The dominant algorithm is called BPE (byte-pair encoding). You start with every character — or every byte — as its own token. Then you count the most frequent adjacent pair in your training corpus, merge it into a single symbol, and repeat. After tens of thousands of rounds you end up with a vocabulary where some entries are whole words (the, and), some are common suffixes (-ing, -ed), and some are rare-but-useful character sequences that serve as a fallback for anything unfamiliar.
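That loop is short enough to sketch. Here is a toy BPE trainer in plain Python; the corpus and merge count are made up for illustration, and real implementations work at the byte level with far larger corpora:

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    # Start with each word as a tuple of single-character symbols.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

merges = bpe_train("low low low lower lowest", 3)
# First merges are the most frequent pairs: ('l','o'), then ('lo','w')
```

Run on a real corpus for tens of thousands of merges, this produces exactly the mix described above: whole words, common affixes, and character-level fallbacks.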
Every modern tokenizer — GPT-4o's o200k_base, Llama's SentencePiece, Gemma's, Qwen's — is a variation on this same 1994 compression algorithm.
The real thing — try it yourself
The playground below is not a toy. It runs the actual BPE merge tables used by OpenAI's production models. When you type a sentence, each keystroke runs through the same 200,064-entry vocabulary that GPT-4o uses, and the integer underneath each chip is the real token ID — the number that gets fed into the transformer's embedding lookup.
Four tokenizers are available: GPT-4o (200k vocab, the current frontier), GPT-4 / GPT-3.5 (100k, cl100k_base), GPT-3 (50k, p50k_base), and GPT-2 (50k, the historical baseline from 2019). Try the same sentence through each of them and watch the token count shift. Larger vocabularies produce fewer tokens per sentence — they've learned more common phrases as single units.
What to try in the playground
Four quick experiments worth the minute they take:
- Common English: type “the cat sat on the mat”. You'll see 7 tokens in o200k_base — every word is one token because all of them are in the high-frequency vocabulary.
- Something technical: try “transformer” or “backpropagation”. Watch how some words stay as one token while others fragment — the vocabulary has learned these specific technical terms because they appeared often enough in training.
- A rare or invented word: type “unfurnishednesses” or “xkdftw”. BPE gracefully decomposes it into smaller and smaller pieces until it hits character-level fragments — this is the fallback that makes it “open vocabulary”.
- Non-English text: try Japanese (“こんにちは”), Arabic, Hindi, or emoji (🚀🎉). The older tokenizers (GPT-2, GPT-3) will explode the token count because they weren't trained on much non-English data. GPT-4o's o200k_base was specifically designed to be better at this — try the same Japanese string through GPT-2 and then through GPT-4o and compare the token counts.
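Part of why non-English text fragments so badly is simple byte arithmetic: a byte-level BPE with few learned merges for a script falls back toward one token per UTF-8 byte, and CJK characters are three bytes each. A two-line check makes this concrete:

```python
# 5 hiragana characters look short on screen but are 15 bytes in UTF-8.
# An English-heavy byte-level tokenizer with no Japanese merges pays
# close to one token per byte for them.
text = "こんにちは"
n_chars = len(text)                  # 5 characters
n_bytes = len(text.encode("utf-8"))  # 15 bytes (3 per character)
```

So even before any vocabulary comparison, the older tokenizers start from a 3x handicap on this string.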
Probability over a vocabulary — how big is big?
Once the text is tokenized, the model's job at each step is to output a probability distribution over the entire vocabulary. For a 200,064-token vocabulary, that's a vector of 200,064 positive real numbers that must sum to 1.
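The conversion from the model's raw scores (logits) to that distribution is the softmax function. A minimal sketch over a toy 5-entry “vocabulary” (a real model does the identical operation over ~200k entries):

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability, then exponentiate
    # and normalise so the outputs are positive and sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.5, -1.0, -3.0])
# probs is a valid probability distribution: all positive, sums to 1,
# and the largest logit gets the largest probability.
```
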
The cost of predicting over a big vocabulary is real — it's the last matmul in the whole network (the output projection, sometimes called the “LM head”), and for a ~3B model with a 200k vocab it can easily be 10–15% of total parameters. This is why the tied embeddings trick, which we cover in Act II, is such a valuable optimization — you save the cost of one of the two big vocab-sized matrices by sharing weights.
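The arithmetic behind that 10–15% figure is worth doing once. A back-of-envelope calculation, assuming a hidden size of 2,048 for the hypothetical ~3B model (the hidden size is an illustrative assumption, not a figure from the text):

```python
vocab_size = 200_064        # GPT-4o-scale vocabulary
d_model = 2_048             # assumed hidden size for a ~3B model
total_params = 3_000_000_000

# The LM head is a single [d_model x vocab_size] projection matrix.
lm_head_params = vocab_size * d_model
fraction = lm_head_params / total_params
print(f"LM head: {lm_head_params:,} params ({fraction:.1%} of the model)")
# → LM head: 409,731,072 params (13.7% of the model)
```

The input embedding table has the same shape, which is exactly why tying the two matrices saves a full vocab-sized matrix.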
The distribution is not the answer — it is the material the answer is sampled from. A decoder draws one token from that 200,064-wide vector, appends it to the context, and runs the whole forward pass again for the next position. How it draws matters: argmax (greedy) picks the single highest-probability token every time — deterministic, often dull, and pathologically prone to loops. Temperature divides the logits before softmax, so a temperature below 1 sharpens the distribution and a temperature above 1 flattens it. Top-p (nucleus) sampling keeps only the smallest set of tokens whose probabilities sum to p (typically 0.9) and renormalises — the point is to cut off the fat tail: some 199,000 tokens, each with negligible probability, that collectively hold 10% of the mass and almost always represent noise. These aren't stylistic knobs; they are the reason the same model with greedy decoding and with temperature-plus-top-p sampling feels like two different products.
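Both knobs fit in a few lines. A pure-Python sketch of one decoding step combining temperature and top-p (names and the toy logits are illustrative; real decoders do this with tensor ops over the full vocabulary):

```python
import math
import random

def sample_next(logits, temperature=1.0, top_p=0.9, rng=random):
    # Temperature divides the logits: T < 1 sharpens, T > 1 flattens.
    scaled = [x / temperature for x in logits]
    # Softmax with max-subtraction for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    probs = [(i, e / z) for i, e in enumerate(exps)]
    # Nucleus (top-p): keep the smallest high-probability prefix
    # whose mass reaches top_p, discarding the long tail.
    probs.sort(key=lambda pair: pair[1], reverse=True)
    kept, mass = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        mass += p
        if mass >= top_p:
            break
    # Draw from the kept set, renormalised to the retained mass.
    r = rng.random() * mass
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]

# With one dominant logit and a low temperature, the nucleus collapses
# to a single token and sampling becomes effectively greedy.
token = sample_next([10.0, 0.0, 0.0, 0.0], temperature=0.5, top_p=0.9)
```

Greedy decoding is the limiting case: as the temperature approaches 0 the distribution collapses onto the argmax token.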