A token is usually not a word
Everyone's mental model of tokenisation is the same at first: “the model reads words”. And for a surprisingly long time, that's good enough. But it isn't quite right. Modern language models read sub-word pieces — fragments somewhere between characters and words, carefully chosen so that common sequences collapse into a single unit while rare or novel ones fall back to smaller pieces.
Why? Two reasons, both practical:
- Open vocabulary. With whole-word tokens, any word the model has never seen becomes [UNK]. Sub-word tokens decompose gracefully: “unfortunateness” becomes something like un + fortun + ate + ness, and each of those pieces is already in the model's vocabulary.
- Density. Frequent words and fragments like “the” or “ing” get a single token. The model's context window then covers more meaning per slot than character-level tokenization would.
Byte-pair encoding, in one paragraph
The dominant algorithm is called BPE (byte-pair encoding). You start with every character — or every byte — as its own token. Then you count the most frequent adjacent pair in your training corpus, merge it into a single symbol, and repeat. After tens of thousands of rounds you end up with a vocabulary where some entries are whole words (the, and), some are common suffixes (-ing, -ed), and some are rare-but-useful character sequences that serve as a fallback for anything unfamiliar.
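That loop is short enough to sketch. Here is a toy BPE trainer in plain Python; the corpus and merge count are made up for illustration, and real implementations work at the byte level with far larger corpora:

```python
from collections import Counter

def bpe_train(corpus, num_merges):
    # Start with each word as a tuple of single-character symbols.
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the best pair with one merged symbol.
        new_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_words[tuple(out)] += freq
        words = new_words
    return merges

merges = bpe_train("low low low lower lowest", 3)
# First merges are the most frequent pairs: ('l','o'), then ('lo','w')
```

Run on a real corpus for tens of thousands of merges, this produces exactly the mix described above: whole words, common affixes, and character-level fallbacks.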
Every modern tokenizer — GPT-4o's o200k_base, Llama's SentencePiece, Gemma's, Qwen's — is a variation on this same 1994 compression algorithm.
The real thing — try it yourself
The playground below is not a toy. It runs the actual BPE merge tables used by OpenAI's production models. When you type a sentence, each keystroke runs through the same 200,064-entry vocabulary that GPT-4o uses, and the integer underneath each chip is the real token ID — the number that gets fed into the transformer's embedding lookup.
Four tokenizers are available: GPT-4o (200k vocab, the current frontier), GPT-4 / GPT-3.5 (100k, cl100k_base), GPT-3 (50k, p50k_base), and GPT-2 (50k, the historical baseline from 2019). Try the same sentence through each of them and watch the token count shift. Larger vocabularies produce fewer tokens per sentence — they've learned more common phrases as single units.
What to try in the playground
Four quick experiments worth the minute they take:
- Common English: type “the cat sat on the mat”. You'll see 7 tokens in o200k_base — every word is one token because all of them are in the high-frequency vocabulary.
- Something technical: try “transformer” or “backpropagation”. Watch how some words stay as one token while others fragment — the vocabulary has learned these specific technical terms because they appeared often enough in training.
- A rare or invented word: type “unfurnishednesses” or “xkdftw”. BPE gracefully decomposes it into smaller and smaller pieces until it hits character-level fragments — this is the fallback that makes it “open vocabulary”.
- Non-English text: try Japanese (“こんにちは”), Arabic, Hindi, or emoji (🚀🎉). The older tokenizers (GPT-2, GPT-3) will explode the token count because they weren't trained on much non-English data. GPT-4o's o200k_base was specifically designed to be better at this — try the same Japanese string through GPT-2 and then through GPT-4o and compare the token counts.
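Part of why non-English text fragments so badly is simple byte arithmetic: a byte-level BPE with few learned merges for a script falls back toward one token per UTF-8 byte, and CJK characters are three bytes each. A two-line check makes this concrete:

```python
# 5 hiragana characters look short on screen but are 15 bytes in UTF-8.
# An English-heavy byte-level tokenizer with no Japanese merges pays
# close to one token per byte for them.
text = "こんにちは"
n_chars = len(text)                  # 5 characters
n_bytes = len(text.encode("utf-8"))  # 15 bytes (3 per character)
```

So even before any vocabulary comparison, the older tokenizers start from a 3x handicap on this string.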
Probability over a vocabulary — how big is big?
Once the text is tokenized, the model's job at each step is to output a probability distribution over the entire vocabulary. For a 200,064-token vocabulary, that's a vector of 200,064 positive real numbers that must sum to 1.
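The conversion from the model's raw scores (logits) to that distribution is the softmax function. A minimal sketch over a toy 5-entry “vocabulary” (a real model does the identical operation over ~200k entries):

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability, then exponentiate
    # and normalise so the outputs are positive and sum to 1.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.5, -1.0, -3.0])
# probs is a valid probability distribution: all positive, sums to 1,
# and the largest logit gets the largest probability.
```
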
The cost of predicting over a big vocabulary is real — it's the last matmul in the whole network (the output projection, sometimes called the “LM head”), and for a ~3B model with a 200k vocab it can easily be 10–15% of total parameters. This is why the tied embeddings trick, which we cover in Act II, is such a valuable optimization — you save the cost of one of the two big vocab-sized matrices by sharing weights.
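The arithmetic behind that 10–15% figure is worth doing once. A back-of-envelope calculation, assuming a hidden size of 2,048 for the hypothetical ~3B model (the hidden size is an illustrative assumption, not a figure from the text):

```python
vocab_size = 200_064        # GPT-4o-scale vocabulary
d_model = 2_048             # assumed hidden size for a ~3B model
total_params = 3_000_000_000

# The LM head is a single [d_model x vocab_size] projection matrix.
lm_head_params = vocab_size * d_model
fraction = lm_head_params / total_params
print(f"LM head: {lm_head_params:,} params ({fraction:.1%} of the model)")
# → LM head: 409,731,072 params (13.7% of the model)
```

The input embedding table has the same shape, which is exactly why tying the two matrices saves a full vocab-sized matrix.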
The distribution is not the answer — it is the material the answer is sampled from. A decoder draws one token from that 200,064-wide vector, appends it to the context, and runs the whole forward pass again for the next position. How it draws matters: argmax (greedy) picks the single highest-probability token every time — deterministic, often dull, and pathologically prone to loops. Temperature divides the logits before softmax, so a temperature below 1 sharpens the distribution and a temperature above 1 flattens it. Top-p (nucleus) sampling keeps only the smallest set of tokens whose probabilities sum to p (typically 0.9) and renormalises — the point is to cut off the fat tail: some 199,000 tokens, each with negligible probability, that collectively hold 10% of the mass and almost always represent noise. These aren't stylistic knobs; they are the reason the same model with greedy decoding and with temperature-plus-top-p sampling feels like two different products.
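Both knobs fit in a few lines. A pure-Python sketch of one decoding step combining temperature and top-p (names and the toy logits are illustrative; real decoders do this with tensor ops over the full vocabulary):

```python
import math
import random

def sample_next(logits, temperature=1.0, top_p=0.9, rng=random):
    # Temperature divides the logits: T < 1 sharpens, T > 1 flattens.
    scaled = [x / temperature for x in logits]
    # Softmax with max-subtraction for numerical stability.
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    probs = [(i, e / z) for i, e in enumerate(exps)]
    # Nucleus (top-p): keep the smallest high-probability prefix
    # whose mass reaches top_p, discarding the long tail.
    probs.sort(key=lambda pair: pair[1], reverse=True)
    kept, mass = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        mass += p
        if mass >= top_p:
            break
    # Draw from the kept set, renormalised to the retained mass.
    r = rng.random() * mass
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]

# With one dominant logit and a low temperature, the nucleus collapses
# to a single token and sampling becomes effectively greedy.
token = sample_next([10.0, 0.0, 0.0, 0.0], temperature=0.5, top_p=0.9)
```

Greedy decoding is the limiting case: as the temperature approaches 0 the distribution collapses onto the argmax token.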