the aha moment
Load four real BPE tokenizers — o200k, cl100k, p50k, gpt2 — feed them the same sentence in five languages, and watch the token-per-word ratio climb from 1× in English to 4-5× in Hindi on GPT-2. The fairness gap stops being theory and becomes a number you measured.
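The core measurement fits in a few lines. Here is a minimal sketch, assuming the `tiktoken` package (`pip install tiktoken`); the sentences and the whitespace word-splitting are illustrative stand-ins, not the lab's exact corpus or counting method:

```python
# Minimal sketch of the lab's measurement: tokens-per-word ratio for the
# same sentence across tokenizers and languages. Sentences are illustrative
# translations of one pangram, and text.split() is a rough word count.
import tiktoken

SENTENCES = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "Spanish": "El veloz zorro marrón salta sobre el perro perezoso.",
    "German": "Der schnelle braune Fuchs springt über den faulen Hund.",
    "French": "Le rapide renard brun saute par-dessus le chien paresseux.",
    "Hindi": "तेज़ भूरी लोमड़ी आलसी कुत्ते के ऊपर कूदती है।",
}
ENCODINGS = ["o200k_base", "cl100k_base", "p50k_base", "gpt2"]

for name in ENCODINGS:
    enc = tiktoken.get_encoding(name)
    for lang, text in SENTENCES.items():
        ratio = len(enc.encode(text)) / len(text.split())
        print(f"{name:12s} {lang:8s} {ratio:4.1f} tokens/word")
```

Older vocabularies (gpt2, p50k) merge far fewer non-Latin byte sequences, which is why the ratio climbs as you move down the list and rightward across languages.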
the facts
- Time: 30 min
- Hardware: CPU · Colab
- Act: I · The Landscape
- Status: Live
- Artifact: a tokenizer × language heatmap and an interactive HTML chart you can share (see the sketch below)
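For a rough offline version of the heatmap artifact, here is a sketch assuming `matplotlib` and `tiktoken`, reusing the same illustrative sentences as above; the lab's own chart is interactive HTML, while this saves a static PNG:

```python
# Sketch of the tokenizer × language heatmap: rows are tokenizers, columns
# are languages, cell values are the measured tokens-per-word ratios.
import matplotlib.pyplot as plt
import numpy as np
import tiktoken

SENTENCES = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "Spanish": "El veloz zorro marrón salta sobre el perro perezoso.",
    "German": "Der schnelle braune Fuchs springt über den faulen Hund.",
    "French": "Le rapide renard brun saute par-dessus le chien paresseux.",
    "Hindi": "तेज़ भूरी लोमड़ी आलसी कुत्ते के ऊपर कूदती है।",
}
ENCODINGS = ["o200k_base", "cl100k_base", "p50k_base", "gpt2"]

grid = np.array([
    [len(enc.encode(text)) / len(text.split()) for text in SENTENCES.values()]
    for enc in (tiktoken.get_encoding(name) for name in ENCODINGS)
])

fig, ax = plt.subplots(figsize=(7, 3))
im = ax.imshow(grid, cmap="viridis")
ax.set_xticks(range(len(SENTENCES)), labels=list(SENTENCES))
ax.set_yticks(range(len(ENCODINGS)), labels=ENCODINGS)
for i in range(grid.shape[0]):
    for j in range(grid.shape[1]):
        ax.text(j, i, f"{grid[i, j]:.1f}", ha="center", va="center", color="w")
ax.set_title("tokens per word by tokenizer × language")
fig.colorbar(im, ax=ax, label="tokens/word")
fig.tight_layout()
fig.savefig("token_tax_heatmap.png", dpi=150)
```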
run it locally
Clone the labs repo and run this lab as a script or open it as a notebook:
```bash
git clone https://github.com/iqbal-sk/Microscale-labs.git
cd Microscale-labs
just setup-auto   # auto-detects CPU / CUDA / Mac
just run 01       # or: jupyter lab labs/01-token-tax/lab.py
```
Full install options (uv, pip, or the platform-specific CUDA paths) are in the labs README.
read alongside