Microscale · Act II: Inside the Machine
lesson tied-embeddings · 5 min · 30 xp

Tied embeddings

Halve the parameter count with one assignment

Two big matrices that do almost the same job

A transformer begins and ends with an embedding matrix. Up front, the input embedding $W_E \in \mathbb{R}^{|V| \times d}$ turns a token ID into a vector. At the output, the unembedding (LM head) $W_U \in \mathbb{R}^{d \times |V|}$ turns a vector back into logits over the vocabulary.

For Phi-4-mini's configuration, $|V| = 200{,}064$ and $d = 3072$, each of those matrices has ~615M parameters. Together they would cost about 1.23 billion parameters out of a ~3.8B total. That's 32% of the model doing vocabulary bookkeeping.
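A quick back-of-the-envelope check of those numbers (plain Python; the vocabulary size, hidden dimension, and total count come straight from the paragraph above):

```python
# Parameter cost of the two vocabulary matrices in Phi-4-mini's configuration.
V = 200_064   # vocabulary size |V|
d = 3_072     # hidden dimension

one_matrix = V * d        # params in W_E (or W_U) alone
both = 2 * one_matrix     # untied: input embedding + LM head
total = 3.8e9             # approximate total parameter count

print(f"one matrix: {one_matrix / 1e6:.0f}M")   # ~615M
print(f"both:       {both / 1e9:.2f}B")         # ~1.23B
print(f"share:      {both / total:.0%}")        # ~32%
```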

$$\text{tied: } W_U = W_E^\top$$

The tied embedding trick is the simplest optimisation in this whole course: set them equal. Use the transpose of WEW_E as the unembedding. One matrix, half the parameters, no quality drop on any public benchmark (and often a tiny improvement due to the regularization effect).
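A minimal NumPy sketch of the mechanics, with toy sizes and illustrative names (in PyTorch the same effect is typically achieved by assigning the LM head's weight to be the embedding's weight, so both point at one tensor):

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 1000, 64          # toy vocabulary size and hidden dimension

W_E = rng.standard_normal((V, d)) * 0.02   # the ONE embedding matrix

def embed(token_ids):
    # Input side: token ID -> vector, a row lookup in W_E.
    return W_E[token_ids]

def unembed(h):
    # Output side: vector -> logits, reusing W_E transposed (the "tie").
    return h @ W_E.T

ids = np.array([3, 14, 159])
h = embed(ids)            # shape (3, d)
logits = unembed(h)       # shape (3, V) -- no second matrix anywhere
```

Because both directions read from the same array, a training update to `W_E` changes the lookup table and the head at once; that shared gradient signal is where the mild regularization effect comes from.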

[Interactive calculator: at $|V| = 200\text{k}$, $d = 3072$, with 3.62B non-embedding params — one embedding matrix is 0.61B params, tied total 4.24B, saved by tying 0.61B. Bar chart: parameter share of the LM head, non-embedding vs. saved by tying.]

Who ties, who doesn't

  • Tied: Phi-4-mini, Gemma 3, SmolLM3, many sub-4B models
  • Untied: Llama 3.x (deliberately — they can absorb the cost at their scale)
  • Mixed: some models share them during pretraining and un-tie for fine-tuning

At larger scales (14B+), the embedding matrix becomes a smaller fraction of total params, so untying costs relatively less and the slight capacity advantage of untied can be worth it. For SLMs, though, tying is almost always the right call — a 1B model cannot afford to spend a third of its capacity on vocabulary bookkeeping.
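The scale argument is just arithmetic. Holding the 200k vocabulary fixed and sweeping model size (the hidden sizes below are illustrative round numbers, not real model configs), one embedding matrix's share of the total shrinks fast:

```python
# Share of total params spent on ONE embedding matrix, at several scales.
# Hidden sizes are illustrative, not taken from any real model config.
V = 200_064
for total, d in [(1e9, 2048), (3.8e9, 3072), (14e9, 5120), (70e9, 8192)]:
    share = V * d / total
    print(f"{total / 1e9:>5.1f}B model, d={d}: embedding share {share:.1%}")
# ~41% at 1B, ~16% at 3.8B, ~7% at 14B, ~2% at 70B
```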

The input embedding and unembedding are solving closely related problems — “what does this token mean?” and “given a meaning vector, which token is closest?”. It's not shocking that the same matrix works for both. What's shocking is that for years we trained them separately because “that's what the original Transformer did.”
Comprehension check

What does the 'tied embeddings' trick do?