Attention is permutation-invariant — we have to fix that
Attention as you learned it in the first Act II lesson does not care about the order of tokens. If you shuffle a sentence, the attention scores rearrange to match, and you get exactly the same output. That's useless for language — “dog bites man” is not “man bites dog”. Position has to be injected somehow.
The construction — one pair at a time
Treat each pair of adjacent feature dimensions as a 2D vector. For a token at position $m$, rotate that pair by angle $m\theta$ for some frequency $\theta$. Concretely, if the pair is $(x_1, x_2)$:

$$\begin{pmatrix} x_1' \\ x_2' \end{pmatrix} = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$
Do this to every Q and every K in the attention layer before computing dot products. In a real $d$-dimensional transformer the $d$-dim vectors are carved into $d/2$ pairs, each rotated with a different frequency $\theta_i$. The geometry is still just 2D rotations — stacked.
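To make the pairwise construction concrete, here is a minimal pure-Python sketch. The function names `rotate_pair` and `apply_rope` are mine, not from any library, and this is the naive loop form rather than a vectorized implementation:

```python
import math

def rotate_pair(x1, x2, pos, theta):
    """Rotate the 2D pair (x1, x2) by angle pos * theta."""
    angle = pos * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x1 * c - x2 * s, x1 * s + x2 * c)

def apply_rope(vec, pos, thetas):
    """Apply RoPE to a d-dim vector: rotate each adjacent pair
    (vec[2i], vec[2i+1]) by pos * thetas[i]."""
    out = []
    for i in range(len(vec) // 2):
        a, b = rotate_pair(vec[2 * i], vec[2 * i + 1], pos, thetas[i])
        out.extend([a, b])
    return out
```

Note that rotation preserves the norm of each pair, so RoPE never changes the magnitude of Q or K, only their direction.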
The offset-invariance property — derivation
Rotations have a beautiful algebraic property that is the whole reason RoPE works. The inner product between two rotated vectors depends only on the difference of the rotation angles, not on their absolute values:

$$\langle R_{m\theta}\, q,\; R_{n\theta}\, k \rangle = q^\top R_{m\theta}^\top R_{n\theta}\, k = q^\top R_{(n-m)\theta}\, k$$

The last step uses two facts: rotations compose ($R_a R_b = R_{a+b}$) and rotations are orthogonal ($R_a^\top = R_{-a}$), so $R_{m\theta}^\top R_{n\theta} = R_{(n-m)\theta}$.
So the attention score between a query at position $m$ and a key at position $n$ depends only on the relative offset $n - m$. This is exactly what you want for language — a verb's dependence on its subject depends on the distance between them, not on where they happen to sit in the document. Absolute positions cancel cleanly.
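The offset-invariance property is easy to check numerically. A sketch in pure Python (the helper names `rotate` and `dot` are mine): rotating a query and key to positions (3, 7) gives the same dot product as positions (103, 107), because both pairs have offset 4.

```python
import math

def rotate(v, angle):
    """Rotate a 2D vector by the given angle."""
    c, s = math.cos(angle), math.sin(angle)
    return (v[0] * c - v[1] * s, v[0] * s + v[1] * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (0.3, -1.2), (0.8, 0.5)
theta = 0.1

# Same offset n - m = 4, wildly different absolute positions:
s1 = dot(rotate(q, 3 * theta), rotate(k, 7 * theta))
s2 = dot(rotate(q, 103 * theta), rotate(k, 107 * theta))
assert abs(s1 - s2) < 1e-9
```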
Multi-dimensional — frequency schedule
The 2D picture is a teaching lie — a useful one. The real thing operates on $d/2$ pairs of dimensions across the full $d$-dimensional Q and K. Each pair $(x_{2i}, x_{2i+1})$ for $i = 0, \dots, d/2 - 1$ is rotated by its own frequency $\theta_i$. The original paper uses a geometric schedule:

$$\theta_i = 10000^{-2i/d}$$
Plot this against $i$: low-index pairs have big $\theta_i$ and rotate fast with position. High-index pairs have tiny $\theta_i$ and rotate slowly. The idea: different pairs encode position at different time scales.
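The geometric schedule is one line of code. A sketch (the function name `rope_thetas` is mine; the base 10000 is the value from the original paper):

```python
def rope_thetas(d, base=10000.0):
    """Geometric frequency schedule: theta_i = base ** (-2i / d)."""
    return [base ** (-2 * i / d) for i in range(d // 2)]

# For d = 8 the frequencies step down by a factor of 10:
# [1.0, 0.1, 0.01, 0.001]
thetas = rope_thetas(8)
```

The first pair rotates a full radian per position; the last pair barely moves, taking thousands of positions to sweep the same angle.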
Context extension — YaRN and position interpolation
You trained your model at 4k context. You want to deploy it at 32k. Two things can go wrong:
- The positions beyond 4k rotate to angles the model has never seen. Output degenerates immediately past training length.
- Even if the model tolerates the new angles, far-apart tokens can land on rotation phases that alias to those of nearby tokens. The model can't tell a token 20k positions away from one 2k away.
The solution family is called context extension. Three practical methods:

- Position interpolation (PI): rescale positions by the ratio of trained to target length, so all angles stay inside the trained range. Simple, but it compresses the fast frequencies that encode fine-grained local position.
- NTK-aware scaling: instead of rescaling positions, stretch the RoPE base so low frequencies cover the longer context while high frequencies change little.
- YaRN: combine interpolation with frequency-dependent scaling (interpolate the slow pairs, leave the fast pairs mostly alone) plus an attention temperature adjustment.
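The two simplest knobs can be sketched in a few lines. This assumes the standard formulations from the broader context-extension literature, not anything specific to this lesson; the function names are mine, and the NTK exponent $d/(d-2)$ is the commonly cited form:

```python
def pi_positions(pos, train_len, target_len):
    """Position interpolation: linearly compress positions so the
    target-length range maps back into the trained range."""
    return pos * (train_len / target_len)

def ntk_base(base, train_len, target_len, d):
    """NTK-aware scaling: stretch the RoPE base so the slowest
    frequencies cover the longer context while the fastest
    frequencies are nearly unchanged."""
    s = target_len / train_len
    return base * s ** (d / (d - 2))
```

With `train_len=4096` and `target_len=8192`, PI maps position 8192 back to 4096; NTK-aware scaling roughly doubles the base for a typical head dimension, slowing all rotations but hitting the low frequencies hardest.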
Why RoPE is applied to Q and K but not V
This is a subtle but important point. The attention mechanism has three things: query, key, value. RoPE rotates Q and K only. V is left alone. Why?
We want the attention score (which depends on the Q–K dot product) to encode relative position. We do not want the retrieved content (which is V) to be position-dependent, because V carries the actual information being retrieved. If you rotated V too, the retrieved content would be a position-scrambled version of the value: the mechanism would retrieve "what's at position $n$ relative to $m$" but pass the value through rotated by an irrelevant angle.
Position lives only in the attention score computation. Content lives in V, free of position. This clean separation is one of the properties that makes RoPE composable with long-context tricks like sliding windows and FlashAttention — the value pipeline never cares about positions.
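Putting it all together, here is a minimal single-head attention sketch showing where RoPE sits in the pipeline: Q and K are rotated before the dot product, V flows through untouched. This is a naive loop version for clarity, not an efficient implementation, and all names are mine:

```python
import math

def rope(vec, pos, thetas):
    """Rotate adjacent pairs of vec by pos * thetas[i]."""
    out = []
    for i in range(len(vec) // 2):
        a, b = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(pos * thetas[i]), math.sin(pos * thetas[i])
        out += [a * c - b * s, a * s + b * c]
    return out

def attention(qs, ks, vs, thetas):
    """Single-head attention: RoPE on Q and K only, V untouched."""
    out = []
    for m, q in enumerate(qs):
        rq = rope(q, m, thetas)          # position enters here...
        scores = []
        for n, k in enumerate(ks):
            rk = rope(k, n, thetas)      # ...and here — nowhere else
            scores.append(sum(a * b for a, b in zip(rq, rk)))
        mx = max(scores)                 # stable softmax
        w = [math.exp(x - mx) for x in scores]
        z = sum(w)
        w = [x / z for x in w]
        # V is mixed by the weights but never rotated
        out.append([sum(wi * v[j] for wi, v in zip(w, vs))
                    for j in range(len(vs[0]))])
    return out
```

Because position touches only the score computation, swapping in a different position scheme, or none at all, leaves the value pipeline byte-for-byte identical.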