Attention is permutation-invariant — we have to fix that
Attention as you learned it in the first Act II lesson does not care about the order of tokens. If you shuffle a sentence, the attention scores rearrange to match, and you get exactly the same output. That's useless for language — “dog bites man” is not “man bites dog”. Position has to be injected somehow.
The construction — one pair at a time
Treat each pair of adjacent feature dimensions as a 2D vector. For a token at position $m$, rotate that pair by angle $m\theta$ for some frequency $\theta$. Concretely, if the pair is $(x_1, x_2)$:

$$\begin{pmatrix} x_1' \\ x_2' \end{pmatrix} = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$
Do this to every Q and every K in the attention layer before computing dot products. In a real $d$-dimensional transformer the $d$-dim vectors are carved into $d/2$ pairs, each rotated with a different frequency $\theta_i$. The geometry is still just 2D rotations — stacked.
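To make the pairwise construction concrete, here is a minimal pure-Python sketch. The function names `rotate_pair` and `apply_rope` are mine, not from any library, and this is the naive loop form rather than a vectorized implementation:

```python
import math

def rotate_pair(x1, x2, pos, theta):
    """Rotate the 2D pair (x1, x2) by angle pos * theta."""
    angle = pos * theta
    c, s = math.cos(angle), math.sin(angle)
    return (x1 * c - x2 * s, x1 * s + x2 * c)

def apply_rope(vec, pos, thetas):
    """Apply RoPE to a d-dim vector: rotate each adjacent pair
    (vec[2i], vec[2i+1]) by pos * thetas[i]."""
    out = []
    for i in range(len(vec) // 2):
        a, b = rotate_pair(vec[2 * i], vec[2 * i + 1], pos, thetas[i])
        out.extend([a, b])
    return out
```

Note that rotation preserves the norm of each pair, so RoPE never changes the magnitude of Q or K, only their direction.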
The offset-invariance property — derivation
Rotations have a beautiful algebraic property that is the whole reason RoPE works. The inner product between two rotated vectors depends only on the difference of the rotation angles, not on their absolute values:

$$\langle R_{m\theta}\, q,\; R_{n\theta}\, k \rangle = q^\top R_{m\theta}^\top R_{n\theta}\, k = q^\top R_{(n-m)\theta}\, k$$

The last step uses two facts: rotations compose ($R_a R_b = R_{a+b}$) and rotations are orthogonal ($R_a^\top = R_{-a}$), so $R_{m\theta}^\top R_{n\theta} = R_{(n-m)\theta}$.
So the attention score between a query at position $m$ and a key at position $n$ depends only on the relative offset $n - m$. This is exactly what you want for language — a verb's dependence on its subject depends on the distance between them, not on where they happen to sit in the document. Absolute positions cancel cleanly.
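The offset-invariance property is easy to check numerically. A sketch in pure Python (the helper names `rotate` and `dot` are mine): rotating a query and key to positions (3, 7) gives the same dot product as positions (103, 107), because both pairs have offset 4.

```python
import math

def rotate(v, angle):
    """Rotate a 2D vector by the given angle."""
    c, s = math.cos(angle), math.sin(angle)
    return (v[0] * c - v[1] * s, v[0] * s + v[1] * c)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (0.3, -1.2), (0.8, 0.5)
theta = 0.1

# Same offset n - m = 4, wildly different absolute positions:
s1 = dot(rotate(q, 3 * theta), rotate(k, 7 * theta))
s2 = dot(rotate(q, 103 * theta), rotate(k, 107 * theta))
assert abs(s1 - s2) < 1e-9
```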
Multi-dimensional — frequency schedule
The 2D picture is a teaching lie — a useful one. The real thing operates on $d/2$ pairs of dimensions across the full $d$-dimensional Q and K. Each pair $(x_{2i}, x_{2i+1})$ for $i = 0, \dots, d/2 - 1$ is rotated by its own frequency $\theta_i$. The original paper uses a geometric schedule:

$$\theta_i = 10000^{-2i/d}$$
Plot this against $i$: low-index pairs have big $\theta_i$ and rotate fast with position. High-index pairs have tiny $\theta_i$ and rotate slowly. The idea: different pairs encode position at different time scales.
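The geometric schedule is one line of code. A sketch (the function name `rope_thetas` is mine; the base 10000 is the value from the original paper):

```python
def rope_thetas(d, base=10000.0):
    """Geometric frequency schedule: theta_i = base ** (-2i / d)."""
    return [base ** (-2 * i / d) for i in range(d // 2)]

# For d = 8 the frequencies step down by a factor of 10:
# [1.0, 0.1, 0.01, 0.001]
thetas = rope_thetas(8)
```

The first pair rotates a full radian per position; the last pair barely moves, taking thousands of positions to sweep the same angle.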
Context extension — YaRN and position interpolation
You trained your model at 4k context. You want to deploy it at 32k. Two things can go wrong:
- The positions beyond 4k rotate to angles the model has never seen. Output degenerates immediately past training length.
- Even if the model tolerates the new angles, far-apart tokens can land on rotation phases that alias to those of nearby tokens. The model can't tell a token 20k positions away from one 2k away.
The solution family is called context extension. Three practical methods:

- Position interpolation (PI): rescale positions by the ratio of trained to target length, so all angles stay inside the trained range. Simple, but it compresses the fast frequencies that encode fine-grained local position.
- NTK-aware scaling: instead of rescaling positions, stretch the RoPE base so low frequencies cover the longer context while high frequencies change little.
- YaRN: combine interpolation with frequency-dependent scaling (interpolate the slow pairs, leave the fast pairs mostly alone) plus an attention temperature adjustment.
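The two simplest knobs can be sketched in a few lines. This assumes the standard formulations from the broader context-extension literature, not anything specific to this lesson; the function names are mine, and the NTK exponent $d/(d-2)$ is the commonly cited form:

```python
def pi_positions(pos, train_len, target_len):
    """Position interpolation: linearly compress positions so the
    target-length range maps back into the trained range."""
    return pos * (train_len / target_len)

def ntk_base(base, train_len, target_len, d):
    """NTK-aware scaling: stretch the RoPE base so the slowest
    frequencies cover the longer context while the fastest
    frequencies are nearly unchanged."""
    s = target_len / train_len
    return base * s ** (d / (d - 2))
```

With `train_len=4096` and `target_len=8192`, PI maps position 8192 back to 4096; NTK-aware scaling roughly doubles the base for a typical head dimension, slowing all rotations but hitting the low frequencies hardest.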
Why RoPE is applied to Q and K but not V
This is a subtle but important point. The attention mechanism has three things: query, key, value. RoPE rotates Q and K only. V is left alone. Why?
We want the attention score (which depends on the Q–K dot product) to encode relative position. We do not want the retrieved content (which is V) to be position-dependent, because V carries the actual information being retrieved. If you rotated V too, the retrieved content would be a position-scrambled version of the value: the mechanism would retrieve "what's at position $n$ relative to $m$" but pass the value through rotated by an irrelevant angle.
Position lives only in the attention score computation. Content lives in V, free of position. This clean separation is one of the properties that makes RoPE composable with long-context tricks like sliding windows and FlashAttention — the value pipeline never cares about positions.
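Putting it all together, here is a minimal single-head attention sketch showing where RoPE sits in the pipeline: Q and K are rotated before the dot product, V flows through untouched. This is a naive loop version for clarity, not an efficient implementation, and all names are mine:

```python
import math

def rope(vec, pos, thetas):
    """Rotate adjacent pairs of vec by pos * thetas[i]."""
    out = []
    for i in range(len(vec) // 2):
        a, b = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(pos * thetas[i]), math.sin(pos * thetas[i])
        out += [a * c - b * s, a * s + b * c]
    return out

def attention(qs, ks, vs, thetas):
    """Single-head attention: RoPE on Q and K only, V untouched."""
    out = []
    for m, q in enumerate(qs):
        rq = rope(q, m, thetas)          # position enters here...
        scores = []
        for n, k in enumerate(ks):
            rk = rope(k, n, thetas)      # ...and here — nowhere else
            scores.append(sum(a * b for a, b in zip(rq, rk)))
        mx = max(scores)                 # stable softmax
        w = [math.exp(x - mx) for x in scores]
        z = sum(w)
        w = [x / z for x in w]
        # V is mixed by the weights but never rotated
        out.append([sum(wi * v[j] for wi, v in zip(w, vs))
                    for j in range(len(vs[0]))])
    return out
```

Because position touches only the score computation, swapping in a different position scheme, or none at all, leaves the value pipeline byte-for-byte identical.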