Stretching context: YaRN, NTK-by-parts, and attention sinks
RoPE gives you relative positions. It does not give you 128K context for free — the fast-rotating and slow-rotating dimensions each need their own scaling story, and the softmax needs somewhere to dump residual mass. Three regimes, one family of answers, and the mechanism behind every 128K+ open model that shipped in 2024–2026.
Why RoPE doesn't just work at 32K when trained at 4K
In the RoPE lesson we earned the key property: the attention score between query position m and key position n depends only on the offset m − n. That is a statement about relative position. It is not a statement about extrapolation.
Train a model at 4K tokens. The dim-pair frequencies θ_i produce angular trajectories m · θ_i that the weights have seen only up to 4096 · θ_i radians. Deploy at 32K and suddenly every key at position m > 4096 sits at rotation phases the model has never observed during training. The softmax over scores was calibrated for the trained angular range; outside it, scores drift off into nonsense. Output collapses almost immediately past the training length.
There is a second, subtler failure. The low-frequency dim pairs (large i, tiny θ_i) rotate so slowly that across 4K tokens they only sweep a small arc. Stretch the sequence to 32K without adapting and those pairs still only sweep a small arc — effectively the same phase for a token at position 100 and a token at position 20,000. The model loses its ability to distinguish far-apart positions at the dim pairs that were supposed to encode the coarsest positional scale.
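The two regimes are easy to see numerically. A small sketch, assuming the standard RoPE schedule θ_i = base^(−2i/d) with base 10,000 and head dim 128 (theta is an illustrative helper, not any library's API):

```python
import math

# theta_i = base^(-2i/d): the standard RoPE frequency schedule.
def theta(i, d=128, base=10000.0):
    return base ** (-2 * i / d)

for i in (0, 63):                      # fastest and slowest dim pair
    for L in (4096, 32768):
        sweep = L * theta(i)           # total angle reached at position L
        print(f"pair {i:2d}  L={L:5d}  "
              f"{sweep / (2 * math.pi):10.2f} full turns")
```

The fast pair wraps hundreds of times at either length, so its phases are all in-distribution; the slow pair sweeps under a tenth of a turn at 4K, which is why far-apart positions look nearly identical to it.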
Position Interpolation (Chen 2023) — the first attempt
The simplest thing you can do is linear. Before applying RoPE, rescale every position index m by the scale factor s = L_new / L_train:

m → m / s

For an 8K → 32K extension, s = 4. Four real tokens are packed into one virtual position. The model sees only angles in its trained range, but now four physical tokens share each angular slot.
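A minimal sketch of the rescaling (rope_angles is an illustrative helper; positions are divided by s before the per-pair angles are computed):

```python
def rope_angles(pos, d=128, base=10000.0, scale=1.0):
    # Position Interpolation: divide the position index by the scale
    # factor before computing the per-pair rotation angles.
    m = pos / scale
    return [m * base ** (-2 * i / d) for i in range(d // 2)]

# 8K -> 32K extension: scale = 4, so physical position 32768 lands on
# exactly the angles that position 8192 produced during training.
assert rope_angles(32768, scale=4.0) == rope_angles(8192)
```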
A short fine-tune (a few hundred steps on long sequences) adapts the model to the denser angular grid, and perplexity at the new length drops to near-baseline. This is what Llama 2 used for its 4K → 32K extension. It works. Up to a point.
PI rescales every dim uniformly. The high-frequency pairs — which were already making fine distinctions at the 4K scale — get their angular resolution compressed by the same factor s. Past ~2–4×, those pairs can no longer distinguish immediate neighbours, and the model loses its local positional acuity even while gaining global reach. That's the failure visible in the red curve on the hook plot.
NTK-aware — preserve high-frequency resolution
The bloc97 insight: don't scale the position uniformly. Scale the base frequency instead, in a way that leaves the high-freq dims mostly alone and concentrates the interpolation on the low-freq dims. Concretely, replace the base 10,000 with a new base b′ chosen so the lowest-frequency pair still sweeps its full training arc at the new length:

b′ = 10000 · s^(d/(d−2))

For s = 4 and d = 128, b′ ≈ 41,000 — about 4× the original base, which lifts the schedule of the far-right dims just enough to span the new range, while leaving the leftmost dims essentially unchanged.
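A sketch of the adjustment under those assumptions (ntk_base is a hypothetical helper). The algebra works out so the slowest pair is interpolated by exactly s while the fastest pair is untouched:

```python
import math

# NTK-aware scaling (a sketch): replace the base b with b * s**(d/(d-2)).
def ntk_base(base=10000.0, scale=4.0, d=128):
    return base * scale ** (d / (d - 2))

print(round(ntk_base()))                     # roughly 41,000 for s=4, d=128

# Fastest pair (i = 0): theta_0 = b**0 = 1 under either base — unchanged.
# Slowest pair (i = d/2 - 1): theta ends up divided by exactly s, as in PI.
old = 10000.0 ** (-126 / 128)
new = ntk_base() ** (-126 / 128)
assert math.isclose(new, old / 4, rel_tol=1e-9)
```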
On the hook plot, that's the amber curve: flat through the first few multipliers, slowly rising past 8×. NTK holds longer than PI because it keeps the high-freq dims' angular resolution intact. But it still relies on a single global recipe, and the transition between regimes is hand-tuned rather than optimised. At 32× it eventually degrades too.
YaRN — per-band scaling plus attention temperature
YaRN (Peng et al. 2023) is the clean formalisation. Instead of one global base-frequency knob, treat the RoPE dim pairs as a continuum and apply different scaling to different frequency bands. The split is set by wavelength thresholds, comparing each pair's wavelength λ_i = 2π/θ_i against the trained length L_train:
- If λ_i ≪ L_train (high-freq) — extrapolate, i.e. leave the dim untouched. It already completes full rotations within the trained length, so every phase was seen in training — and squishing it is exactly the PI mistake that destroys local resolution.
- If λ_i ≥ L_train (low-freq) — interpolate, dividing θ_i by s. The dim never completed a full rotation at the trained length, so its unscaled angles at longer contexts would fall outside the trained distribution.
- In between — smooth ramp. A linear (or cosine) transition between the two treatments, so there's no discontinuity at the band boundaries.
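The whole per-band schedule fits in a few lines. A sketch assuming YaRN's default thresholds (α = 1 and β = 32 full rotations) and a linear ramp; yarn_freqs is an illustrative name:

```python
import math

def yarn_freqs(d=128, base=10000.0, scale=4.0, train_len=4096,
               alpha=1.0, beta=32.0):
    # r counts how many full rotations pair i completes within the
    # trained length; it decides which band the pair falls in.
    freqs = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        r = train_len * theta / (2 * math.pi)
        # gamma = 1: pure extrapolation (leave theta alone)
        # gamma = 0: pure interpolation (divide theta by scale)
        gamma = min(1.0, max(0.0, (r - alpha) / (beta - alpha)))
        freqs.append(theta * (gamma + (1.0 - gamma) / scale))
    return freqs

f = yarn_freqs()
print(f[0], f[-1] / 10000.0 ** (-126 / 128))   # -> 1.0 0.25
```

The fastest pair keeps θ = 1 untouched; the slowest pair is scaled by exactly 1/s = 0.25, matching PI; everything in the ramp gets a blend.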
YaRN adds one more detail that matters enormously at large multipliers: a temperature correction on the attention softmax. As sequences grow, the total mass the softmax distributes is spread across more keys, so the effective entropy rises. YaRN rescales logits by a factor 1/t, where

√(1/t) = 0.1 · ln(s) + 1

to compensate — a tiny, closed-form fix that restores the softmax's sharpness at the new length. Without it, attention gets diffuse and the model averages rather than retrieves.
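The correction is one line. A sketch; in practice it is commonly folded into the rotary embedding as a multiplier on both q and k, which squares it in the logits (yarn_mscale is an illustrative name):

```python
import math

def yarn_mscale(scale):
    # sqrt(1/t) = 0.1 * ln(s) + 1 for s > 1; no correction otherwise.
    return 0.1 * math.log(scale) + 1.0 if scale > 1 else 1.0

for s in (4, 8, 32):
    # Applied to q and k separately, so the logits see the square.
    print(f"s={s:2d}  logit factor = {yarn_mscale(s) ** 2:.3f}")
```

The factor grows only logarithmically in s — the correction stays mild even at 32×, which is part of why it is safe to apply without retraining the softmax calibration from scratch.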
Llama 3.1 ships a per-band scheme of its own (Meta's llama3 rope_type) — algorithmically distinct from YaRN but in the same family. The teal curve in the hook plot — flat across the full 32× range — is the shape that drove adoption.

Attention sinks — where the softmax dumps its residual mass
Long-context extension solves the positional half of the problem. It does not solve streaming decode. If you want to keep generating past your trained length without accumulating an unbounded KV cache, the obvious move is a sliding window: keep the last w keys, evict everything older. Run that for a few hundred tokens and the model's output collapses into garbage.
The fix, discovered by Xiao et al. in September 2023, is almost comically simple: keep the first 4 tokens pinned, no matter how far the window has advanced. They call these attention sinks. Output stays fluent indefinitely.
The phenomenon is visible in the toggle above. With sinks pinned (copper glow on tokens 0–3), the rest of the window slides normally and the model produces fluent text. Evict the sinks and decode degenerates almost immediately — every head is desperately trying to find some key to dump residual probability onto, but the first tokens are no longer there.
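The eviction policy itself is a few lines of cache bookkeeping. A sketch of the idea only — SinkKVCache is a made-up name, integers stand in for real K/V tensors, and the actual StreamingLLM recipe also reassigns positions within the cache rather than using absolute indices:

```python
from collections import deque

class SinkKVCache:
    """Sliding-window KV cache that pins the first few tokens
    (a sketch of the attention-sink recipe)."""

    def __init__(self, window=8, num_sinks=4):
        self.sinks = []                     # first tokens, never evicted
        self.recent = deque(maxlen=window)  # rolling window, auto-evicts
        self.num_sinks = num_sinks

    def append(self, kv):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv)           # pin the first num_sinks tokens
        else:
            self.recent.append(kv)          # oldest window entry falls off

    def visible(self):
        # What attention actually sees at this decode step.
        return self.sinks + list(self.recent)

cache = SinkKVCache(window=8, num_sinks=4)
for t in range(100):
    cache.append(t)
print(cache.visible())   # [0, 1, 2, 3, 92, 93, 94, 95, 96, 97, 98, 99]
```

Memory stays bounded at num_sinks + window entries no matter how long decode runs, and the sinks give every head a permanent place to park residual probability.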
Production pairings, 2026
Every frontier model shipped in 2025–2026 uses some mixture of the techniques above. The recipes diverge on which combination is easiest to train, not on which is theoretically best — the perplexity curves for YaRN and NTK-with-long-fine-tune converge at modest multipliers, so labs pick what their existing pipelines support.
Trained initially at 4K, then extended via YaRN applied to MLA's decoupled RoPE slice — progressively 4K → 32K → 128K with roughly 1000 fine-tune steps per stage, and the final long-context logit scalar set by the last stage's s. Sink behaviour inherited from standard attention — BOS always present. Llama 3.1 ships a conceptually similar per-band scheme (Meta's llama3 rope_type, factor 8) rather than literal YaRN.
Dual RoPE bases instead of YaRN: local-attention layers use a small base matched to their 1K window, global-attention layers a much larger base matched to the 128K window. Each layer's base is tuned to its actual context size. Simpler to reason about than YaRN; requires a specific local/global sandwich architecture to work.
DeepSeek-V3 (Dec 2024) takes the most layered approach: MLA's decoupled-RoPE slice carries the positional signal on a width-64 channel, and YaRN is applied to that slice alone. The shared latent is un-rotated, so YaRN's ramp only operates on the 64-dim RoPE side. This makes the long-context math independent of the KV-cache compression math — a rare piece of architectural luck.