Stretching context: YaRN, NTK-by-parts, and attention sinks
RoPE gives you relative positions. It does not give you 128K context for free — the fast-rotating and slow-rotating dimensions each need their own scaling story, and the softmax needs somewhere to dump residual mass. Three regimes, one family of answers, and the mechanism behind every 128K+ open model that shipped in 2024–2026.
Why RoPE doesn't just work at 32K when trained at 4K
In the RoPE lesson we earned the key property: the attention score between query position m and key position n depends only on the offset m − n. That is a statement about relative position. It is not a statement about extrapolation.
Train a model at 4K tokens. The dim-pair frequencies θ_i produce angular trajectories m · θ_i that the weights have seen only up to 4096 · θ_i radians. Deploy at 32K and suddenly every key at position m > 4096 sits at rotation phases the model has never observed during training. The softmax over scores was calibrated for the trained angular range; outside it, scores drift off into nonsense. Output collapses almost immediately past the training length.
There is a second, subtler failure. The low-frequency dim pairs (large i, tiny θ_i) rotate so slowly that across 4K tokens they only sweep a small arc. Stretch the sequence to 32K without adapting and those pairs still only sweep a small arc — effectively the same phase for a token at position 100 and a token at position 20,000. The model loses its ability to distinguish far-apart positions at the dim pairs that were supposed to encode the coarsest positional scale.
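The two regimes are easy to see numerically. A small sketch, assuming the standard RoPE schedule θ_i = base^(−2i/d) with base 10,000 and head dim 128 (theta is an illustrative helper, not any library's API):

```python
import math

# theta_i = base^(-2i/d): the standard RoPE frequency schedule.
def theta(i, d=128, base=10000.0):
    return base ** (-2 * i / d)

for i in (0, 63):                      # fastest and slowest dim pair
    for L in (4096, 32768):
        sweep = L * theta(i)           # total angle reached at position L
        print(f"pair {i:2d}  L={L:5d}  "
              f"{sweep / (2 * math.pi):10.2f} full turns")
```

The fast pair wraps hundreds of times at either length, so its phases are all in-distribution; the slow pair sweeps under a tenth of a turn at 4K, which is why far-apart positions look nearly identical to it.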
Position Interpolation (Chen 2023) — the first attempt
The simplest thing you can do is linear. Before applying RoPE, rescale every position index m by the scale factor s = L_new / L_train:

m → m / s

For an 8K → 32K extension, s = 4. Four real tokens are packed into one virtual position. The model sees only angles in its trained range, but now four physical tokens share each angular slot.
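A minimal sketch of the rescaling (rope_angles is an illustrative helper; positions are divided by s before the per-pair angles are computed):

```python
def rope_angles(pos, d=128, base=10000.0, scale=1.0):
    # Position Interpolation: divide the position index by the scale
    # factor before computing the per-pair rotation angles.
    m = pos / scale
    return [m * base ** (-2 * i / d) for i in range(d // 2)]

# 8K -> 32K extension: scale = 4, so physical position 32768 lands on
# exactly the angles that position 8192 produced during training.
assert rope_angles(32768, scale=4.0) == rope_angles(8192)
```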
A short fine-tune (a few hundred steps on long sequences) adapts the model to the denser angular grid, and perplexity at the new length drops to near-baseline. This is what Llama 2 used for its 4K → 32K extension. It works. Up to a point.
PI rescales every dim uniformly. The high-frequency pairs — which were already making fine distinctions at the 4K scale — get their angular resolution compressed by the same factor s. Past ~2–4×, those pairs can no longer distinguish immediate neighbours, and the model loses its local positional acuity even while gaining global reach. That's the failure visible in the red curve on the hook plot.
NTK-aware — preserve high-frequency resolution
The bloc97 insight: don't scale the position uniformly. Scale the base frequency instead, in a way that leaves the high-freq dims mostly alone and concentrates the interpolation on the low-freq dims. Concretely, replace the base 10,000 with a new base b′ chosen so the lowest-frequency pair still sweeps its full training arc at the new length:

b′ = 10000 · s^(d/(d−2))

For s = 4 and d = 128, b′ ≈ 41,000 — about 4× the original base, which lifts the schedule of the far-right dims just enough to span the new range, while leaving the leftmost dims essentially unchanged.
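A sketch of the adjustment under those assumptions (ntk_base is a hypothetical helper). The algebra works out so the slowest pair is interpolated by exactly s while the fastest pair is untouched:

```python
import math

# NTK-aware scaling (a sketch): replace the base b with b * s**(d/(d-2)).
def ntk_base(base=10000.0, scale=4.0, d=128):
    return base * scale ** (d / (d - 2))

print(round(ntk_base()))                     # roughly 41,000 for s=4, d=128

# Fastest pair (i = 0): theta_0 = b**0 = 1 under either base — unchanged.
# Slowest pair (i = d/2 - 1): theta ends up divided by exactly s, as in PI.
old = 10000.0 ** (-126 / 128)
new = ntk_base() ** (-126 / 128)
assert math.isclose(new, old / 4, rel_tol=1e-9)
```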
On the hook plot, that's the amber curve: flat through the first few multipliers, slowly rising past 8×. NTK holds longer than PI because it keeps the high-freq dims' angular resolution intact. But it still relies on a single global recipe, and the transition between regimes is hand-tuned rather than optimised. At 32× it eventually degrades too.
YaRN — per-band scaling plus attention temperature
YaRN (Peng et al. 2023) is the clean formalisation. Instead of one global base-frequency knob, treat the RoPE dim pairs as a continuum and apply different scaling to different frequency bands. The split is set by wavelength thresholds, comparing each pair's wavelength λ_i = 2π/θ_i against the trained length L_train:
- If λ_i ≪ L_train (high-freq) — extrapolate, i.e. leave the dim untouched. It already completes full rotations within the trained length, so every phase was seen in training — and squishing it is exactly the PI mistake that destroys local resolution.
- If λ_i ≥ L_train (low-freq) — interpolate, dividing θ_i by s. The dim never completed a full rotation at the trained length, so its unscaled angles at longer contexts would fall outside the trained distribution.
- In between — smooth ramp. A linear (or cosine) transition between the two treatments, so there's no discontinuity at the band boundaries.
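The whole per-band schedule fits in a few lines. A sketch assuming YaRN's default thresholds (α = 1 and β = 32 full rotations) and a linear ramp; yarn_freqs is an illustrative name:

```python
import math

def yarn_freqs(d=128, base=10000.0, scale=4.0, train_len=4096,
               alpha=1.0, beta=32.0):
    # r counts how many full rotations pair i completes within the
    # trained length; it decides which band the pair falls in.
    freqs = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        r = train_len * theta / (2 * math.pi)
        # gamma = 1: pure extrapolation (leave theta alone)
        # gamma = 0: pure interpolation (divide theta by scale)
        gamma = min(1.0, max(0.0, (r - alpha) / (beta - alpha)))
        freqs.append(theta * (gamma + (1.0 - gamma) / scale))
    return freqs

f = yarn_freqs()
print(f[0], f[-1] / 10000.0 ** (-126 / 128))   # -> 1.0 0.25
```

The fastest pair keeps θ = 1 untouched; the slowest pair is scaled by exactly 1/s = 0.25, matching PI; everything in the ramp gets a blend.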
YaRN adds one more detail that matters enormously at large multipliers: a temperature correction on the attention softmax. As sequences grow, the total mass the softmax distributes is spread across more keys, so the effective entropy rises. YaRN rescales logits by a factor 1/t, where

√(1/t) = 0.1 · ln(s) + 1

to compensate — a tiny, closed-form fix that restores the softmax's sharpness at the new length. Without it, attention gets diffuse and the model averages rather than retrieves.
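The correction is one line. A sketch; in practice it is commonly folded into the rotary embedding as a multiplier on both q and k, which squares it in the logits (yarn_mscale is an illustrative name):

```python
import math

def yarn_mscale(scale):
    # sqrt(1/t) = 0.1 * ln(s) + 1 for s > 1; no correction otherwise.
    return 0.1 * math.log(scale) + 1.0 if scale > 1 else 1.0

for s in (4, 8, 32):
    # Applied to q and k separately, so the logits see the square.
    print(f"s={s:2d}  logit factor = {yarn_mscale(s) ** 2:.3f}")
```

The factor grows only logarithmically in s — the correction stays mild even at 32×, which is part of why it is safe to apply without retraining the softmax calibration from scratch.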
Llama 3.1 ships a per-band scheme of its own (Meta's llama3 rope_type) — algorithmically distinct from YaRN but in the same family. The teal curve in the hook plot — flat across the full 32× range — is the shape that drove adoption.

Attention sinks — where the softmax dumps its residual mass
Long-context extension solves the positional half of the problem. It does not solve streaming decode. If you want to keep generating past your trained length without accumulating an unbounded KV cache, the obvious move is a sliding window: keep the last w keys, evict everything older. Run that for a few hundred tokens and the model's output collapses into garbage.
The fix, discovered by Xiao et al. in September 2023, is almost comically simple: keep the first 4 tokens pinned, no matter how far the window has advanced. They call these attention sinks. Output stays fluent indefinitely.
The phenomenon is visible in the toggle above. With sinks pinned (copper glow on tokens 0–3), the rest of the window slides normally and the model produces fluent text. Evict the sinks and decode degenerates almost immediately — every head is desperately trying to find some key to dump residual probability onto, but the first tokens are no longer there.
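The eviction policy itself is a few lines of cache bookkeeping. A sketch of the idea only — SinkKVCache is a made-up name, integers stand in for real K/V tensors, and the actual StreamingLLM recipe also reassigns positions within the cache rather than using absolute indices:

```python
from collections import deque

class SinkKVCache:
    """Sliding-window KV cache that pins the first few tokens
    (a sketch of the attention-sink recipe)."""

    def __init__(self, window=8, num_sinks=4):
        self.sinks = []                     # first tokens, never evicted
        self.recent = deque(maxlen=window)  # rolling window, auto-evicts
        self.num_sinks = num_sinks

    def append(self, kv):
        if len(self.sinks) < self.num_sinks:
            self.sinks.append(kv)           # pin the first num_sinks tokens
        else:
            self.recent.append(kv)          # oldest window entry falls off

    def visible(self):
        # What attention actually sees at this decode step.
        return self.sinks + list(self.recent)

cache = SinkKVCache(window=8, num_sinks=4)
for t in range(100):
    cache.append(t)
print(cache.visible())   # [0, 1, 2, 3, 92, 93, 94, 95, 96, 97, 98, 99]
```

Memory stays bounded at num_sinks + window entries no matter how long decode runs, and the sinks give every head a permanent place to park residual probability.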
Production pairings, 2026
Every frontier model shipped in 2025–2026 uses some mixture of the techniques above. The recipes diverge on which combination is easiest to train, not on which is theoretically best — the perplexity curves for YaRN and NTK-with-long-fine-tune converge at modest multipliers, so labs pick what their existing pipelines support.
Trained initially at 4K, then extended via YaRN applied to MLA's decoupled RoPE slice — progressively 4K → 32K → 128K with roughly 1000 fine-tune steps per stage, and the final long-context logit scalar set by the last stage's s. Sink behaviour inherited from standard attention — BOS always present. Llama 3.1 ships a conceptually similar per-band scheme (Meta's llama3 rope_type, factor 8) rather than literal YaRN.
Dual RoPE bases instead of YaRN: local-attention layers use a small base matched to their 1K window, global-attention layers a much larger base matched to the 128K window. Each layer's base is tuned to its actual context size. Simpler to reason about than YaRN; requires a specific local/global sandwich architecture to work.
DeepSeek-V3 (Dec 2024) takes the most layered approach: MLA's decoupled-RoPE slice carries the positional signal on a width-64 channel, and YaRN is applied to that slice alone. The shared latent is un-rotated, so YaRN's ramp only operates on the 64-dim RoPE side. This makes the long-context math independent of the KV-cache compression math — a rare piece of architectural luck.