Quantization is 99% about the grid
A quantization scheme answers one question: “I have a fixed number of bits per weight — which values should those bits represent?” Naive approaches use a uniform grid across the whole tensor. This wastes capacity — different slices of a tensor have different distributions, so one scale factor across all of them leaves bins empty in some regions and overloaded in others.
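A tiny synthetic demo makes the waste visible. This is an illustration, not llama.cpp code: quantize a tensor whose two halves live at very different magnitudes with one shared 4-bit symmetric grid, and count how many codes each half actually uses.

```python
import numpy as np

# One scale for a tensor whose slices differ ~100x in magnitude
# (synthetic data, 4-bit symmetric grid with codes -7..7).
rng = np.random.default_rng(0)
w = np.concatenate([rng.normal(0, 0.01, 32),   # a "quiet" slice
                    rng.normal(0, 1.0, 32)])   # a slice with large weights
scale = np.abs(w).max() / 7                    # single scale for everything
codes = np.round(w / scale).astype(int)

# The quiet slice collapses onto a code or two near zero, leaving most
# of its 4-bit grid empty, while the loud slice spreads across the range:
print(len(set(codes[:32].tolist())), "distinct codes in the quiet slice")
print(len(set(codes[32:].tolist())), "distinct codes in the loud slice")
```

With a per-slice scale instead, the quiet slice would get its own small step size and use the full code range.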
The K-quant family (llama.cpp's workhorse) uses a hierarchical super-block structure: each tensor is split into 256-weight super-blocks, each super-block is split into eight 32-weight sub-blocks, and each sub-block gets its own scale and zero-point, stored in 6 bits each. The super-block then stores a pair of FP16 factors — one for the scales, one for the mins — that multiply into everything.
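The structure can be sketched in a few lines. This is a hypothetical simplification, not llama.cpp's actual code: it keeps the per-sub-block scales and mins as floats, whereas real Q4_K additionally packs them into 6 bits against the FP16 super-block factors.

```python
import numpy as np

def quantize_superblock(w):
    """Sketch of one 256-weight super-block: eight 32-weight sub-blocks,
    each with its own scale and zero-point, and a 4-bit code per weight."""
    sub = w.reshape(8, 32)
    mins = sub.min(axis=1, keepdims=True)                    # per-sub-block zero-point
    scales = (sub.max(axis=1, keepdims=True) - mins) / 15.0  # 4-bit codes: 0..15
    scales[scales == 0] = 1.0                                # guard constant sub-blocks
    q = np.clip(np.round((sub - mins) / scales), 0, 15).astype(np.uint8)
    return q, scales, mins

def dequantize_superblock(q, scales, mins):
    return (q * scales + mins).reshape(-1)

w = np.random.default_rng(0).normal(size=256).astype(np.float32)
q, scales, mins = quantize_superblock(w)
w_hat = dequantize_superblock(q, scales, mins)
# Rounding error per weight is bounded by half its sub-block's scale.
```

The key property: the reconstruction error of each weight depends only on its own sub-block's scale, so a quiet sub-block is never penalised for outliers elsewhere in the tensor.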
scale (6 bits) is the per-sub-block multiplier and min (6 bits) is the per-sub-block zero-point. The FP16 super-block scale at the bottom multiplies everything above, recovering full dynamic range without spending 16 bits per sub-block.

Why the hierarchy matters
Per-sub-block scales mean each 32-weight region can be tuned to its local dynamic range. A region with mostly tiny weights gets a small scale and packs those weights densely into the 4-bit codes. A region with a few big outliers gets a large scale that accommodates them without truncation.
The super-block scale on top provides a second, coarse adjustment that's only stored once per 256 weights. It lets the whole super-block slide up or down in magnitude without rescaling all the sub-blocks. The net effective bit rate for Q4_K is about 4.5 bits per weight — 4 for the code, 6 for the sub-block scale and 6 for the min (each amortised over 32 weights ≈ 0.19/weight), plus the FP16 super-block factors amortised across 256 weights.
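The arithmetic checks out as a back-of-the-envelope calculation. This assumes the layout described above — 6-bit sub-block scales and mins, plus two FP16 super-block factors (a scale and a min) per 256 weights:

```python
# Bits consumed by one 256-weight super-block in a Q4_K-style layout:
codes      = 256 * 4   # one 4-bit code per weight
sub_scales = 8 * 6     # one 6-bit scale per 32-weight sub-block
sub_mins   = 8 * 6     # one 6-bit min per 32-weight sub-block
super_fp16 = 2 * 16    # FP16 super-block scale + FP16 super-block min
bpw = (codes + sub_scales + sub_mins + super_fp16) / 256
print(bpw)  # 4.5
```

All the hierarchy overhead together costs only half a bit per weight on top of the raw 4-bit codes.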
This hierarchical design is why K-quants hit much better perplexity at 4 bits than the original uniform Q4_0 scheme. Q4_K_M is the sensible default for local consumer deployment; Q5_K_M if you have memory to spare. Q6_K is essentially indistinguishable from FP16 on almost every task.
Two numbers make the trade-off concrete. On Llama-2-7B with wikitext-2, FP16 perplexity is about 5.68. Q4_0 (single scale per 32-weight block, no hierarchy) lands near 5.96 — a penalty that is visible on summarization and code. Q4_K_M lands at ~5.73 — a gap below measurement noise on most benchmarks — while shrinking the file from 13.5 GB to 4.1 GB. The delta is almost entirely the hierarchy: Q4_K_M spends its ~0.85 extra bits/weight on sub-block scales and mins (and on the higher-precision tensors in the mix) instead of wasting 4-bit codes on dead bin ranges.

The “M” suffix is a second mechanism — mixed precision. The most sensitive tensors (token embeddings, output.weight, the attention value projections) are bumped up to Q6_K while the rest stays at Q4_K; llama.cpp picks this mix per tensor with a built-in sensitivity heuristic, optionally informed by an importance matrix. And the 256-weight super-block size is not arbitrary: it is the largest power of two that keeps the sub-block scale table inside a single AVX2/NEON register load, so dequant is vectorised without spilling.
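The "almost entirely the hierarchy" claim can be reproduced in miniature. This is a synthetic sketch, not the benchmark above: quantize the same 256-weight super-block with one scale for all 256 weights (Q4_0 style) and with one scale per 32-weight sub-block, and compare reconstruction error when one sub-block carries outliers.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(0, 0.02, (8, 32))   # eight quiet 32-weight sub-blocks...
w[3] *= 50                         # ...except one with large outliers

def quant_rmse(x, axis):
    """Asymmetric 4-bit quantization along `axis`; returns reconstruction RMSE."""
    lo = x.min(axis=axis, keepdims=True)
    s = (x.max(axis=axis, keepdims=True) - lo) / 15.0
    q = np.clip(np.round((x - lo) / s), 0, 15)
    return float(np.sqrt(np.mean((q * s + lo - x) ** 2)))

flat = quant_rmse(w.reshape(1, -1), axis=1)   # one scale per 256 weights
hier = quant_rmse(w, axis=1)                  # one scale per 32-weight sub-block
print(f"flat RMSE {flat:.4f}, hierarchical RMSE {hier:.4f}")
```

With a single scale, the outlier sub-block forces a coarse step on the seven quiet sub-blocks; per-sub-block scales confine that damage to the one sub-block that actually needs the large range.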