Quantization is 99% about the grid
A quantization scheme answers one question: “I have a fixed number of bits per weight — which values should those bits represent?” Naive approaches use a uniform grid across the whole tensor. This wastes capacity — different slices of a tensor have different distributions, so one scale factor across all of them leaves bins empty in some regions and overloaded in others.
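A tiny synthetic demo makes the waste visible. This is an illustration, not llama.cpp code: quantize a tensor whose two halves live at very different magnitudes with one shared 4-bit symmetric grid, and count how many codes each half actually uses.

```python
import numpy as np

# One scale for a tensor whose slices differ ~100x in magnitude
# (synthetic data, 4-bit symmetric grid with codes -7..7).
rng = np.random.default_rng(0)
w = np.concatenate([rng.normal(0, 0.01, 32),   # a "quiet" slice
                    rng.normal(0, 1.0, 32)])   # a slice with large weights
scale = np.abs(w).max() / 7                    # single scale for everything
codes = np.round(w / scale).astype(int)

# The quiet slice collapses onto a code or two near zero, leaving most
# of its 4-bit grid empty, while the loud slice spreads across the range:
print(len(set(codes[:32].tolist())), "distinct codes in the quiet slice")
print(len(set(codes[32:].tolist())), "distinct codes in the loud slice")
```

With a per-slice scale instead, the quiet slice would get its own small step size and use the full code range.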
The K-quant family (llama.cpp's workhorse) uses a hierarchical super-block structure: each tensor is split into 256-weight super-blocks, each super-block is split into eight 32-weight sub-blocks, and each sub-block gets its own scale and zero-point, stored in 6 bits each. The super-block then stores a pair of FP16 factors — one for the scales, one for the mins — that multiply into everything.
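The structure can be sketched in a few lines. This is a hypothetical simplification, not llama.cpp's actual code: it keeps the per-sub-block scales and mins as floats, whereas real Q4_K additionally packs them into 6 bits against the FP16 super-block factors.

```python
import numpy as np

def quantize_superblock(w):
    """Sketch of one 256-weight super-block: eight 32-weight sub-blocks,
    each with its own scale and zero-point, and a 4-bit code per weight."""
    sub = w.reshape(8, 32)
    mins = sub.min(axis=1, keepdims=True)                    # per-sub-block zero-point
    scales = (sub.max(axis=1, keepdims=True) - mins) / 15.0  # 4-bit codes: 0..15
    scales[scales == 0] = 1.0                                # guard constant sub-blocks
    q = np.clip(np.round((sub - mins) / scales), 0, 15).astype(np.uint8)
    return q, scales, mins

def dequantize_superblock(q, scales, mins):
    return (q * scales + mins).reshape(-1)

w = np.random.default_rng(0).normal(size=256).astype(np.float32)
q, scales, mins = quantize_superblock(w)
w_hat = dequantize_superblock(q, scales, mins)
# Rounding error per weight is bounded by half its sub-block's scale.
```

The key property: the reconstruction error of each weight depends only on its own sub-block's scale, so a quiet sub-block is never penalised for outliers elsewhere in the tensor.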
scale (6 bits) is the per-sub-block multiplier and min (6 bits) is the per-sub-block zero-point. The FP16 super-block scale at the bottom multiplies everything above, recovering full dynamic range without spending 16 bits per sub-block.

Why the hierarchy matters
Per-sub-block scales mean each 32-weight region can be tuned to its local dynamic range. A region with mostly tiny weights gets a small scale and packs those weights densely into the 4-bit codes. A region with a few big outliers gets a large scale that accommodates them without truncation.
The super-block scale on top provides a second, coarse adjustment that's only stored once per 256 weights. It lets the whole super-block slide up or down in magnitude without rescaling all the sub-blocks. The net effective bit rate for Q4_K is about 4.5 bits per weight — 4 for the code, 6 for the sub-block scale and 6 for the min (each amortised over 32 weights ≈ 0.19/weight), plus the FP16 super-block factors amortised across 256 weights.
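The arithmetic checks out as a back-of-the-envelope calculation. This assumes the layout described above — 6-bit sub-block scales and mins, plus two FP16 super-block factors (a scale and a min) per 256 weights:

```python
# Bits consumed by one 256-weight super-block in a Q4_K-style layout:
codes      = 256 * 4   # one 4-bit code per weight
sub_scales = 8 * 6     # one 6-bit scale per 32-weight sub-block
sub_mins   = 8 * 6     # one 6-bit min per 32-weight sub-block
super_fp16 = 2 * 16    # FP16 super-block scale + FP16 super-block min
bpw = (codes + sub_scales + sub_mins + super_fp16) / 256
print(bpw)  # 4.5
```

All the hierarchy overhead together costs only half a bit per weight on top of the raw 4-bit codes.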
This hierarchical design is why K-quants hit much better perplexity at 4 bits than the original uniform Q4_0 scheme. Q4_K_M is the sensible default for local consumer deployment; Q5_K_M if you have memory to spare. Q6_K is essentially indistinguishable from FP16 on almost every task.
Two numbers make the trade-off concrete. On Llama-2-7B with wikitext-2, FP16 perplexity is about 5.68. Q4_0 (single scale per 32-weight block, no hierarchy) lands near 5.96 — a penalty that is visible on summarization and code. Q4_K_M lands at ~5.73 — a gap below measurement noise on most benchmarks — while shrinking the file from 13.5 GB to 4.1 GB. The delta is almost entirely the hierarchy: Q4_K_M spends its ~0.85 extra bits/weight on sub-block scales and mins (and on the higher-precision tensors in the mix) instead of wasting 4-bit codes on dead bin ranges.

The “M” suffix is a second mechanism — mixed precision. The most sensitive tensors (token embeddings, output.weight, the attention value projections) are bumped up to Q6_K while the rest stays at Q4_K; llama.cpp picks this mix per tensor with a built-in sensitivity heuristic, optionally informed by an importance matrix. And the 256-weight super-block size is not arbitrary: it is the largest power of two that keeps the sub-block scale table inside a single AVX2/NEON register load, so dequant is vectorised without spilling.
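The "almost entirely the hierarchy" claim can be reproduced in miniature. This is a synthetic sketch, not the benchmark above: quantize the same 256-weight super-block with one scale for all 256 weights (Q4_0 style) and with one scale per 32-weight sub-block, and compare reconstruction error when one sub-block carries outliers.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(0, 0.02, (8, 32))   # eight quiet 32-weight sub-blocks...
w[3] *= 50                         # ...except one with large outliers

def quant_rmse(x, axis):
    """Asymmetric 4-bit quantization along `axis`; returns reconstruction RMSE."""
    lo = x.min(axis=axis, keepdims=True)
    s = (x.max(axis=axis, keepdims=True) - lo) / 15.0
    q = np.clip(np.round((x - lo) / s), 0, 15)
    return float(np.sqrt(np.mean((q * s + lo - x) ** 2)))

flat = quant_rmse(w.reshape(1, -1), axis=1)   # one scale per 256 weights
hier = quant_rmse(w, axis=1)                  # one scale per 32-weight sub-block
print(f"flat RMSE {flat:.4f}, hierarchical RMSE {hier:.4f}")
```

With a single scale, the outlier sub-block forces a coarse step on the seven quiet sub-blocks; per-sub-block scales confine that damage to the one sub-block that actually needs the large range.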