Why K-quants aren't enough for GPU serving
Last lesson we took apart K-quants — llama.cpp's hierarchical super-block scheme, engineered around a CPU cache line and AVX2 dequant loops. Q4_K_M at ~4.85 bits per weight, virtually indistinguishable from FP16 on most benchmarks. If you're running on a laptop, that's the end of the story.
But the production inference stack — vLLM, SGLang, TensorRT-LLM, Marlin kernels — does not run on AVX2. It runs on tensor cores, and tensor cores want a very particular shape of int4 matmul: contiguous 4-bit codes with per-group scales, vectorised through shared memory, dequant fused into the mma instruction. K-quants' eight-level sub-block scales don't map cleanly onto that. You'd spend more time shuffling scale tables than multiplying.
The GPU-side quantization family was built differently. Two methods dominate: AWQ (Lin et al., MIT, 2023) and GPTQ (Frantar et al., IST Austria, 2022). Both target 4-bit weights with an FP16 scale per 128-weight group. Both run close to FP16 throughput on an H100 with the right kernel. They differ on one question: how do you choose the 4-bit value each weight rounds to?
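Whatever method chooses the values, the stored format is the same. A minimal numpy sketch of that contract: round-to-nearest quantization of one 128-weight group into int4 codes plus a single FP16 scale (function names and tolerances are mine, not from either paper):

```python
import numpy as np

def quantize_group(w, n_bits=4):
    """Round-to-nearest symmetric quantization of one weight group.
    Returns int4 codes plus the single FP16 scale the kernel dequants with."""
    qmax = 2 ** (n_bits - 1) - 1                  # 7 for int4
    scale = np.float16(np.abs(w).max() / qmax)    # one scale per group
    codes = np.round(w / np.float32(scale)).clip(-qmax - 1, qmax)
    return codes.astype(np.int8), scale

def dequantize_group(codes, scale):
    return codes.astype(np.float32) * np.float32(scale)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=128).astype(np.float32)
codes, scale = quantize_group(w)
w_hat = dequantize_group(codes, scale)
max_err = np.abs(w - w_hat).max()   # bounded by roughly half a grid step
```

Everything AWQ and GPTQ do happens before this step — they decide what `w` looks like when it hits the rounding, not how the rounding itself works.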
The salient-weight observation
Here is the unreasonable fact both methods exploit. Trained weights are not uniform. If you histogram the magnitudes of, say, a Llama-2 MLP's down-projection matrix, you see a heavy tail — a few dozen of the input channels carry an order of magnitude more signal than the median channel. The AWQ paper measures this directly: keeping just ~1% of weights at full precision (by activation-magnitude rank) and crushing the rest to INT3 recovers most of the perplexity that pure INT3 loses.
The corollary is mechanical. If you're going to spend 4 bits per weight, you should spend them unequally — coarsely on the channels that don't matter, finely on the channels that do. The two methods in this lesson are two different answers to how to do that without breaking the int4 matmul contract.
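You can reproduce the salient-weight observation on synthetic data. A sketch, assuming a toy layer with a handful of artificially heavy input channels rather than a real checkpoint (all sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_tok = 512, 512, 2048
W = rng.normal(0, 0.02, (d_out, d_in)).astype(np.float32)
X = rng.normal(0, 1.0, (d_in, n_tok)).astype(np.float32)
salient = rng.choice(d_in, size=5, replace=False)  # ~1% of input channels
X[salient] *= 10.0                                 # give them heavy activations

def rtn(w, n_bits=3):
    """Round-to-nearest INT3, one scale per output row."""
    qmax = 2 ** (n_bits - 1) - 1
    s = np.abs(w).max(axis=1, keepdims=True) / qmax
    return np.round(w / s).clip(-qmax - 1, qmax) * s

def rel_out_err(W_q):
    """Relative error of the layer output on the calibration batch."""
    y = W @ X
    return float(np.linalg.norm(W_q @ X - y) / np.linalg.norm(y))

W_int3 = rtn(W)                      # crush everything to INT3
W_mixed = rtn(W)
W_mixed[:, salient] = W[:, salient]  # keep ~1% of weights at full precision
assert rel_out_err(W_mixed) < rel_out_err(W_int3)
```

Restoring just the five heavy columns removes the error contributions that the ×10 activations were amplifying, which is the paper's ablation in miniature.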
AWQ — scale salient channels up, quantize, scale back
AWQ's move is almost embarrassingly simple once you see it. For each linear layer y = Wx, identify the input channels where the average activation magnitude |x| is large (measured on a few hundred calibration samples). Multiply those columns of W by a per-channel scale s > 1, and divide the corresponding entries of x by the same s. Algebraically identical: Wx = (W · diag(s)) · (diag(s)⁻¹x).
The paper parameterises s = s_x^α, where s_x is the average activation magnitude per channel and α ∈ [0, 1] is grid-searched per layer — so the effective s values depend on the activation distribution rather than being a fixed range. For intuition, a typical salient-channel s lands in a modest single-digit factor (the viz uses ~3× as illustration).
In FP32 this rewrite is a no-op — s cancels. The magic is in what happens to W when you then quantize it to 4 bits. The scaled-up salient columns now occupy a larger fraction of the 16-bucket grid, so their rounding error is proportionally smaller. After dequant, you divide back by s and the fine-grained structure is preserved. The inverse scale gets folded into the preceding LayerNorm or activation, costing nothing at runtime.
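The mechanics are visible with one scalar. A sketch of the scale-quantize-rescale round trip on a single small salient weight, inside a group whose range is set by one large outlier (values are contrived for illustration; s = 3 as in the viz):

```python
import numpy as np

def quant_dequant(w, n_bits=4):
    """Round-to-nearest through a symmetric int4 grid, back to float."""
    qmax = 2 ** (n_bits - 1) - 1
    step = np.abs(w).max() / qmax
    return np.round(w / step).clip(-qmax - 1, qmax) * step

rng = np.random.default_rng(1)
w = rng.normal(0, 0.02, 128).astype(np.float32)
w[7] = 0.3              # one outlier weight fixes the grid range
salient, s = 3, 3.0     # pretend channel 3 carries large activations
w[salient] = 0.02       # small weight, but on an important channel

plain = quant_dequant(w)                         # naive RTN: rounds 0.02 to 0
scaled = w.copy(); scaled[salient] *= s          # scale the salient column up
awq = quant_dequant(scaled); awq[salient] /= s   # quantize, fold 1/s back out

err_plain = abs(float(plain[salient]) - 0.02)
err_awq = abs(float(awq[salient]) - 0.02)
assert err_awq < err_plain   # the salient weight now sees a ~3x finer grid
```

The outlier at 0.3 sets the grid step; 0.02 is under half a step, so naive rounding erases it entirely, while the scaled version lands on a nonzero code and survives the round trip.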
The calibration step itself is cheap — AWQ searches over a small grid of per-channel scales (literally, a grid search per layer) to minimise the reconstruction error of the matmul output. 128 samples of a couple of hundred tokens each is enough. On Llama-2-7B the whole AWQ pass takes about 20 minutes on a single A100 and lands at a small wikitext-2 perplexity bump vs FP16.
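The whole calibration pass reduces to a loop like this. A sketch under the paper's s = s_x^α parameterisation, with a toy per-row round-to-nearest quantizer standing in for the real int4 kernel (function names are mine):

```python
import numpy as np

def quant_dequant(W, n_bits=4):
    """Per-output-row round-to-nearest int4, straight back to float."""
    qmax = 2 ** (n_bits - 1) - 1
    s = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.round(W / s).clip(-qmax - 1, qmax) * s

def awq_search(W, X, grid=np.linspace(0.0, 1.0, 21)):
    """Grid-search the exponent alpha minimising output reconstruction error."""
    s_x = np.abs(X).mean(axis=1) + 1e-8   # mean activation magnitude per channel
    y_ref = W @ X
    best_alpha, best_err = 0.0, np.inf
    for alpha in grid:
        s = s_x ** alpha                  # s = s_x^alpha, one scale per column
        W_q = quant_dequant(W * s) / s    # scale up, quantize, scale back
        err = float(np.linalg.norm(W_q @ X - y_ref))
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, (64, 64)).astype(np.float32)
X = rng.normal(0, 1.0, (64, 256)).astype(np.float32)
X[:4] *= 8.0                              # a few salient input channels
alpha = awq_search(W, X)
```

Note that α = 0 recovers plain round-to-nearest, so the search can never do worse than the naive baseline on the calibration batch.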
GPTQ — minimize per-row error via the inverse Hessian
GPTQ takes a different view. Forget salience rankings — just frame quantization as a reconstruction problem. For each row w of W, we want quantized weights ŵ on the 4-bit grid that minimise the squared error of the output on a calibration batch X:

$$\hat{w}^{*} = \arg\min_{\hat{w}} \lVert wX - \hat{w}X \rVert_2^2 = \arg\min_{\hat{w}} (w - \hat{w})\,H\,(w - \hat{w})^{\top}, \qquad H = XX^{\top}$$
This is a convex quadratic in ŵ — but ŵ is constrained to the 4-bit grid, which breaks convexity. GPTQ's trick is to solve the problem one column at a time, using the inverse of the calibration Hessian, H⁻¹, to propagate rounding error across the remaining unquantized columns.
Concretely: walk the columns of W left to right. At column i, round its weights to the 4-bit grid, qᵢ = quant(wᵢ). Then use row i of H⁻¹ to update every column to the right:

wⱼ ← wⱼ − (wᵢ − qᵢ) · [H⁻¹]ᵢⱼ / [H⁻¹]ᵢᵢ  for j > i

Each remaining column absorbs a small correction that compensates for the error you just committed. Then move on.
The denominator [H⁻¹]ᵢᵢ tells you how sensitive the output is to column i specifically. The numerator [H⁻¹]ᵢⱼ distributes the committed error across the right-hand columns, weighted by their covariance with column i. Columns that co-vary strongly with column i on the calibration set pick up more of the correction.
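Put together, the loop is short. A simplified numpy sketch of the column walk (no lazy batching, no Cholesky reformulation, just a damped Hessian inverse; details the production implementations handle more carefully):

```python
import numpy as np

def gptq_quantize(W, X, n_bits=4, damp=0.01):
    """Quantize rows of W to a symmetric int4 grid, compensating each
    column's rounding error through the inverse calibration Hessian."""
    qmax = 2 ** (n_bits - 1) - 1
    H = X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])  # keep H invertible
    Hinv = np.linalg.inv(H)
    W = W.astype(np.float64).copy()
    Q = np.zeros_like(W)
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax   # per-row grid
    for i in range(W.shape[1]):                           # walk columns left→right
        q = np.round(W[:, i:i+1] / scale).clip(-qmax - 1, qmax) * scale
        Q[:, i:i+1] = q
        err = (W[:, i:i+1] - q) / Hinv[i, i]              # committed error
        W[:, i+1:] -= err * Hinv[i, i+1:]                 # right side absorbs it
    return Q

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, (32, 64))
X = rng.normal(0, 1.0, (64, 64)) @ rng.normal(0, 1.0, (64, 512))  # correlated
Q = gptq_quantize(W, X)
```

The in-place update of `W[:, i+1:]` is the whole algorithm: every column after i is nudged before it gets its own turn at the grid.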
The character of the result: GPTQ is exact in a local sense. Once you've committed the first i columns, the update rule gives you the globally optimal unquantized remainder conditional on those commits. It's just that the left-most columns were committed without the benefit of seeing what was coming — so their error has to be absorbed somewhere, and that somewhere is the right end of the matrix. Wikitext-2 PPL bump on Llama-2-7B with a 512-sample calibration: roughly matching AWQ's in the default regime.
Calibration set size — where the two methods diverge
In the default regime both methods land within a tenth of a perplexity point of each other. The difference shows up at the edges. Try the calibration slider in the viz: with 128 samples, AWQ stays near its default-regime perplexity, but GPTQ's balloons. Why?
AWQ needs to know which channels have large |x| on average. Averages converge fast — 128 calibration samples already give you the top-1% ranking to within a few permutations, and those permutations don't matter because they're all in the tail.
Once the ranking is stable, the per-channel scale s is chosen by a grid search minimising a single reconstruction error per layer. Robust.
GPTQ needs the full inverse Hessian H⁻¹ = (XXᵀ)⁻¹ of the calibration activations. With H a d × d matrix for d in the thousands and only 128 samples, H is effectively rank-deficient — the inverse is regularised, poorly conditioned, and overfit to the sample slice.
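You can watch the conditioning collapse directly. A sketch with Gaussian stand-in activations and d = 1024 (token counts and damping factor are illustrative, not from either paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024

def hessian_cond(n_tokens, damp=0.0):
    """Condition number of H = X X^T for a given calibration size."""
    X = rng.normal(0, 1.0, (d, n_tokens))
    H = X @ X.T
    if damp:
        H += damp * np.mean(np.diag(H)) * np.eye(d)  # GPTQ-style damping
    return float(np.linalg.cond(H))

few = hessian_cond(256)                # fewer tokens than dims: H is singular
many = hessian_cond(8192)              # tokens >> dims: well-conditioned
damped = hessian_cond(256, damp=0.01)  # damping buys back invertibility
```

This is why real GPTQ implementations expose a damping knob: the regulariser trades a little bias in the error propagation for an inverse that exists at all.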
The corollary cuts both ways: GPTQ's sensitivity to calibration is also its strength. Calibrate on code, and GPTQ will slightly beat AWQ on code — because it fit a second-order model of exactly that domain.
This is the practical decision. If your deployment target is open-ended (chat, mixed-domain), AWQ is the safer pick — it degrades gracefully and doesn't care much what you calibrate on. If you're quantizing a specialist model (a code assistant, a medical reasoner) and you have a few thousand aligned samples, GPTQ can land a point or two of perplexity below AWQ on the target domain.
Which does vLLM / SGLang serve?
Both. The HuggingFace Hub has tens of thousands of checkpoints in each format, and every serious inference runtime supports both. The split-by-preference, circa 2026:
- AWQ dominates broad-deployment serving — the robust default when you don't know the workload. On Ampere GPUs, vLLM's Marlin kernel delivers roughly 4× speedup over FP16 at small batches (≤16–32 tokens), where decode is memory-bound; the speedup tapers off at large batches where the workload becomes compute-bound.
- GPTQ stays common where calibration-set alignment is easy — enterprise fine-tunes, domain-specialist models, and the fastest open-weight code models. TensorRT-LLM ships both and lets you pick per deployment.
- Kernel-level: Marlin (Ampere-era, int4 matmul for tensor cores) and Machete (Hopper-era, same mixed-input GEMM role on H100 where Marlin's layout underperforms) both accept AWQ and GPTQ checkpoints interchangeably — by the time the weights are on the GPU they're just a 4-bit code with per-group scales. The difference is in how the scales were chosen, not in the serving kernel.
The useful summary: AWQ and GPTQ are not rivals so much as two different answers to the same question — “where does int4 quantization's error go?” AWQ pushes it uniformly across non-salient channels by protecting the salient ones. GPTQ pushes it rightward along the matrix by walking columns and compensating. Either answer is better than naive round-to-nearest. Neither is the right answer for every workload. Pick based on calibration data, not perplexity benchmarks on the canonical wikitext slice.