Why K-quants aren't enough for GPU serving
Last lesson we took apart K-quants — llama.cpp's hierarchical super-block scheme, engineered around a CPU cache line and AVX2 dequant loops. Q4_K_M at ~4.85 bits per weight, virtually indistinguishable from FP16 on most benchmarks. If you're running on a laptop, that's the end of the story.
But the production inference stack — vLLM, SGLang, TensorRT-LLM, Marlin kernels — does not run on AVX2. It runs on tensor cores, and tensor cores want a very particular shape of int4 matmul: contiguous 4-bit codes with per-group scales, vectorised through shared memory, dequant fused into the mma instruction. K-quants' eight-level sub-block scales don't map cleanly onto that. You'd spend more time shuffling scale tables than multiplying.
The GPU-side quantization family was built differently. Two methods dominate: AWQ (Lin et al., MIT, 2023) and GPTQ (Frantar et al., IST Austria, 2022). Both target 4-bit weights with an FP16 scale per 128-weight group. Both run close to FP16 throughput on an H100 with the right kernel. They differ on one question: how do you choose the 4-bit value each weight rounds to?
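Whatever method chooses the values, the stored format is the same. A minimal numpy sketch of that contract: round-to-nearest quantization of one 128-weight group into int4 codes plus a single FP16 scale (function names and tolerances are mine, not from either paper):

```python
import numpy as np

def quantize_group(w, n_bits=4):
    """Round-to-nearest symmetric quantization of one weight group.
    Returns int4 codes plus the single FP16 scale the kernel dequants with."""
    qmax = 2 ** (n_bits - 1) - 1                  # 7 for int4
    scale = np.float16(np.abs(w).max() / qmax)    # one scale per group
    codes = np.round(w / np.float32(scale)).clip(-qmax - 1, qmax)
    return codes.astype(np.int8), scale

def dequantize_group(codes, scale):
    return codes.astype(np.float32) * np.float32(scale)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=128).astype(np.float32)
codes, scale = quantize_group(w)
w_hat = dequantize_group(codes, scale)
max_err = np.abs(w - w_hat).max()   # bounded by roughly half a grid step
```

Everything AWQ and GPTQ do happens before this step — they decide what `w` looks like when it hits the rounding, not how the rounding itself works.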
The salient-weight observation
Here is the unreasonable fact both methods exploit. Trained weights are not uniform. If you histogram the magnitudes of, say, a Llama-2 MLP's down-projection matrix, you see a heavy tail — a few dozen of the input channels carry an order of magnitude more signal than the median channel. The AWQ paper measures this directly: keeping just ~1% of weights at full precision (by activation-magnitude rank) and crushing the rest to INT3 recovers most of the perplexity that pure INT3 loses.
The corollary is mechanical. If you're going to spend 4 bits per weight, you should spend them unequally — coarsely on the channels that don't matter, finely on the channels that do. The two methods in this lesson are two different answers to how to do that without breaking the int4 matmul contract.
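You can reproduce the salient-weight observation on synthetic data. A sketch, assuming a toy layer with a handful of artificially heavy input channels rather than a real checkpoint (all sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_tok = 512, 512, 2048
W = rng.normal(0, 0.02, (d_out, d_in)).astype(np.float32)
X = rng.normal(0, 1.0, (d_in, n_tok)).astype(np.float32)
salient = rng.choice(d_in, size=5, replace=False)  # ~1% of input channels
X[salient] *= 10.0                                 # give them heavy activations

def rtn(w, n_bits=3):
    """Round-to-nearest INT3, one scale per output row."""
    qmax = 2 ** (n_bits - 1) - 1
    s = np.abs(w).max(axis=1, keepdims=True) / qmax
    return np.round(w / s).clip(-qmax - 1, qmax) * s

def rel_out_err(W_q):
    """Relative error of the layer output on the calibration batch."""
    y = W @ X
    return float(np.linalg.norm(W_q @ X - y) / np.linalg.norm(y))

W_int3 = rtn(W)                      # crush everything to INT3
W_mixed = rtn(W)
W_mixed[:, salient] = W[:, salient]  # keep ~1% of weights at full precision
assert rel_out_err(W_mixed) < rel_out_err(W_int3)
```

Restoring just the five heavy columns removes the error contributions that the ×10 activations were amplifying, which is the paper's ablation in miniature.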
AWQ — scale salient channels up, quantize, scale back
AWQ's move is almost embarrassingly simple once you see it. For each linear layer y = Wx, identify the input channels where the average activation magnitude |x| is large (measured on a few hundred calibration samples). Multiply those columns of W by a per-channel scale s > 1, and divide the corresponding entries of x by the same s. Algebraically identical: Wx = (W · diag(s)) · (diag(s)⁻¹x).
The paper parameterises s = s_x^α, where s_x is the average activation magnitude per channel and α ∈ [0, 1] is grid-searched per layer — so the effective s values depend on the activation distribution rather than being a fixed range. For intuition, a typical salient-channel s lands in a modest single-digit factor (the viz uses ~3× as illustration).
In FP32 this rewrite is a no-op — s cancels. The magic is in what happens to W when you then quantize it to 4 bits. The scaled-up salient columns now occupy a larger fraction of the 16-bucket grid, so their rounding error is proportionally smaller. After dequant, you divide back by s and the fine-grained structure is preserved. The inverse scale gets folded into the preceding LayerNorm or activation, costing nothing at runtime.
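The mechanics are visible with one scalar. A sketch of the scale-quantize-rescale round trip on a single small salient weight, inside a group whose range is set by one large outlier (values are contrived for illustration; s = 3 as in the viz):

```python
import numpy as np

def quant_dequant(w, n_bits=4):
    """Round-to-nearest through a symmetric int4 grid, back to float."""
    qmax = 2 ** (n_bits - 1) - 1
    step = np.abs(w).max() / qmax
    return np.round(w / step).clip(-qmax - 1, qmax) * step

rng = np.random.default_rng(1)
w = rng.normal(0, 0.02, 128).astype(np.float32)
w[7] = 0.3              # one outlier weight fixes the grid range
salient, s = 3, 3.0     # pretend channel 3 carries large activations
w[salient] = 0.02       # small weight, but on an important channel

plain = quant_dequant(w)                         # naive RTN: rounds 0.02 to 0
scaled = w.copy(); scaled[salient] *= s          # scale the salient column up
awq = quant_dequant(scaled); awq[salient] /= s   # quantize, fold 1/s back out

err_plain = abs(float(plain[salient]) - 0.02)
err_awq = abs(float(awq[salient]) - 0.02)
assert err_awq < err_plain   # the salient weight now sees a ~3x finer grid
```

The outlier at 0.3 sets the grid step; 0.02 is under half a step, so naive rounding erases it entirely, while the scaled version lands on a nonzero code and survives the round trip.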
The calibration step itself is cheap — AWQ searches over a small grid of per-channel scales (literally, a grid search per layer) to minimise the reconstruction error of the matmul output. 128 samples of a couple of hundred tokens each is enough. On Llama-2-7B the whole AWQ pass takes about 20 minutes on a single A100 and lands at a small wikitext-2 perplexity bump vs FP16.
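The whole calibration pass reduces to a loop like this. A sketch under the paper's s = s_x^α parameterisation, with a toy per-row round-to-nearest quantizer standing in for the real int4 kernel (function names are mine):

```python
import numpy as np

def quant_dequant(W, n_bits=4):
    """Per-output-row round-to-nearest int4, straight back to float."""
    qmax = 2 ** (n_bits - 1) - 1
    s = np.abs(W).max(axis=1, keepdims=True) / qmax
    return np.round(W / s).clip(-qmax - 1, qmax) * s

def awq_search(W, X, grid=np.linspace(0.0, 1.0, 21)):
    """Grid-search the exponent alpha minimising output reconstruction error."""
    s_x = np.abs(X).mean(axis=1) + 1e-8   # mean activation magnitude per channel
    y_ref = W @ X
    best_alpha, best_err = 0.0, np.inf
    for alpha in grid:
        s = s_x ** alpha                  # s = s_x^alpha, one scale per column
        W_q = quant_dequant(W * s) / s    # scale up, quantize, scale back
        err = float(np.linalg.norm(W_q @ X - y_ref))
        if err < best_err:
            best_alpha, best_err = alpha, err
    return best_alpha

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, (64, 64)).astype(np.float32)
X = rng.normal(0, 1.0, (64, 256)).astype(np.float32)
X[:4] *= 8.0                              # a few salient input channels
alpha = awq_search(W, X)
```

Note that α = 0 recovers plain round-to-nearest, so the search can never do worse than the naive baseline on the calibration batch.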
GPTQ — minimize per-row error via the inverse Hessian
GPTQ takes a different view. Forget salience rankings — just frame quantization as a reconstruction problem. For each row w of W, we want quantized weights ŵ on the 4-bit grid that minimise the squared error of the output on a calibration batch X:

$$\hat{w}^{*} = \arg\min_{\hat{w}} \lVert wX - \hat{w}X \rVert_2^2 = \arg\min_{\hat{w}} (w - \hat{w})\,H\,(w - \hat{w})^{\top}, \qquad H = XX^{\top}$$
This is a convex quadratic in ŵ — but ŵ is constrained to the 4-bit grid, which breaks convexity. GPTQ's trick is to solve the problem one column at a time, using the inverse of the calibration Hessian, H⁻¹, to propagate rounding error across the remaining unquantized columns.
Concretely: walk the columns of W left to right. At column i, round its weights to the 4-bit grid, qᵢ = quant(wᵢ). Then use row i of H⁻¹ to update every column to the right:

wⱼ ← wⱼ − (wᵢ − qᵢ) · [H⁻¹]ᵢⱼ / [H⁻¹]ᵢᵢ  for j > i

Each remaining column absorbs a small correction that compensates for the error you just committed. Then move on.
The denominator [H⁻¹]ᵢᵢ tells you how sensitive the output is to column i specifically. The numerator [H⁻¹]ᵢⱼ distributes the committed error across the right-hand columns, weighted by their covariance with column i. Columns that co-vary strongly with column i on the calibration set pick up more of the correction.
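Put together, the loop is short. A simplified numpy sketch of the column walk (no lazy batching, no Cholesky reformulation, just a damped Hessian inverse; details the production implementations handle more carefully):

```python
import numpy as np

def gptq_quantize(W, X, n_bits=4, damp=0.01):
    """Quantize rows of W to a symmetric int4 grid, compensating each
    column's rounding error through the inverse calibration Hessian."""
    qmax = 2 ** (n_bits - 1) - 1
    H = X @ X.T
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])  # keep H invertible
    Hinv = np.linalg.inv(H)
    W = W.astype(np.float64).copy()
    Q = np.zeros_like(W)
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax   # per-row grid
    for i in range(W.shape[1]):                           # walk columns left→right
        q = np.round(W[:, i:i+1] / scale).clip(-qmax - 1, qmax) * scale
        Q[:, i:i+1] = q
        err = (W[:, i:i+1] - q) / Hinv[i, i]              # committed error
        W[:, i+1:] -= err * Hinv[i, i+1:]                 # right side absorbs it
    return Q

rng = np.random.default_rng(0)
W = rng.normal(0, 0.02, (32, 64))
X = rng.normal(0, 1.0, (64, 64)) @ rng.normal(0, 1.0, (64, 512))  # correlated
Q = gptq_quantize(W, X)
```

The in-place update of `W[:, i+1:]` is the whole algorithm: every column after i is nudged before it gets its own turn at the grid.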
The character of the result: GPTQ is exact in a local sense. Once you've committed the first i columns, the update rule gives you the globally optimal unquantized remainder conditional on those commits. It's just that the left-most columns were committed without the benefit of seeing what was coming — so their error has to be absorbed somewhere, and that somewhere is the right end of the matrix. Wikitext-2 PPL bump on Llama-2-7B with a 512-sample calibration: roughly matching AWQ's in the default regime.
Calibration set size — where the two methods diverge
In the default regime both methods land within a tenth of a perplexity point of each other. The difference shows up at the edges. Try the calibration slider in the viz: with 128 samples, AWQ stays near its default-regime perplexity, but GPTQ's balloons. Why?
AWQ needs to know which channels have large |x| on average. Averages converge fast — 128 calibration samples already give you the top-1% ranking to within a few permutations, and those permutations don't matter because they're all in the tail.
Once the ranking is stable, the per-channel scale s is chosen by a grid search minimising a single reconstruction error per layer. Robust.
GPTQ needs the full inverse Hessian H⁻¹ = (XXᵀ)⁻¹ of the calibration activations. With H a d × d matrix for d in the thousands and only 128 samples, H is effectively rank-deficient — the inverse is regularised, poorly conditioned, and overfit to the sample slice.
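You can watch the conditioning collapse directly. A sketch with Gaussian stand-in activations and d = 1024 (token counts and damping factor are illustrative, not from either paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1024

def hessian_cond(n_tokens, damp=0.0):
    """Condition number of H = X X^T for a given calibration size."""
    X = rng.normal(0, 1.0, (d, n_tokens))
    H = X @ X.T
    if damp:
        H += damp * np.mean(np.diag(H)) * np.eye(d)  # GPTQ-style damping
    return float(np.linalg.cond(H))

few = hessian_cond(256)                # fewer tokens than dims: H is singular
many = hessian_cond(8192)              # tokens >> dims: well-conditioned
damped = hessian_cond(256, damp=0.01)  # damping buys back invertibility
```

This is why real GPTQ implementations expose a damping knob: the regulariser trades a little bias in the error propagation for an inverse that exists at all.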
The corollary cuts both ways: GPTQ's sensitivity to calibration is also its strength. Calibrate on code, and GPTQ will slightly beat AWQ on code — because it fit a second-order model of exactly that domain.
This is the practical decision. If your deployment target is open-ended (chat, mixed-domain), AWQ is the safer pick — it degrades gracefully and doesn't care much what you calibrate on. If you're quantizing a specialist model (a code assistant, a medical reasoner) and you have a few thousand aligned samples, GPTQ can land a point or two of perplexity below AWQ on the target domain.
Which does vLLM / SGLang serve?
Both. The HuggingFace Hub has tens of thousands of checkpoints in each format, and every serious inference runtime supports both. The split-by-preference, circa 2026:
- AWQ dominates broad-deployment serving — the robust default when you don't know the workload. On Ampere GPUs, vLLM's Marlin kernel delivers roughly 4× speedup over FP16 at small batches (≤16–32 tokens), where decode is memory-bound; the speedup tapers off at large batches where the workload becomes compute-bound.
- GPTQ stays common where calibration-set alignment is easy — enterprise fine-tunes, domain-specialist models, and the fastest open-weight code models. TensorRT-LLM ships both and lets you pick per deployment.
- Kernel-level: Marlin (Ampere-era, int4 matmul for tensor cores) and Machete (Hopper-era, same mixed-input GEMM role on H100 where Marlin's layout underperforms) both accept AWQ and GPTQ checkpoints interchangeably — by the time the weights are on the GPU they're just a 4-bit code with per-group scales. The difference is in how the scales were chosen, not in the serving kernel.
The useful summary: AWQ and GPTQ are not rivals so much as two different answers to the same question — “where does int4 quantization's error go?” AWQ pushes it uniformly across non-salient channels by protecting the salient ones. GPTQ pushes it rightward along the matrix by walking columns and compensating. Either answer is better than naive round-to-nearest. Neither is the right answer for every workload. Pick based on calibration data, not perplexity benchmarks on the canonical wikitext slice.