LoRA solves params but not memory
LoRA gets you down to tens of millions of trainable parameters, but the base weights still live in VRAM at full precision. A 7B model in FP16 is 14 GB just for weights; add optimizer state, gradients, and activations, and you're past a 24 GB consumer GPU before training starts.
QLoRA (Dettmers et al. 2023) finishes the job. Quantize the base weights to 4 bits, keep the LoRA adapter in BF16, dequantize on-the-fly during the forward pass. Memory for the base drops 4×; you can fine-tune a 70B model on a single 48 GB GPU or a 7B on a 12 GB consumer card.
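The memory claim is easy to sanity-check with back-of-envelope arithmetic. This sketch counts base weights only (real usage adds adapter weights, optimizer state, gradients, activations, and framework overhead); the helper function is illustrative, not part of any library:

```python
def base_weight_gb(n_params: float, bits_per_weight: float) -> float:
    """Memory for the base weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

fp16_7b = base_weight_gb(7e9, 16)   # ~14 GB: already tight on a 24 GB card
nf4_7b = base_weight_gb(7e9, 4)     # ~3.5 GB: room left for adapter + states
nf4_70b = base_weight_gb(70e9, 4)   # ~35 GB: fits on a 48 GB GPU

print(f"7B FP16: {fp16_7b:.1f} GB, 7B NF4: {nf4_7b:.1f} GB, 70B NF4: {nf4_70b:.1f} GB")
```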
Three tricks that make it work
- 4-bit NormalFloat (NF4) — use a quantile-based 4-bit grid rather than a uniform one.
- Double quantization — quantize the quantization constants themselves, saving ~0.4 bits/weight.
- Paged optimizers — let AdamW states spill to CPU unified memory when VRAM pressure spikes.
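The ~0.4 bits/weight figure for double quantization follows from the block sizes in the QLoRA paper (64 weights per quantization block, 256 first-level constants per second-level block); a quick sketch of the arithmetic:

```python
BLOCK = 64    # weights per first-level quantization block
BLOCK2 = 256  # first-level constants per second-level block

# Naive scheme: one FP32 absmax constant per 64-weight block.
naive = 32 / BLOCK  # 0.5 extra bits per weight

# Double quantization: constants stored in 8 bits, plus one FP32
# constant per block of 256 first-level constants.
double = 8 / BLOCK + 32 / (BLOCK * BLOCK2)  # ~0.127 bits per weight

print(f"savings: {naive - double:.3f} bits/weight")  # ~0.373, i.e. the ~0.4 figure
```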
The NF4 grid is the most interesting piece — let's look at it.
Why the NF4 grid is information-theoretically optimal for normal data
Trained neural network weights are approximately normally distributed around zero. Suppose you have only 16 possible values — 4 bits — to represent every weight. Where should those 16 values sit?
Information-theoretically, you want each bin to hold equal probability mass. If one bin covers 40% of the weight distribution and another covers 1%, the 40% bin is wasting coding budget on a narrow range and the 1% bin is wasting it on empty territory. Each of the 16 bins should cover equal mass: 1/16 of the total.
Equal probability mass means the bins should be denser where the distribution is peaked (near zero) and sparser in the tails. That's exactly what the NF4 grid does — its 16 values sit at the quantiles of a unit normal. Each bin holds about 1/16 of the total probability mass; no bins are wasted in empty tails.
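The equal-mass construction can be sketched with nothing but the standard library. This is a simplified version of the idea — the actual NF4 codebook is asymmetric and tweaked so that zero is represented exactly, which this sketch does not reproduce:

```python
from statistics import NormalDist

N = NormalDist()  # unit normal

# 16 code values at the probability midpoints of 16 equal-mass bins:
# value k sits at the quantile (k + 0.5) / 16.
grid = [N.inv_cdf((k + 0.5) / 16) for k in range(16)]

# Bin edges at quantiles 1/16, 2/16, ..., 15/16.
edges = [N.inv_cdf(k / 16) for k in range(1, 16)]

# Check the key property: every bin holds exactly 1/16 of the mass.
cdfs = [0.0] + [N.cdf(e) for e in edges] + [1.0]
masses = [b - a for a, b in zip(cdfs, cdfs[1:])]

print([round(v, 3) for v in grid])               # dense near 0, sparse in tails
print(max(abs(m - 1 / 16) for m in masses))      # ~0 by construction
```

Note how the spacing between adjacent grid points shrinks toward zero and stretches in the tails — the geometric signature of the equal-mass criterion.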