Quantization Changes Both Math and Data Movement
Model weights are typically trained in high precision (FP32 or BF16), where each value takes 4 or 2 bytes respectively. Quantization converts weights to a lower-precision representation that uses fewer bits per value. The motivations are practical: a 7B-parameter model in FP16 is ~14 GB; at 4 bits per weight it is ~3.5 GB. Smaller models fit in less memory and move through the memory bus faster.
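The size figures above follow from simple arithmetic. As a rough sketch (the helper name `model_size_gb` is illustrative, and the calculation ignores per-block scales, metadata, and any tensors kept in higher precision):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7e9  # a 7B-parameter model

for name, bits in [("FP32", 32), ("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{name:>5}: {model_size_gb(n_params, bits):.1f} GB")
```

Real quantized files come out somewhat larger than the 4-bit and 8-bit lines suggest, because each block also stores a scale factor.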
But quantization is not free compression. It changes two things at once: data layout and arithmetic. The weights are stored in a different format (e.g., groups of values sharing a scale factor), and the kernels that multiply these weights must understand this format: they either dequantize on the fly or use specialized low-precision integer arithmetic. Loading int8 values and dequantizing them to FP16 at computation time is often faster than loading FP16 weights from memory, because the bottleneck during decode is memory bandwidth, not compute (as you learned in L28).
Different quantization schemes trade off size, speed, and quality. Q4_0 uses 4 bits per value (~4× smaller than FP16 before counting the per-block scale, with more quality loss). Q8_0 uses 8 bits per value (~2× smaller, less quality loss). Q4_K_M uses a smarter blocking scheme that preserves more quality at roughly 4-bit precision.
As an example, Q8_0 stores each block of 32 weights as 32 int8 values plus one FP16 scale factor. To reconstruct a weight, multiply its int8 value by the block's scale. This is fast to dequantize and preserves most of the original precision, but it saves just under 50% compared to FP16.
Q8_0 encoding of a block of 4 weights (real blocks are 32):
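A minimal sketch of this encoding, assuming the common absmax scheme (scale = largest magnitude in the block divided by 127) and ignoring the rounding of the scale itself to FP16. The function names and the sample weights are illustrative and are not taken from llama.cpp:

```python
def quantize_q8_0(block):
    """Quantize one block of floats to (scale, list of int8 values)."""
    amax = max(abs(w) for w in block)
    # Assumption: absmax scaling, so the largest weight maps to +/-127.
    scale = amax / 127 if amax > 0 else 1.0
    q = [round(w / scale) for w in block]
    return scale, q


def dequantize_q8_0(scale, q):
    """Reconstruct approximate floats: one multiply per weight."""
    return [scale * v for v in q]


weights = [0.512, -1.27, 0.034, 0.887]  # toy block of 4 (real blocks hold 32)
scale, q = quantize_q8_0(weights)
restored = dequantize_q8_0(scale, q)

print(f"scale = {scale:.4f}")
for w, v, r in zip(weights, q, restored):
    print(f"original {w:+.4f} -> int8 {v:+4d} -> restored {r:+.4f} (error {r - w:+.4f})")
```

With this scheme, the rounding error of each weight is at most half the scale, which is why a block with one unusually large weight loses more precision on its small weights.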
Values are close but not identical to the originals. Lower-bit formats (Q4) lose more precision.
For a realistic block of 32 weights: 1 FP16 scale (2 bytes) + 32 int8 values (32 bytes) = 34 bytes. Compared to 64 bytes in FP16, this is a 47% savings. At Q4_0 (4-bit), 32 weights fit in 2 + 16 = 18 bytes — a 72% savings.
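The byte counts above can be checked with Python's `struct` module, which reports the packed size of a layout. This is only a sketch of the sizes: the variable names are illustrative, and the Q4_0 layout is assumed to be one FP16 scale followed by 16 bytes holding two 4-bit values each.

```python
import struct

QK = 32  # weights per block

# "<" means little-endian with no padding; "e" is a 2-byte half float.
q8_0_block = struct.Struct(f"<e{QK}b")       # FP16 scale + 32 signed bytes
q4_0_block = struct.Struct(f"<e{QK // 2}B")  # FP16 scale + 16 bytes of packed nibbles

fp16_bytes = QK * 2  # the same 32 weights stored as plain FP16

for name, layout in [("Q8_0", q8_0_block), ("Q4_0", q4_0_block)]:
    bits_per_weight = layout.size * 8 / QK
    savings = 1 - layout.size / fp16_bytes
    print(f"{name}: {layout.size} bytes/block, "
          f"{bits_per_weight:.1f} bits/weight, {savings:.0%} smaller than FP16")
```

The effective cost is 8.5 and 4.5 bits per weight rather than 8 and 4, because the scale is amortized over the block.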
Recall from L28 and L33 that decode is memory-bound: the bottleneck is loading weight data, not computing with it. Quantization directly attacks this bottleneck by reducing the bytes per weight:
- FP16 → Q8_0: Weights are ~2× smaller. Memory traffic for loading weights drops ~2×. Decode speed improves roughly proportionally.
- FP16 → Q4_0: Weights are ~3.6× smaller once the per-block scale is counted (18 vs. 64 bytes per block). Decode speed can improve by up to about that factor, minus some overhead for dequantization arithmetic.
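To make the proportionality concrete, here is a back-of-the-envelope estimate. It assumes decode is purely bandwidth-bound and that every generated token streams all weights from memory exactly once. The 100 GB/s bandwidth figure and the function name are illustrative assumptions, not measurements of any particular machine.

```python
def decode_tokens_per_sec(model_bytes: float, bandwidth_bytes_per_sec: float) -> float:
    """Upper bound on decode speed when weight loading is the only cost."""
    return bandwidth_bytes_per_sec / model_bytes


N_PARAMS = 7e9     # 7B-parameter model
BANDWIDTH = 100e9  # assumed memory bandwidth: 100 GB/s (illustrative only)

# Bytes per weight, including the FP16 scale shared by each 32-weight block.
bytes_per_weight = {"FP16": 2.0, "Q8_0": 34 / 32, "Q4_0": 18 / 32}

tokens_per_sec = {
    name: decode_tokens_per_sec(N_PARAMS * bpw, BANDWIDTH)
    for name, bpw in bytes_per_weight.items()
}

for name, tps in tokens_per_sec.items():
    speedup = tps / tokens_per_sec["FP16"]
    print(f"{name}: at most {tps:.1f} tokens/s ({speedup:.2f}x vs FP16)")
```

Measured speedups are usually lower than this bound, since dequantization, KV-cache reads, and other non-weight traffic also take time.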
Prefill benefits less because it is compute-bound — the weights are reused across many token rows, so loading them is not the bottleneck. Quantization still helps prefill by reducing cache pressure and allowing larger models to fit in memory, but the speedup is smaller than for decode.
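One way to see the difference is to count arithmetic per byte of weight data loaded. The sketch below assumes each weight contributes one multiply and one add per token row, and it ignores activation and KV-cache traffic. The function name and the 512-token prompt length are illustrative.

```python
def flops_per_weight_byte(n_tokens: int, bytes_per_weight: float) -> float:
    """FLOPs performed per byte of weights loaded for a batch of token rows."""
    # Each weight is loaded once and used for 2 FLOPs (multiply + add) per row.
    return 2 * n_tokens / bytes_per_weight


for phase, n_tokens in [("decode", 1), ("prefill", 512)]:
    for name, bpw in [("FP16", 2.0), ("Q4_0", 18 / 32)]:
        intensity = flops_per_weight_byte(n_tokens, bpw)
        print(f"{phase:>7} ({n_tokens:>3} rows), {name}: {intensity:8.1f} FLOPs per weight byte")
```

In decode, so little work is done per byte that the memory bus sets the pace. In prefill, each loaded byte already feeds hundreds of operations, so shrinking the weights further does not remove the main cost.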
This asymmetry is why many practitioners quantize aggressively: the decode phase (which determines token generation speed that users directly experience) benefits the most.
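In llama.cpp, the quantization block formats are defined in ggml/src/ggml-common.h. Each block format (Q4_0, Q4_K, Q8_0, etc.) has a block structure and corresponding dequantize and dot-product kernels; named mixes such as Q4_K_M select among these block formats per tensor. Model conversion uses convert_hf_to_gguf.py to produce .gguf files, and lower-bit variants such as Q4_K_M are typically produced in a second step with the llama-quantize tool.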
Quantization improves memory-bound operations (like decode GEMV) significantly because it reduces bytes moved through the memory bus. But it also changes the arithmetic: dequantization has a cost, and specialized integer dot-product instructions may be faster or slower than floating-point equivalents depending on the ISA. The speed gain is real but the output quality changes — always validate after quantizing.
Why can quantization affect both speed and output quality?
A Q8_0 block has scale = 0.02 and stores the int8 value 64 for one weight. What is the dequantized floating-point value? And if the original weight was 1.30, what is the quantization error?