L35

Quantization Changes Both Math and Data Movement

16 min

Question

What does quantization actually change?

Intuition

Model weights are typically trained in high precision (FP32 or BF16), where each value takes 4 or 2 bytes respectively. Quantization converts weights to a lower-precision representation that uses fewer bits per value. The motivation is practical: a 7B-parameter model in FP16 is ~14 GB; at 4 bits per weight it is ~3.5 GB, before counting per-block scale factors. Smaller models fit in less memory and move through the memory bus faster.

But quantization is not free compression. It changes two things at once: data layout and arithmetic. The weights are stored in a different format (e.g., groups of values sharing a scale factor), and the kernels that multiply these weights must understand that format: they either dequantize on the fly or use specialized low-precision integer arithmetic. Loading int8 weights and dequantizing them to FP16 at computation time is often faster than loading already-dequantized FP16 weights from memory, because the bottleneck during decode is memory bandwidth, not compute (as you learned in L28).

Different quantization schemes trade off size, speed, and quality: Q4_0 uses 4 bits per value (~3.6× smaller than FP16 once the per-block scales are counted, more quality loss), Q8_0 uses 8 bits (~1.9× smaller, less quality loss), and Q4_K_M uses a more elaborate blocking scheme that preserves more quality at 4-bit precision.

As an example, Q8_0 stores each block of 32 weights as 32 int8 values plus one FP16 scale factor. To reconstruct a weight, multiply its int8 value by the block's scale. This is fast to dequantize and preserves most of the original precision, but it saves only ~47% compared to FP16.

Toy Example

Q8_0 encoding of a block of 4 weights (real blocks are 32):

Original FP16: [0.12, -0.47, 0.83, -0.05]

Max absolute value: 0.83

Scale: 0.83 / 127 = 0.00654

Quantized int8: [18, -72, 127, -8] (round(value / scale))

Stored: scale(FP16) + 4 × int8 = 2 + 4 = 6 bytes (vs 8 bytes in FP16)

Dequantized: [0.118, -0.471, 0.830, -0.052] (int8 × scale)

Values are close but not identical to the originals. Lower-bit formats (Q4) lose more precision.
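The toy block above can be reproduced with a few lines of Python. This is a minimal sketch of symmetric per-block quantization, not the llama.cpp implementation: the function names are made up for illustration, and the scale is kept as a Python float rather than being rounded to FP16.

```python
def quantize_block(values, max_int=127):
    """Symmetric per-block quantization: one shared scale, rounded integers."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0.0, [0] * len(values)  # all-zero block: nothing to scale
    scale = max_abs / max_int
    quants = [round(v / scale) for v in values]
    return scale, quants


def dequantize_block(scale, quants):
    """Reconstruct approximate weights from the stored scale and integers."""
    return [q * scale for q in quants]


weights = [0.12, -0.47, 0.83, -0.05]
scale, quants = quantize_block(weights)
restored = dequantize_block(scale, quants)

print(quants)                            # [18, -72, 127, -8]
print([round(w, 3) for w in restored])   # [0.118, -0.471, 0.83, -0.052]
```

The largest-magnitude weight always maps to ±127, so a single outlier in a block stretches the scale and costs precision for every other value in that block.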

For a realistic block of 32 weights: 1 FP16 scale (2 bytes) + 32 int8 values (32 bytes) = 34 bytes. Compared to 64 bytes in FP16, this is a 47% savings. At Q4_0 (4-bit), 32 weights fit in 2 + 16 = 18 bytes — a 72% savings.
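The 16-byte figure for 4-bit values comes from storing two values per byte. The sketch below shows one simple way to do that; the pairing order (even index in the low nibble, odd index in the high nibble) and the function names are assumptions for illustration, and real formats may order the nibbles differently.

```python
def pack_nibbles(values):
    """Pack unsigned 4-bit values (0..15) two per byte, low nibble first."""
    if len(values) % 2 != 0:
        raise ValueError("need an even number of 4-bit values")
    if any(not 0 <= v <= 15 for v in values):
        raise ValueError("each value must fit in 4 bits")
    return bytes(lo | (hi << 4) for lo, hi in zip(values[0::2], values[1::2]))


def unpack_nibbles(packed):
    """Inverse of pack_nibbles: split each byte back into two 4-bit values."""
    values = []
    for byte in packed:
        values.append(byte & 0x0F)
        values.append(byte >> 4)
    return values


codes = [i % 16 for i in range(32)]   # 32 hypothetical 4-bit codes
packed = pack_nibbles(codes)
block_bytes = 2 + len(packed)         # 2-byte FP16 scale + packed values

print(len(packed), block_bytes)       # 16 18
```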

Shapes

Weight matrix shape is unchanged: [d_in, d_out]

Storage format changes: contiguous FP16 → blocks of (scale + quantized values)

Q8_0: 32 int8 values + 1 FP16 scale per block = 34 bytes per 32 weights (1.0625 bytes/weight vs 2 bytes/weight in FP16).

Q4_0: 32 values in 4 bits + 1 FP16 scale per block = 18 bytes per 32 weights (0.5625 bytes/weight).
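The bytes-per-weight figures follow directly from the block layout. The sketch below recomputes them and applies them to a hypothetical 7B-parameter model, assuming every weight is stored in the same block format (real model files usually keep some tensors at higher precision).

```python
def bytes_per_weight(bits, block_size=32, scale_bytes=2):
    """Effective storage per weight for a block of values sharing one scale."""
    block_bytes = block_size * bits / 8 + scale_bytes
    return block_bytes / block_size


FP16_BYTES = 2.0
N_PARAMS = 7e9  # hypothetical model size

for name, bits in [("Q8_0", 8), ("Q4_0", 4)]:
    bpw = bytes_per_weight(bits)
    size_gb = N_PARAMS * bpw / 1e9
    print(f"{name}: {bpw} bytes/weight, "
          f"{FP16_BYTES / bpw:.2f}x smaller than FP16, ~{size_gb:.1f} GB")
```

The scale factor adds half a bit per weight in both formats, which is why the compression ratios land slightly below the nominal 2× and 4×.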

Why Quantization Helps Decode More Than Prefill

Recall from L28 and L33 that decode is memory-bound: the bottleneck is loading weight data, not computing with it. Quantization directly attacks this bottleneck by reducing the bytes per weight:

FP16 → Q8_0: Weights are ~1.9× smaller (2 bytes vs 1.0625 bytes per weight). Memory traffic for loading weights drops by the same factor, and decode speed improves roughly proportionally.
FP16 → Q4_0: Weights are ~3.6× smaller (2 bytes vs 0.5625 bytes per weight). Decode speed can improve by up to that factor, minus some overhead for dequantization arithmetic.
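A back-of-the-envelope estimate makes the proportionality concrete. The sketch below assumes decode is purely bandwidth-bound and that every weight is read exactly once per generated token; the 7B parameter count and the 100 GB/s bandwidth are hypothetical numbers chosen for illustration, and the function name is made up.

```python
def decode_tokens_per_second(n_params, bytes_per_weight, bandwidth_gb_s):
    """Upper bound on decode speed if every weight is read once per token."""
    weight_bytes = n_params * bytes_per_weight
    return bandwidth_gb_s * 1e9 / weight_bytes


N_PARAMS = 7e9          # hypothetical 7B model
BANDWIDTH_GB_S = 100.0  # hypothetical memory bandwidth

for name, bpw in [("FP16", 2.0), ("Q8_0", 1.0625), ("Q4_0", 0.5625)]:
    tps = decode_tokens_per_second(N_PARAMS, bpw, BANDWIDTH_GB_S)
    print(f"{name}: at most ~{tps:.1f} tokens/s")
```

The ratios between the three lines depend only on bytes per weight, not on the assumed bandwidth, so the relative gain is about 1.9× for Q8_0 and 3.6× for Q4_0 under this model. Measured speedups are typically lower because of dequantization cost and other memory traffic such as the KV cache.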

Prefill benefits less because it is compute-bound — the weights are reused across many token rows, so loading them is not the bottleneck. Quantization still helps prefill by reducing cache pressure and allowing larger models to fit in memory, but the speedup is smaller than for decode.

This asymmetry is why many practitioners quantize aggressively: the decode phase (which determines token generation speed that users directly experience) benefits the most.

Math

Quantize: qᵢ = round(xᵢ / scale)

Dequantize: x̂ᵢ = qᵢ × scale

scale = max(|x|) / max_int_value

Quantization error: |xᵢ − x̂ᵢ| ≤ scale / 2 per element. Accumulated across a dot product over thousands of dimensions, small per-element errors can shift the result.
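Both claims can be checked numerically. The sketch below quantizes synthetic weights in blocks of 32, verifies the per-element bound, and compares a dot product before and after quantization. The data is random and the helper name is hypothetical, so the printed numbers illustrate the effect rather than describe any real model.

```python
import random


def quantize_dequantize(values, max_int=127):
    """Round-trip one block through symmetric quantization."""
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return 0.0, list(values)
    scale = max_abs / max_int
    return scale, [round(v / scale) * scale for v in values]


random.seed(0)  # fixed seed so the run is repeatable
dim, block = 4096, 32
w = [random.gauss(0.0, 0.05) for _ in range(dim)]  # synthetic weights
x = [random.gauss(0.0, 1.0) for _ in range(dim)]   # synthetic activations

w_hat = []
max_excess = 0.0  # how far any element exceeds the scale / 2 bound
for start in range(0, dim, block):
    chunk = w[start:start + block]
    scale, restored = quantize_dequantize(chunk)
    w_hat.extend(restored)
    worst = max(abs(a - b) for a, b in zip(chunk, restored))
    max_excess = max(max_excess, worst - scale / 2)

exact = sum(a * b for a, b in zip(w, x))
approx = sum(a * b for a, b in zip(w_hat, x))
print(f"largest excess over scale/2: {max_excess:.2e}")
print(f"dot product: exact={exact:.5f}, quantized={approx:.5f}")
```

Because rounding errors have mixed signs they partly cancel, so the dot-product error is usually far smaller than the worst case of every element being off by scale / 2 in the same direction.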

Implementation Hook

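In llama.cpp, quantization formats are defined in ggml/src/ggml-common.h. Each block type (Q4_0, Q4_K, Q8_0, etc.) has a block structure and corresponding dequantize and dot-product kernels; file-level presets such as Q4_K_M are built from these block types. Model conversion starts with convert_hf_to_gguf.py, which writes a .gguf file; the llama-quantize tool can then re-quantize that file to formats such as Q4_K_M.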

ggml/src/ggml-common.h — block_q8_0 (L234)
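To get a feel for the 34-byte block without reading C, the layout described in this lesson (one FP16 scale followed by 32 int8 values) can be mimicked with Python's struct module. This is only a sketch: the field order, the little-endian byte order, and the helper names are assumptions for illustration, so check the header itself for the authoritative definition.

```python
import struct

QK8_0 = 32  # weights per block, as described in the text
# "<" = little-endian with no padding, "e" = FP16, "32b" = 32 signed bytes
BLOCK_Q8_0 = struct.Struct(f"<e{QK8_0}b")


def pack_block(scale, quants):
    """Serialize one block: 2-byte FP16 scale followed by 32 int8 values."""
    return BLOCK_Q8_0.pack(scale, *quants)


def unpack_block(raw):
    """Return (scale, quants) from one serialized block."""
    scale, *quants = BLOCK_Q8_0.unpack(raw)
    return scale, quants


raw = pack_block(0.5, list(range(-16, 16)))
scale, quants = unpack_block(raw)

print(BLOCK_Q8_0.size)  # 34
print(scale, quants[:4])
```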

Performance Hook

Quantization improves memory-bound operations (like decode GEMV) significantly because it reduces bytes moved through the memory bus. But it also changes the arithmetic: dequantization has a cost, and specialized integer dot-product instructions may be faster or slower than floating-point equivalents depending on the ISA. The speed gain is real but the output quality changes — always validate after quantizing.
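The two arithmetic strategies can be compared on the toy block from earlier. The sketch below assumes the activations have also been quantized with their own scale, which is what lets the inner loop run on integers. The activation values and function names are hypothetical.

```python
scale_w, q_w = 0.83 / 127, [18, -72, 127, -8]   # weights from the toy example
scale_x, q_x = 2.0 / 127, [64, 32, -16, 127]    # hypothetical quantized activations


def dot_dequantize_first(scale_w, q_w, scale_x, q_x):
    """Convert every value back to floating point, then multiply and add."""
    return sum((a * scale_w) * (b * scale_x) for a, b in zip(q_w, q_x))


def integer_accumulator(q_w, q_x):
    """Multiply and add in pure integer arithmetic."""
    return sum(a * b for a, b in zip(q_w, q_x))


def dot_integer_accumulate(scale_w, q_w, scale_x, q_x):
    """Accumulate in integers, then apply both scales once per block."""
    return integer_accumulator(q_w, q_x) * (scale_w * scale_x)


print(integer_accumulator(q_w, q_x))  # -4200
print(dot_dequantize_first(scale_w, q_w, scale_x, q_x))
print(dot_integer_accumulate(scale_w, q_w, scale_x, q_x))
```

Both paths give the same result up to floating-point rounding, but the second does one floating-point multiply per block instead of one per element. Whether that is faster in practice depends on the integer dot-product instructions the target ISA provides.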

Check Yourself

Q1 (reasoning)

Why can quantization affect both speed and output quality?

It changes the model architecture by removing layers
It reduces the precision of weights (affecting quality) and reduces bytes moved through memory (affecting speed)
It only affects speed; quality stays exactly the same

Q2 (math)

A Q8_0 block has scale = 0.02 and stores the int8 value 64 for one weight. What is the dequantized floating-point value? If the original weight was 1.285, what is the quantization error?

Dequantized: 1.28, error: 0.005
Dequantized: 3.20, error: 1.915
Dequantized: 0.64, error: 0.645