M6/Inference Mechanics
L28

Decode Adds One Token at a Time

14 min
Why is decode different from prefill?

After prefill, the model has processed the entire prompt. Now it needs to generate output — one token at a time. Each new token depends on the one before it, so you cannot generate them in parallel.

During decode, the model runs a forward pass for one new token. It computes Q for that single token, but it needs K and V from all previous tokens to compute attention. Instead of recomputing K and V from scratch, it reuses the KV cache (the stored key and value vectors from previous tokens — explained fully in L29) — the accumulated keys and values from prefill and all prior decode steps.

This changes the computational profile completely. The matrices are thin: Q has 1 row (one new token), while K and V have many rows (all previous tokens). The FFN also processes just 1 row. These are small matrix multiplies — or more precisely, matrix-vector products — which cannot saturate the hardware's compute units. Decode is memory-bandwidth-bound: the hardware does so little arithmetic relative to the data it must read (weights and cached K/V) that the bottleneck is memory speed, not compute speed. L34 will make this distinction precise.

After prefill of "The cat sat on" (4 tokens) produced "the" as the 5th token. Now decode runs to generate the 6th:

new token: [the] — 1 token
attention — Q is [1, d_head], K from KV cache is [5, d_head]
FFN — 1 row, a matrix-vector product
output: logits for token 6 + KV cache updated

Same computation graph as prefill, but with 1 row instead of n_prompt rows. Very different performance characteristics.

During decode, the model loads the same weight matrices as during prefill — but does far less arithmetic with them. Consider one FFN down-projection: the weight matrix W_down has shape [d_ff, d_model]. With d_ff = 16,384 and d_model = 4,096:

  • Bytes loaded: 16,384 × 4,096 × 2 (FP16) = ~134 MB for this one weight matrix.
  • FLOPs performed: 2 × 16,384 × 4,096 = ~134 million FLOPs (each multiply-add counts as 2 FLOPs).
  • Arithmetic intensity: 134M FLOPs / 134 MB = ~1 FLOP/byte. At a machine balance of 100 FLOP/byte, the hardware spends 99% of its time waiting for data.

This is the fundamental reason decode is slow: each weight element is loaded from memory but used only once (for one token). During prefill with n tokens, the same weight is used n times, making the arithmetic intensity n times higher. This is the same analysis you will formalize in M7, but the intuition is simple: one token cannot keep the hardware busy.

Decode latency (time per generated token) is dominated by how fast you can stream weight data through the memory bus. Improvements that help prefill (better matrix-multiply kernels, larger batch sizes) help decode very little, because decode is not bottlenecked on compute. Instead, decode benefits from:

  • Quantization — smaller weights means fewer bytes to load (you will learn this in M7).
  • Fewer KV heads (GQA/MQA) — less KV cache data to read per attention layer.
  • Batching multiple requests — processing several users' decode tokens at once increases arithmetic intensity because the loaded weights serve multiple tokens. This is why serving latency and throughput are in tension.
New Q per head: [1, d_head]
K after appending new token: [n_past + 1, d_head]
Attention scores: [1, n_past + 1]
FFN input: [1, d_model]
n_past = tokens already in the cache before this step. The new token's K/V are appended before the dot product, so K has n_past + 1 rows.

During decode, attention is a matrix-vector product, not a matrix-matrix product:

q = x_new · W_Q   // [1, d_head]
K = concat(K_cached, k_new)   // [n_past + 1, d_head]
scores = q · Kᵀ   // [1, n_past + 1] — vector, not matrix
The arithmetic is tiny; the cost is reading K from memory.

In llama.cpp, decode calls the same llama_decode() function as prefill, but with a batch of size 1. The computation graph is identical — what changes is the shape of the input tensors. Different backends may select different kernels for the small matrix sizes typical of decode.

Decode is memory-bandwidth-bound. Each step reads the full model weights plus the KV cache, but only does a tiny amount of arithmetic (one row). Faster memory (higher GB/s) helps decode more than faster compute (higher TFLOPS). This is why decode tokens per second is often much lower than prefill tokens per second — the GPU arithmetic units are mostly idle, waiting for data.

Check Yourself
mathQ1

Consider a single projection weight matrix W of shape [4096, 4096] stored in FP16. During one decode step (1 token), how many bytes must be loaded and how many FLOPs are performed for this one matrix?

shapeQ2

During decode, what determines the width of the attention score vector?