The KV Cache Stores Past Attention State
During attention, each token's query (Q) needs to interact with the keys (K) and values (V) of all previous tokens. Without caching, every decode step would recompute K and V for every token from scratch — repeating all the work from every prior step.
The KV cache stores the K and V projections that have already been computed. When a new token arrives during decode, the model only computes K and V for that one new token and appends them to the cache. The query then attends to the full cached sequence.
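The append-then-attend step described above can be sketched for a single layer and a single head. This is a minimal illustration, not any library's implementation: the weight matrices, the `KVCache` class, and the helper names are all hypothetical, and NumPy is assumed to be available. The loop checks that the cached path gives the same output as recomputing K and V from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 16, 8

# Hypothetical projection weights for one attention head.
W_q = rng.standard_normal((d_model, d_head))
W_k = rng.standard_normal((d_model, d_head))
W_v = rng.standard_normal((d_model, d_head))


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


def attend(q, K, V):
    """Attention output for one query over all rows of K and V."""
    scores = K @ q / np.sqrt(d_head)
    return softmax(scores) @ V


class KVCache:
    """Holds one K row and one V row per token (single layer, single head)."""

    def __init__(self):
        self.K = np.empty((0, d_head))
        self.V = np.empty((0, d_head))

    def append(self, x):
        # Project only the new token, then append to the cached rows.
        self.K = np.vstack([self.K, x @ W_k])
        self.V = np.vstack([self.V, x @ W_v])


def decode_step_cached(cache, x_new):
    cache.append(x_new)
    return attend(x_new @ W_q, cache.K, cache.V)


def decode_step_uncached(X_all):
    # Recompute K and V for every token seen so far.
    K, V = X_all @ W_k, X_all @ W_v
    return attend(X_all[-1] @ W_q, K, V)


X = rng.standard_normal((5, d_model))  # hidden states for 5 tokens
cache = KVCache()
for t in range(len(X)):
    out_cached = decode_step_cached(cache, X[t])
    out_full = decode_step_uncached(X[: t + 1])
    assert np.allclose(out_cached, out_full)
```

Growing the arrays with `np.vstack` keeps the sketch short; a real cache would write into pre-allocated storage instead of copying on every step.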
This avoids quadratic recomputation — but it costs memory. The cache stores one K vector and one V vector per token, per layer, per KV head. (In GQA/MQA models, there are fewer KV heads than query heads — which is one reason those variants exist.) For a dense model with 32 layers, 32 KV heads, d_head=128, and float16, the cache is 2 × 32 × 32 × 128 × 2 bytes = 512 KB per token. A 4,096-token context uses about 2 GB of KV cache alone.
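The arithmetic above can be checked with a small helper. The function name and parameter names are illustrative, not taken from any library, and the result counts only the K and V tensors themselves.

```python
def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, d_head, bytes_per_value=2):
    """Bytes of KV cache for n_tokens (bytes_per_value=2 means float16)."""
    # Factor of 2: one K vector and one V vector per token, layer, and KV head.
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_value * n_tokens


per_token = kv_cache_bytes(1, n_layers=32, n_kv_heads=32, d_head=128)
full_ctx = kv_cache_bytes(4096, n_layers=32, n_kv_heads=32, d_head=128)

print(per_token // 1024, "KiB per token")          # 512 KiB per token
print(full_ctx / 1024**3, "GiB for 4,096 tokens")  # 2.0 GiB for 4,096 tokens
```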
When generating after a 3-token prompt, the cache holds 3 entries once prefill completes, then 4, 5, 6, and so on: it grows by one entry per token generated. Each entry is (K, V) for every layer and KV head.
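Without the cache, decode step t would recompute K for all t tokens: K_{1:t} = X_{1:t} W_K, where X_{1:t} stacks the hidden states of tokens 1 through t and W_K is the key projection matrix. With the cache, only the new row k_t = x_t W_K is computed and appended.
The same applies to V. Over a t-token generation, this turns O(t²) total projection work into O(t). Each step still reads every cached entry, so the attention read itself remains O(t) per step.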
Without the KV cache, generating the t-th token would require rerunning the K and V projections for all t tokens. Over a 100-token generation, this means computing K/V for: 1 + 2 + 3 + ... + 100 = 5,050 token-projection operations. With the cache, you compute K/V for exactly 100 tokens total (each once, then cached). The cache trades memory for a ~50× reduction in redundant compute at this sequence length — and the savings grow linearly with generation length.
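The counts above are easy to reproduce. The sketch below tallies token-projection operations only, under the same simplification as the text: prompt tokens are ignored and each "operation" is one token's K/V projection. The function names are illustrative.

```python
def kv_projections_without_cache(n_generated):
    # Step t recomputes K/V for all t tokens: 1 + 2 + ... + n.
    return sum(range(1, n_generated + 1))


def kv_projections_with_cache(n_generated):
    # Each token's K/V is computed once, then read from the cache.
    return n_generated


for n in (100, 1000):
    without = kv_projections_without_cache(n)
    cached = kv_projections_with_cache(n)
    print(n, without, cached, without / cached)
# 100 5050 100 50.5
# 1000 500500 1000 500.5
```

The ratio is (n + 1) / 2, which is where the linear growth in savings comes from.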
The tradeoff is clear: the cache uses memory that scales with context length, but it avoids re-doing work that grows quadratically without it. For any non-trivial generation length, caching wins overwhelmingly.
In llama.cpp, the KV cache is pre-allocated for the maximum context length when the context is created. A model configured with n_ctx = 8192 allocates cache for 8192 tokens upfront, even if the first prompt is only 100 tokens. This avoids expensive reallocation during generation but means unused cache slots consume memory. Other serving systems (e.g., vLLM) use paged allocation that grows on demand, trading allocation overhead for better memory utilization.
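The difference between the two allocation strategies can be shown with a toy slot counter. Both classes below are hypothetical bookkeeping models, not llama.cpp or vLLM code, and the block size of 16 tokens is an assumed value chosen for illustration.

```python
import math


class PreallocatedCache:
    """Reserves slots for the whole context up front."""

    def __init__(self, n_ctx):
        self.reserved = n_ctx
        self.used = 0

    def append(self, n_tokens=1):
        if self.used + n_tokens > self.reserved:
            raise RuntimeError("context length exceeded")
        self.used += n_tokens


class PagedCache:
    """Reserves fixed-size blocks only as tokens arrive."""

    def __init__(self, block_size=16):
        self.block_size = block_size
        self.n_blocks = 0
        self.used = 0

    def append(self, n_tokens=1):
        self.used += n_tokens
        # Allocate just enough blocks to cover the tokens stored so far.
        self.n_blocks = max(self.n_blocks, math.ceil(self.used / self.block_size))

    @property
    def reserved(self):
        return self.n_blocks * self.block_size


pre = PreallocatedCache(n_ctx=8192)
paged = PagedCache(block_size=16)
pre.append(100)
paged.append(100)
print(pre.reserved, paged.reserved)  # 8192 112
```

After a 100-token prompt, the pre-allocated model holds 8192 slots while the paged model holds 7 blocks of 16, at the cost of an allocation check on every append.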
For sliding-window attention (SWA) layers, entries older than the window size can be evicted, and their slots are reused for new tokens. For full-attention layers, entries must be retained for the full context. Real implementations track which slots are active and which can be reclaimed.
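One simple way to reuse slots in a sliding-window layer is a ring buffer, where token position `pos` is written to slot `pos % window`. This is an assumed scheme for illustration; real implementations may track slots differently, and the class name is hypothetical.

```python
class SlidingWindowSlots:
    """Tracks which token position occupies each slot of a fixed-size window."""

    def __init__(self, window):
        self.window = window
        self.slot_pos = [None] * window  # token position held in each slot

    def write(self, pos):
        # The slot that held position (pos - window) is overwritten.
        slot = pos % self.window
        evicted = self.slot_pos[slot]
        self.slot_pos[slot] = pos
        return slot, evicted

    def visible_positions(self):
        return sorted(p for p in self.slot_pos if p is not None)


slots = SlidingWindowSlots(window=4)
for pos in range(6):
    slot, evicted = slots.write(pos)

print(slots.visible_positions())  # [2, 3, 4, 5]
```

With a window of 4, writing position 4 evicts position 0 and writing position 5 evicts position 1, so memory use stays constant however long generation runs.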
In llama.cpp, each decode step writes the new K and V entries into the cache at the current position. Because the cache was allocated once, for the full context length, when the context was created, no allocation is needed during generation. When a request finishes, its cache slots are freed for the next request.
The KV cache is often the dominant consumer of runtime memory during inference, larger than the memory needed for activations, and at long contexts or large batch sizes it can rival the model weights. Quantizing the cache (e.g., to Q8 or Q4) reduces its footprint but introduces approximation error. Techniques like GQA (grouped-query attention) reduce cache size by sharing K/V across groups of query heads, which is why many modern models use it.
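The footprint-versus-accuracy tradeoff of cache quantization can be seen on a single K vector. The sketch below uses a plain symmetric 8-bit scheme with one scale per vector; this is an assumption made for illustration and is not the block format llama.cpp actually uses. NumPy is assumed to be available.

```python
import numpy as np


def quantize_int8(x):
    # One float scale per vector; values map to integers in [-127, 127].
    scale = np.abs(x).max() / 127.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale


def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
k = rng.standard_normal(128).astype(np.float16)  # one K vector, d_head = 128

q, scale = quantize_int8(k.astype(np.float32))
k_hat = dequantize_int8(q, scale)
max_err = np.abs(k.astype(np.float32) - k_hat).max()

print(k.nbytes, q.nbytes)  # 256 128
print(max_err <= scale / 2 + 1e-6)  # True: rounding error is at most half a step
```

The int8 copy takes half the bytes of the float16 original (plus one scale value), and each element is off by at most half a quantization step.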
A model has 32 layers, 8 KV heads per layer, and d_head = 128. A user sends a 1,000-token prompt. Roughly how many float values does the KV cache hold after prefill?
If you switch from full multi-head attention (32 KV heads) to GQA with 8 KV heads, what happens to the KV cache size?