L29

The KV Cache Stores Past Attention State

16 min

Question

What does the KV cache store?

Intuition

During attention, each token's query (Q) interacts with the keys (K) and values (V) of every earlier token, plus its own. Without caching, every decode step would recompute K and V for the whole sequence from scratch, repeating all the work of every prior step.

The KV cache stores the K and V projections that have already been computed. When a new token arrives during decode, the model only computes K and V for that one new token and appends them to the cache. The query then attends to the full cached sequence.

This avoids quadratic recomputation, but it costs memory. The cache stores one K vector and one V vector per token, per layer, per KV head. (In GQA/MQA models, there are fewer KV heads than query heads, which is one reason those variants exist.) For a dense model with 32 layers, 32 KV heads, d_head=128, and float16 (2 bytes per element), the cache is 2 × 32 × 32 × 128 × 2 bytes = 524,288 bytes = 512 KiB per token. A 4,096-token context therefore uses 4,096 × 512 KiB = 2 GiB of KV cache alone.

Toy Example

Generating tokens after a 3-token prompt, showing cache growth:

prefill: process [A, B, C] → cache = [K,V for A, B, C]

decode step 1: process [D] → cache = [K,V for A, B, C, D]

decode step 2: process [E] → cache = [K,V for A, B, C, D, E]

The cache grows by one entry per token generated. Each entry is (K, V) for every layer and KV head.
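The trace above can be reproduced with a few lines of Python. This is only a bookkeeping sketch: `ToyKVCache` is a hypothetical name, and it stores token labels in place of real K and V tensors.

```python
class ToyKVCache:
    """Tracks which tokens have cached (K, V) entries; labels stand in for tensors."""

    def __init__(self):
        self.entries = []  # one placeholder per cached token

    def process(self, tokens):
        # Prefill passes many tokens at once; a decode step passes exactly one.
        for tok in tokens:
            self.entries.append(tok)
        return list(self.entries)  # snapshot of the cache after this call


cache = ToyKVCache()
trace = [
    ("prefill", cache.process(["A", "B", "C"])),
    ("decode step 1", cache.process(["D"])),
    ("decode step 2", cache.process(["E"])),
]
for phase, cached in trace:
    print(f"{phase}: cache holds K,V for {cached}")
```

In a real model each entry would be a pair of tensors per layer and per KV head, but the growth pattern is the same: prefill adds one entry per prompt token, and each decode step adds exactly one more.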

Shapes

K cache per layer per KV head: [n_past, d_head]

V cache per layer per KV head: [n_past, d_head]

Total cache: 2 × n_layers × n_kv_heads × n_past × d_head × sizeof(element)

Memory grows linearly with n_past. Longer sequences = more cache memory.
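The total-cache formula is easy to turn into a small calculator. The sketch below assumes a uniform element size across the whole cache; the function name `kv_cache_bytes` and its parameter names are illustrative, not taken from any library.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, n_past, bytes_per_element=2):
    """Total KV cache size in bytes; the leading 2 counts K and V."""
    return 2 * n_layers * n_kv_heads * n_past * d_head * bytes_per_element


# The 32-layer, 32-KV-head, d_head=128, float16 model from the Intuition section.
per_token = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, n_past=1)
full_ctx = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, n_past=4096)
# Same model shape, but with 8 KV heads (GQA-style).
gqa_ctx = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, n_past=4096)

print(per_token // 1024, "KiB per token")          # 512
print(full_ctx / 1024**3, "GiB at 4,096 tokens")   # 2.0
print(gqa_ctx / 1024**3, "GiB with 8 KV heads")    # 0.5
```

Doubling `n_past` doubles the result, which is the linear growth described above.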

Math

Without the cache, decode step t would recompute K for all t tokens:

// Without cache (wasteful):

K = X[1:t] · W_K // recompute K for ALL tokens every step

// With cache (actual):

k_new = x_t · W_K // compute K for new token only

K = concat(K_cached, k_new) // append to cache

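The same applies to V. Over t decode steps, total K/V projection work drops from O(t²) to O(t), because each step projects one new token instead of all t. Each step still reads the full cache, so attention itself remains O(t) per step.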
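To check that caching changes the cost but not the result, the sketch below runs both variants side by side for a single attention head in a single layer. Everything here is an assumption made for illustration: the dimensions, the random weights, and the helper names `matvec` and `attend`. It uses plain Python lists rather than a tensor library so it runs anywhere.

```python
import math
import random


def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def attend(q, K, V):
    """Single-head softmax attention of one query over all rows of K and V."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, V)) / total for j in range(len(V[0]))]


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
d_model, d_head, n_tokens = 8, 4, 6
W_Q, W_K, W_V = (rand_matrix(d_model, d_head) for _ in range(3))
X = rand_matrix(n_tokens, d_model)  # one layer-input row per token

K_cached, V_cached = [], []
projections_cached = projections_naive = 0
max_diff = 0.0
for t in range(n_tokens):
    q = matvec(X[t], W_Q)

    # With cache: project only the new token, then append.
    K_cached.append(matvec(X[t], W_K))
    V_cached.append(matvec(X[t], W_V))
    projections_cached += 1
    out_cached = attend(q, K_cached, V_cached)

    # Without cache: re-project every token seen so far.
    K_full = [matvec(x, W_K) for x in X[: t + 1]]
    V_full = [matvec(x, W_V) for x in X[: t + 1]]
    projections_naive += t + 1
    out_naive = attend(q, K_full, V_full)

    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(out_cached, out_naive)))

print(projections_cached, projections_naive, max_diff)
```

Both paths produce the same attention output at every step; only the number of K/V projections differs (one per step with the cache, t per step without).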

Why Cache and Not Recompute?

Without the KV cache, generating the t-th token would require rerunning the K and V projections for all t tokens. Over a 100-token generation, this means 1 + 2 + 3 + ... + 100 = 100 × 101 / 2 = 5,050 token-projection operations. With the cache, you compute K/V for exactly 100 tokens total (each once, then cached). The cache trades memory for a roughly 50× reduction in projection compute at this length (5,050 / 100 ≈ 50), and because the ratio is (n + 1) / 2 for n generated tokens, the savings grow linearly with generation length.
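The counting argument can be checked directly. This sketch uses the same simplified accounting as the paragraph above (one "operation" is one token's K/V projection, starting from an empty context); `projection_counts` is a hypothetical helper name.

```python
def projection_counts(n_generated):
    """Token-projection operations without and with a KV cache."""
    without_cache = sum(range(1, n_generated + 1))  # step t re-projects t tokens
    with_cache = n_generated                        # each token is projected once
    return without_cache, with_cache


for n in (100, 1_000, 10_000):
    naive, cached = projection_counts(n)
    print(f"n={n}: {naive} vs {cached} -> {naive / cached:.1f}x")
```

At n = 100 the ratio is 50.5; at ten times the length it is roughly ten times larger, which is the linear growth in savings.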

The tradeoff is clear: the cache uses memory that scales with context length, but it avoids re-doing work that grows quadratically without it. For any non-trivial generation length, caching wins overwhelmingly.

Cache Management: Pre-allocation and Eviction

In llama.cpp, the KV cache is pre-allocated for the maximum context length when the context is created. A model configured with n_ctx = 8192 allocates cache for 8192 tokens upfront, even if the first prompt is only 100 tokens. This avoids expensive reallocation during generation but means unused cache slots consume memory. Other serving systems (e.g., vLLM) use paged allocation: the cache is split into fixed-size blocks that are assigned to a sequence as it grows, trading extra bookkeeping for better memory utilization.

For sliding-window attention (SWA) layers, entries older than the window size can be evicted, and their slots are reused for new tokens. For full-attention layers, entries must be retained for the full context. Real implementations track which slots are active and which can be reclaimed.
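One simple way to picture slot reuse in a sliding-window layer is a ring buffer over a fixed pool of slots. The sketch below is an illustration only: `SlidingWindowSlots` is a hypothetical class, it tracks positions rather than real K/V data, and it is not how llama.cpp organizes its slot tracking.

```python
class SlidingWindowSlots:
    """Fixed pool of cache slots; positions outside the window are overwritten."""

    def __init__(self, window):
        self.window = window
        self.slot_pos = [None] * window  # position held by each slot, None = free

    def write(self, pos):
        # Ring buffer: position p always lands in slot p % window, which holds
        # either nothing or position p - window (now outside the window).
        slot = pos % self.window
        evicted = self.slot_pos[slot]
        self.slot_pos[slot] = pos
        return slot, evicted

    def active_positions(self):
        return sorted(p for p in self.slot_pos if p is not None)


swa = SlidingWindowSlots(window=4)
for pos in range(6):
    slot, evicted = swa.write(pos)
    print(f"pos {pos} -> slot {slot}, evicted {evicted}")
print(swa.active_positions())  # [2, 3, 4, 5]
```

The pool never grows past the window size, so memory for this layer stays constant no matter how long generation runs. A full-attention layer has no such bound, because no position ever becomes evictable.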

Implementation Hook

In llama.cpp, the KV cache is allocated once, for the full context length, when the context is created. During each decode step, new K and V entries are written into the cache at the current position. When a request finishes, its cache slots are freed for the next request.

src/llama-kv-cache.cpp — llama_kv_cache (L79)

Performance Hook

The KV cache is often the dominant memory consumer during inference — larger than the memory needed for activations. Quantizing the cache (e.g., to Q8 or Q4) reduces its footprint but introduces approximation. Techniques like GQA (grouped-query attention) reduce cache size by sharing K/V across head groups, which is why many modern models use it.
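To make "introduces approximation" concrete, here is a minimal sketch of symmetric 8-bit block quantization: one shared scale per block plus one small integer code per value. It is in the spirit of 8-bit cache formats but is a simplified stand-in, not llama.cpp's actual quantization code; the block contents are made up and the function names are hypothetical.

```python
def quantize_block(values):
    """Symmetric 8-bit quantization: one float scale plus integer codes in [-127, 127]."""
    amax = max(abs(v) for v in values)
    scale = amax / 127 if amax > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate values from the scale and the integer codes."""
    return [scale * c for c in codes]


block = [0.013 * i - 0.2 for i in range(32)]  # 32 made-up K values
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(f"scale={scale:.6f}, max abs error={max_err:.6f}")
```

The round-trip error per value is bounded by half the scale, so blocks with a large outlier lose more precision on their small values. Assuming the scale is stored as a 2-byte float, a 32-value block takes 32 + 2 = 34 bytes instead of 64 bytes in float16, a bit under half the size.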

Check Yourself

Q1 (math)

A model has 32 layers, 8 KV heads per layer, and d_head = 128. A user sends a 1,000-token prompt. Roughly how many float values does the KV cache hold after prefill?

1,000 × 32 × 8 × 128 × 2 ≈ 65.5M (one K + one V per head per layer per token)

1,000 × 32 × 128 ≈ 4.1M (one vector per layer per token)

32 × 8 × 128 × 2 ≈ 65K (cache size is independent of token count)

Q2 (conceptual)

If you switch from full multi-head attention (32 KV heads) to GQA with 8 KV heads, what happens to the KV cache size?

It becomes 4× smaller (8/32 = 1/4)

It stays the same because Q still has 32 heads

It doubles because the shared heads need more capacity