L21

Grouped-Query and Multi-Query Attention

14 min

Question

Why use fewer K/V heads than Q heads?

Intuition

Standard multi-head attention (MHA) gives every head its own Q, K, and V projections. That means the KV cache stores one key vector and one value vector per head, per token, per layer. For long sequences, this cache can dominate memory use.

Grouped-Query Attention (GQA) reduces the number of K/V heads. Instead of giving every Q head its own K/V pair, several Q heads share the same K and V. If you have 32 Q heads and 8 KV heads, each KV head serves a group of 4 Q heads.

Multi-Query Attention (MQA) is the extreme case: a single K/V head shared by all Q heads. This minimizes KV cache size but reduces the model's ability to learn diverse attention patterns across heads.

GQA is the practical middle ground used in many recent models, including Llama 2 70B, Llama 3, and Mistral 7B. It saves memory and bandwidth while avoiding most of the quality loss that MQA can cause.

Why GQA Works: The Asymmetry of Q vs K/V

The key insight behind GQA is that Q and K/V play fundamentally different roles in attention. Q (query) determines what each token is looking for — each head can ask a different question. K/V (key/value) determines what information is available to be found.

It turns out that the "available information" can be shared across head groups without much quality loss: multiple query heads can ask different questions of the same key/value data. But the queries need to stay independent. If you reduce the number of Q heads, the model has fewer distinct questions to ask, so it can attend to fewer aspects of the input.

This asymmetry is why GQA reduces K/V heads but keeps all Q heads. The memory savings come entirely from the KV cache (which stores one K and one V entry per KV head) and the K/V weight matrices (which shrink by the same factor, n_q_heads / n_kv_heads). Q heads and the Q weight matrix stay at full size.
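To make the weight-matrix side of this concrete, here is a minimal sketch that counts projection parameters for one layer. The value d_model = 4096 and the helper name projection_params are assumptions chosen for illustration; biases are ignored.

```python
def projection_params(d_model: int, n_q_heads: int, n_kv_heads: int, d_head: int) -> dict:
    """Parameter counts of the Q, K, V projection matrices for one layer (no biases)."""
    return {
        "W_q": d_model * n_q_heads * d_head,   # unchanged by GQA
        "W_k": d_model * n_kv_heads * d_head,  # shrinks with n_kv_heads
        "W_v": d_model * n_kv_heads * d_head,  # shrinks with n_kv_heads
    }

# Assumed dimensions for illustration only: d_model = 4096, d_head = 128.
mha = projection_params(d_model=4096, n_q_heads=32, n_kv_heads=32, d_head=128)
gqa = projection_params(d_model=4096, n_q_heads=32, n_kv_heads=8, d_head=128)

for name in ("W_q", "W_k", "W_v"):
    print(f"{name}: MHA {mha[name]:,} params, GQA {gqa[name]:,} params")
```

Under these assumptions W_q is identical in both configurations, while W_k and W_v are each 4× smaller with 8 KV heads.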

Concrete Memory Savings

For a model with 32 layers, 32 Q heads, d_head = 128, and FP16 storage (2 bytes per value), the KV cache per token is 2 (one K and one V) × layers × KV heads × d_head × bytes per value:

MHA (32 KV heads): 2 × 32 × 32 × 128 × 2 bytes = 512 KB/token

GQA (8 KV heads): 2 × 32 × 8 × 128 × 2 bytes = 128 KB/token (4× smaller)

MQA (1 KV head): 2 × 32 × 1 × 128 × 2 bytes = 16 KB/token (32× smaller)

For a 4,096-token context: MHA uses ~2 GB of cache, GQA uses ~512 MB, MQA uses ~64 MB. The difference determines whether the model fits in GPU memory and how fast decode runs (less cache = less data to read per attention step).
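The arithmetic above can be reproduced with a short script. This is a sketch under the same assumptions (32 layers, d_head = 128, FP16, one sequence); the function name kv_cache_bytes_per_token is made up for this example, and KB/MB are taken as powers of 1024.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, d_head: int,
                             bytes_per_value: int = 2) -> int:
    """Bytes of KV cache added per token: one K and one V vector per KV head, per layer."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_value

KIB = 1024
MIB = 1024 * 1024
CONTEXT_TOKENS = 4096

per_token = {}
for name, n_kv_heads in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    per_token[name] = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=n_kv_heads, d_head=128)
    total = per_token[name] * CONTEXT_TOKENS
    print(f"{name}: {per_token[name] // KIB} KB/token, "
          f"{total // MIB} MB at {CONTEXT_TOKENS} tokens")
```

Changing bytes_per_value to 1 models an 8-bit cache, which halves every figure independently of the head count.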

Toy Example

A model with 8 Q heads, comparing MHA vs GQA vs MQA:

MHA: 8 Q heads, 8 KV heads → 8 K + 8 V cached per token

GQA: 8 Q heads, 2 KV heads → 2 K + 2 V cached per token

MQA: 8 Q heads, 1 KV head → 1 K + 1 V cached per token

GQA uses 4× less KV cache than MHA here. MQA uses 8× less.

Shapes

Q: [n_tokens, n_q_heads, d_head]

K: [n_tokens, n_kv_heads, d_head] where n_kv_heads ≤ n_q_heads

V: [n_tokens, n_kv_heads, d_head]

Each KV head is broadcast to (n_q_heads / n_kv_heads) Q heads.
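The broadcast can be sketched in plain Python using the shapes listed above. This is an illustrative reference, not how any particular runtime implements it: the function name gqa_attention is hypothetical, and the causal mask and 1/sqrt(d_head) scaling are assumptions taken from standard decoder attention rather than from this lesson.

```python
import math

def gqa_attention(q, k, v):
    """Causal attention where several Q heads share one KV head.

    q: [n_tokens][n_q_heads][d_head]
    k, v: [n_tokens][n_kv_heads][d_head]
    returns: [n_tokens][n_q_heads][d_head]
    """
    n_tokens, n_q_heads, d_head = len(q), len(q[0]), len(q[0][0])
    n_kv_heads = len(k[0])
    assert n_q_heads % n_kv_heads == 0, "n_q_heads must be divisible by n_kv_heads"
    group_size = n_q_heads // n_kv_heads
    scale = 1.0 / math.sqrt(d_head)

    out = [[None] * n_q_heads for _ in range(n_tokens)]
    for h in range(n_q_heads):
        kv = h // group_size  # the KV head shared by this Q head's group
        for t in range(n_tokens):
            # Scores against positions 0..t only (causal mask, an assumption).
            scores = [scale * sum(a * b for a, b in zip(q[t][h], k[s][kv]))
                      for s in range(t + 1)]
            m = max(scores)  # subtract the max for a numerically stable softmax
            weights = [math.exp(x - m) for x in scores]
            z = sum(weights)
            out[t][h] = [sum(w * v[s][kv][d] for s, w in enumerate(weights)) / z
                         for d in range(d_head)]
    return out

# Toy input: 2 tokens, 4 Q heads, 2 KV heads, d_head = 2.
q = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]] for _ in range(2)]
k = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [1.0, -1.0]]]
v = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
out = gqa_attention(q, k, v)
print(len(out), len(out[0]), len(out[0][0]))  # 2 4 2
```

Note that K and V are never copied: the only GQA-specific line is the index `h // group_size`. The output keeps the Q shape, so the layers after attention are unaffected by the number of KV heads.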

Math

The attention formula stays the same. What changes is the shape of K and V:

group_size = n_q_heads / n_kv_heads (n_q_heads must be divisible by n_kv_heads)

For each KV head i in [0, n_kv_heads): Q heads [i*group_size .. (i+1)*group_size) all use Kᵢ, Vᵢ

MHA: group_size = 1. MQA: group_size = n_q_heads. GQA: anything in between.
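The grouping rule reduces to one integer division. A minimal sketch, using a hypothetical helper name kv_head_for and the contiguous grouping described above:

```python
def kv_head_for(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Index of the KV head that a given Q head reads from (contiguous groups)."""
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be divisible by n_kv_heads")
    group_size = n_q_heads // n_kv_heads
    return q_head // group_size

mha_map = [kv_head_for(h, 8, 8) for h in range(8)]  # [0, 1, 2, 3, 4, 5, 6, 7]
gqa_map = [kv_head_for(h, 8, 2) for h in range(8)]  # [0, 0, 0, 0, 1, 1, 1, 1]
mqa_map = [kv_head_for(h, 8, 1) for h in range(8)]  # [0, 0, 0, 0, 0, 0, 0, 0]
print(mha_map, gqa_map, mqa_map, sep="\n")
```

With 8 Q heads and 2 KV heads, Q heads 0 to 3 read KV head 0 and Q heads 4 to 7 read KV head 1, matching the toy example.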

Implementation Hook

In llama.cpp, the number of KV heads is a model hyperparameter (n_head_kv). The build_attn() function handles matching each KV head to its group of Q heads. When n_head_kv < n_head, GQA is active; n_head_kv = 1 is the MQA case.

src/llama-graph.cpp — build_attn()

Performance Hook

Fewer KV heads means a smaller KV cache. This directly reduces memory usage and the bandwidth required to read K/V during decode. For long-context models, this is the primary reason GQA exists: it makes long sequences feasible on limited hardware.
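A rough way to see the bandwidth effect is to estimate how long it takes just to stream the KV cache once per decode step. The sketch below assumes every decode step reads the whole cache exactly once and uses a hypothetical memory bandwidth of 1 TB/s; it ignores weight reads and compute, so treat the numbers as a lower bound on attention time, not a benchmark. The model dimensions are the ones from the memory example (32 layers, d_head = 128, FP16, 4,096 tokens).

```python
ASSUMED_BANDWIDTH = 1e12  # bytes/second; hypothetical figure, not a measured value
N_LAYERS, D_HEAD, BYTES_PER_VALUE, N_TOKENS = 32, 128, 2, 4096

def cache_bytes(n_kv_heads: int) -> int:
    """Total KV cache size for one sequence of N_TOKENS tokens."""
    return 2 * N_LAYERS * n_kv_heads * D_HEAD * BYTES_PER_VALUE * N_TOKENS

def kv_read_time_ms(n_bytes: int, bandwidth_bytes_per_s: float) -> float:
    """Milliseconds needed to stream n_bytes once at the given bandwidth."""
    return n_bytes / bandwidth_bytes_per_s * 1e3

times = {name: kv_read_time_ms(cache_bytes(n_kv), ASSUMED_BANDWIDTH)
         for name, n_kv in [("MHA", 32), ("GQA", 8), ("MQA", 1)]}

for name, ms in times.items():
    print(f"{name}: {ms:.3f} ms per decode step (KV reads only)")
```

Because the estimate is linear in the cache size, the ratios between configurations (4× and 32× here) hold for any assumed bandwidth.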

Check Yourself

Q1 (math)

A model has 32 Q heads and 8 KV heads (GQA). How many Q heads share each KV head?

2 (32 / 16)

4 (32 / 8)

8 (one group per KV head)

Q2 (conceptual)

If you change from GQA (8 KV heads) to MQA (1 KV head) while keeping 32 Q heads, what happens?

KV cache shrinks 8×, but all 32 Q heads now read from the same single K/V: less expressive but much cheaper

KV cache stays the same size because Q head count did not change

The model can no longer run multi-head attention at all