Grouped-Query and Multi-Query Attention
Standard multi-head attention gives every head its own Q, K, and V projections. That means the KV cache stores one key vector and one value vector per head, per token, per layer. For long sequences, this cache dominates memory.
Grouped-Query Attention (GQA) reduces the number of K/V heads. Instead of giving every Q head its own K/V pair, several Q heads share the same K and V. If you have 32 Q heads and 8 KV heads, each KV head serves a group of 4 Q heads.
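The grouping can be sketched in a few lines of Python. This is only an illustration: the function name kv_head_for is hypothetical, and it assumes Q heads are grouped contiguously (heads 0-3 share KV head 0, heads 4-7 share KV head 1, and so on), which is a common convention but not the only possible one.

```python
def kv_head_for(q_head: int, n_q_heads: int, n_kv_heads: int) -> int:
    """Return the index of the KV head that a given Q head reads from.

    Assumes contiguous grouping: consecutive Q heads share one KV head.
    """
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads  # Q heads per KV head
    return q_head // group_size


# 32 Q heads and 8 KV heads give groups of 4 Q heads per KV head.
mapping = [kv_head_for(q, 32, 8) for q in range(32)]
print(mapping[:8])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

With n_kv_heads = 1 every Q head maps to KV head 0 (MQA), and with n_kv_heads = n_q_heads each Q head maps to its own KV head (MHA).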
Multi-Query Attention (MQA) is the extreme case: a single K/V head shared by all Q heads. This minimizes KV cache size but reduces the model's ability to learn diverse attention patterns across heads.
GQA is the practical middle ground and is widely used in recent models. It cuts KV cache memory and bandwidth with far less of the quality loss that MQA can cause.
The key insight behind GQA is that Q and K/V play fundamentally different roles in attention. Q (query) determines what each token is looking for — each head can ask a different question. K/V (key/value) determines what information is available to be found.
It turns out that the "available information" can be shared across head groups without much quality loss: multiple query heads can ask different questions of the same key/value data. The queries, however, need to stay independent. If you reduce the number of Q heads, the model has fewer distinct questions to ask and can attend to fewer aspects of the input at once.
This asymmetry is why GQA reduces K/V heads but keeps all Q heads. The memory savings come entirely from the KV cache (which stores per-head entries) and the K/V weight matrices (which are smaller). Q heads and the Q weight matrix stay at full size.
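To make the weight-matrix side concrete, here is a small sketch of the projection shapes. The helper projection_shapes is hypothetical, and d_model = 4096 is an assumption chosen to match 32 Q heads with d_head = 128. The other figures follow the 32 Q head / 8 KV head example used earlier.

```python
def projection_shapes(d_model: int, n_q_heads: int, n_kv_heads: int, d_head: int):
    """Return (rows, cols) for the Q, K, and V projection matrices."""
    return {
        "W_q": (d_model, n_q_heads * d_head),   # unchanged by GQA
        "W_k": (d_model, n_kv_heads * d_head),  # shrinks with n_kv_heads
        "W_v": (d_model, n_kv_heads * d_head),  # shrinks with n_kv_heads
    }


# Assumed example: d_model = 4096, 32 Q heads, 8 KV heads, d_head = 128.
shapes = projection_shapes(4096, 32, 8, 128)
for name, shape in shapes.items():
    print(name, shape, shape[0] * shape[1], "parameters")
```

Under these assumptions W_q stays at 4096 x 4096, while W_k and W_v each shrink to 4096 x 1024, a quarter of their MHA size.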
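For a model with 32 layers, 32 Q heads, d_head = 128, and FP16 storage (2 bytes per value), the KV cache per token is 2 x n_layers x n_kv_heads x d_head x 2 bytes, where the leading 2 counts K and V. That gives 512 KB per token for MHA (32 KV heads), 128 KB for GQA (8 KV heads), and 16 KB for MQA (1 KV head).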
For a 4,096-token context: MHA uses ~2 GB of cache, GQA uses ~512 MB, MQA uses ~64 MB. The difference determines whether the model fits in GPU memory and how fast decode runs (less cache = less data to read per attention step).
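These figures can be reproduced with a short calculation. The sketch below assumes 32 Q heads (so MHA has 32 KV heads) and 8 KV heads for GQA, which is the configuration that matches the numbers above. The function name kv_cache_bytes is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, n_tokens, bytes_per_elem=2):
    """KV cache size in bytes. bytes_per_elem=2 corresponds to FP16."""
    # The leading 2 counts one K vector plus one V vector
    # per KV head, per token, per layer.
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_elem * n_tokens


for name, n_kv in [("MHA", 32), ("GQA", 8), ("MQA", 1)]:
    per_token = kv_cache_bytes(32, n_kv, 128, n_tokens=1)
    total = kv_cache_bytes(32, n_kv, 128, n_tokens=4096)
    print(f"{name}: {per_token // 1024} KiB/token, {total // 2**20} MiB at 4096 tokens")
# MHA: 512 KiB/token, 2048 MiB at 4096 tokens
# GQA: 128 KiB/token, 512 MiB at 4096 tokens
# MQA: 16 KiB/token, 64 MiB at 4096 tokens
```

The cache scales linearly with n_kv_heads, so the ratio between any two variants is simply the ratio of their KV head counts.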
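Consider a model with 8 Q heads, comparing MHA vs GQA vs MQA. MHA keeps 8 KV heads (one per Q head), GQA with groups of 4 keeps 2 KV heads, and MQA keeps 1 KV head.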
GQA uses 4x less KV cache than MHA here. MQA uses 8x less.
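The attention formula stays the same. What changes is the shape of K and V: Q keeps n_q_heads heads, while K and V have only n_kv_heads heads, and each K/V head is shared by group_size = n_q_heads / n_kv_heads Q heads.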
MHA: group_size = 1. MQA: group_size = n_q_heads. GQA: anything in between.
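A minimal NumPy sketch of this idea follows. It is not taken from any particular implementation: the function name gqa_attention is hypothetical, it assumes contiguous grouping of Q heads, and it applies a causal mask as is typical for decoder models. It repeats each KV head group_size times and then runs ordinary scaled dot-product attention.

```python
import numpy as np


def gqa_attention(q, keys, values):
    """q: (n_q_heads, seq, d_head); keys, values: (n_kv_heads, seq, d_head)."""
    n_q_heads, seq_len, d_head = q.shape
    n_kv_heads = keys.shape[0]
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads

    # Repeat each KV head group_size times so that Q head i lines up
    # with KV head i // group_size (contiguous grouping, an assumption).
    keys = np.repeat(keys, group_size, axis=0)
    values = np.repeat(values, group_size, axis=0)

    # Standard scaled dot-product attention from here on.
    scores = q @ keys.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_q_heads, seq, seq)
    future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)  # causal mask
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ values  # (n_q_heads, seq, d_head)


rng = np.random.default_rng(0)
q = rng.standard_normal((8, 5, 16))  # 8 Q heads
k = rng.standard_normal((2, 5, 16))  # 2 KV heads, so group_size = 4
v = rng.standard_normal((2, 5, 16))
out = gqa_attention(q, k, v)
print(out.shape)  # (8, 5, 16)
```

The explicit repeat is for clarity only. It materializes copies of K and V, whereas a real implementation would normally broadcast or index into the shared heads so that the memory savings are preserved.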
In llama.cpp, the number of KV heads is a model hyperparameter (n_head_kv). The build_attn() function handles repeating KV heads to match Q head count. When n_head_kv < n_head, GQA is active.
Fewer KV heads means a smaller KV cache. This directly reduces memory usage and the bandwidth required to read K/V during decode. For long-context models, this is the primary reason GQA exists: it makes long sequences feasible on limited hardware.
A model has 32 Q heads and 8 KV heads (GQA). How many Q heads share each KV head?
If you change from GQA (8 KV heads) to MQA (1 KV head) while keeping 32 Q heads, what happens?