M5/Architecture Variants
L23

Shared-KV Layers Reuse Memory Instead of Rebuilding It

14 min
Why share K/V across layers?

In a standard transformer, every layer computes its own K and V from scratch by projecting the input through separate weight matrices. This means each layer has its own KV cache, and the total cache grows linearly with the number of layers.

Shared-KV layers break this pattern. Some layers compute fresh K and V (these are the "primary" layers), while other layers reuse the K and V from a previous layer but compute their own fresh Q.

The reusing layer still does full attention — it has its own query vectors, so it asks different questions. But it looks up answers in the same key-value store that an earlier layer built. This avoids the cost of computing and storing new K/V projections.

The trade-off is representational: every layer that reuses K/V cannot learn a different "knowledge index" for that layer. One intuition for why this works in practice: K and V encode information about the tokens themselves — their content, their relationships, their positions. This information changes relatively slowly across adjacent layers because the residual connections preserve most of the representation. Q, on the other hand, encodes what the current layer wants to know — a layer-specific question. Giving each layer its own Q while sharing K/V is like letting different people ask different questions about the same reference index. The index (K/V) is reusable; the questions (Q) should vary.

A shared-KV layer skips two expensive operations compared to a normal layer:

  1. No K/V projection compute. The WK and WV matrix multiplies are skipped entirely. For a layer with d_model = 4096 and 8 KV heads with d_head = 128, that is 2 × 4096 × 1024 × 2 = ~16.8M FLOPs saved per token per layer.
  2. No KV cache write. The layer does not add new K/V entries to the cache — it reads from an existing cache slot. During decode, this means fewer bytes written to memory per step, which matters when memory bandwidth is the bottleneck.

The Q projection and the attention computation itself still run at full cost. Shared-KV is not "skipping attention" — it is skipping the key/value preparation while keeping the attention mechanism intact.

In real architectures like Gemma 4, shared-KV layers are combined with GQA and SWA. The model config specifies which layers compute fresh KV and which layers reuse an earlier layer's cache — both SWA and full-attention layers can be shared. The combination stacks three savings: fewer KV heads (GQA), bounded cache for SWA layers, and no K/V projection in shared layers.

When reading model configs, look for which layers have K/V weights and which do not. A layer without wk and wv entries is a shared-KV layer — it reads from another layer's cache.

A model with 6 layers. Layers 0, 2, 4 compute fresh K/V. Layers 1, 3, 5 reuse the K/V from the previous layer:

Layer 0: fresh Q, fresh K, fresh V   ← primary
Layer 1: fresh Q, reuse K₀, reuse V₀   ← shared
Layer 2: fresh Q, fresh K, fresh V   ← primary
Layer 3: fresh Q, reuse K₂, reuse V₂   ← shared
Layer 4: fresh Q, fresh K, fresh V   ← primary
Layer 5: fresh Q, reuse K₄, reuse V₄   ← shared

KV cache needed: 3 layers worth instead of 6. The W_K and W_V weight matrices are also absent from shared layers, reducing parameter count.

Primary layer: Q, K, V each projected from input → full KV cache slot
Shared layer: Q projected from input; K, V read from earlier cache slot
Attention output shape is unchanged: [n_tokens, d_model].
Primary layer ℓ:   Qℓ = xW_Qℓ,   Kℓ = xW_Kℓ,   Vℓ = xW_Vℓ
Shared layer ℓ':   Qℓ' = xW_Qℓ',   Kℓ' = Kℓ,   Vℓ' = Vℓ

The shared layer has no W_K or W_V of its own. It borrows K and V from the designated primary layer.

In llama.cpp, shared-KV layers reference an earlier layer's KV cache entry rather than allocating their own. The model config specifies which layers share and which are primary. Some architectures use shared-KV layers.

Shared-KV layers save in three ways: (1) no K/V projection compute in the shared layer, (2) no additional KV cache memory, (3) fewer weight parameters to load from memory. In a memory-bandwidth-bound decode step, skipping K/V projection and cache writes is a direct speedup.

Check Yourself
reasoningQ1

A shared-KV layer skips the K and V projections. But it still computes Q from its own learned W_Q. Why can it not also skip Q?

mathQ2

If a model has 32 layers and half of them are shared-KV layers, how many layers worth of KV cache are needed?