L23

Shared-KV Layers Reuse Memory Instead of Rebuilding It

14 min

Question

Why share K/V across layers?

Intuition

In a standard transformer, every layer computes its own K and V from scratch by projecting the input through separate weight matrices. This means each layer has its own KV cache, and the total cache grows linearly with the number of layers.

Shared-KV layers break this pattern. Some layers compute fresh K and V (these are the "primary" layers), while other layers reuse the K and V from a previous layer but compute their own fresh Q.

The reusing layer still does full attention — it has its own query vectors, so it asks different questions. But it looks up answers in the same key-value store that an earlier layer built. This avoids the cost of computing and storing new K/V projections.

The trade-off is representational: a layer that reuses K/V cannot learn its own "knowledge index". One intuition for why this works in practice: K and V encode information about the tokens themselves (their content, their relationships, their positions). This information changes relatively slowly across adjacent layers because the residual connections preserve most of the representation. Q, on the other hand, encodes what the current layer wants to know: a layer-specific question. Giving each layer its own Q while sharing K/V is like letting different people ask different questions about the same reference index. The index (K/V) is reusable; the questions (Q) should vary.

What a Shared Layer Saves

A shared-KV layer skips two expensive operations compared to a normal layer:

No K/V projection compute. The W_K and W_V matrix multiplies are skipped entirely. For a layer with d_model = 4096 and 8 KV heads with d_head = 128, each projection is a 4096 × 1024 matrix, so the saving is 2 projections × 4096 × 1024 multiply-accumulates × 2 FLOPs each ≈ 16.8M FLOPs per token per layer.
No KV cache write. The layer does not add new K/V entries to the cache — it reads from an existing cache slot. During decode, this means fewer bytes written to memory per step, which matters when memory bandwidth is the bottleneck.

The Q projection and the attention computation itself still run at full cost. Shared-KV is not "skipping attention" — it is skipping the key/value preparation while keeping the attention mechanism intact.
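The FLOP figure quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch using the layer shape from the text; the variable names are illustrative, and it counts one multiply and one add per weight.

```python
# Layer shape taken from the example in the text.
d_model = 4096
n_kv_heads = 8
d_head = 128

kv_dim = n_kv_heads * d_head          # output width of W_K (and of W_V)
macs_per_proj = d_model * kv_dim      # one multiply-accumulate per weight
flops_per_proj = 2 * macs_per_proj    # count the multiply and the add
flops_saved = 2 * flops_per_proj      # two skipped projections: W_K and W_V

# The same two matrices are also absent from the shared layer's weights.
params_saved = 2 * d_model * kv_dim

print(f"K/V projection FLOPs saved per token per layer: {flops_saved:,}")
print(f"K/V projection parameters absent per shared layer: {params_saved:,}")
```

The Q projection and the attention scores are not part of this count, since a shared layer still computes both.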

How Shared-KV Interacts with GQA and SWA

In real architectures like Gemma 4, shared-KV layers are combined with GQA and SWA. The model config specifies which layers compute fresh KV and which layers reuse an earlier layer's cache — both SWA and full-attention layers can be shared. The combination stacks three savings: fewer KV heads (GQA), bounded cache for SWA layers, and no K/V projection in shared layers.

When reading model configs, look for which layers have K/V weights and which do not. A layer without wk and wv entries is a shared-KV layer — it reads from another layer's cache.
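As a rough illustration of that check, the sketch below scans a list of tensor names and reports the layers that have a query weight but no key or value weights. The `layers.N.wq` naming scheme and the `find_shared_kv_layers` helper are hypothetical; real checkpoints use their own names, so the pattern would need adjusting.

```python
import re

# Hypothetical tensor names for a 4-layer model (an assumption for
# illustration; not the naming of any particular checkpoint format).
tensor_names = [
    "layers.0.wq", "layers.0.wk", "layers.0.wv",
    "layers.1.wq",
    "layers.2.wq", "layers.2.wk", "layers.2.wv",
    "layers.3.wq",
]

def find_shared_kv_layers(names):
    """Return sorted layer indices that have wq but lack wk or wv."""
    per_layer = {}
    for name in names:
        match = re.fullmatch(r"layers\.(\d+)\.(wq|wk|wv)", name)
        if match:
            layer = int(match.group(1))
            per_layer.setdefault(layer, set()).add(match.group(2))
    return sorted(
        layer
        for layer, kinds in per_layer.items()
        if "wq" in kinds and not {"wk", "wv"} <= kinds
    )

shared = find_shared_kv_layers(tensor_names)
print("shared-KV layers:", shared)
```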

Toy Example

A model with 6 layers. Layers 0, 2, 4 compute fresh K/V. Layers 1, 3, 5 reuse the K/V from the previous layer:

Layer 0: fresh Q, fresh K, fresh V ← primary

Layer 1: fresh Q, reuse K₀, reuse V₀ ← shared

Layer 2: fresh Q, fresh K, fresh V ← primary

Layer 3: fresh Q, reuse K₂, reuse V₂ ← shared

Layer 4: fresh Q, fresh K, fresh V ← primary

Layer 5: fresh Q, reuse K₄, reuse V₄ ← shared

KV cache needed: 3 layers' worth instead of 6. The W_K and W_V weight matrices are also absent from shared layers, which reduces the parameter count.
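The toy layout can be written as a small lookup table that maps each layer to the layer whose K/V it reads. The sketch below counts cache slots from that table; the per-slot sizes (8 KV heads, d_head = 128, 4096 tokens of context, 2 bytes per element) are assumed purely for illustration and are not part of the toy example.

```python
# kv_source[i] is the layer whose K/V cache layer i attends over.
# A layer that points at itself is a primary layer.
kv_source = [0, 0, 2, 2, 4, 4]

primary_layers = sorted(set(kv_source))
n_cache_slots = len(primary_layers)

# Assumed sizes, for illustration only.
n_kv_heads, d_head, n_ctx, bytes_per_elem = 8, 128, 4096, 2
bytes_per_slot = 2 * n_kv_heads * d_head * n_ctx * bytes_per_elem  # K and V

shared_total = n_cache_slots * bytes_per_slot
baseline_total = len(kv_source) * bytes_per_slot

print("primary layers:", primary_layers)
print("cache slots:", n_cache_slots, "of", len(kv_source))
print("cache MiB with sharing:", shared_total / 2**20)
print("cache MiB without sharing:", baseline_total / 2**20)
```

With every other layer shared, the cache is half the size it would be if all six layers kept their own K/V.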

Shapes

Primary layer: Q, K, V each projected from input → full KV cache slot

Shared layer: Q projected from input; K, V read from earlier cache slot

Attention output shape is unchanged: [n_tokens, d_model].

Math

Primary layer ℓ: Qℓ = xW_Qℓ, Kℓ = xW_Kℓ, Vℓ = xW_Vℓ

Shared layer ℓ': Qℓ' = xW_Qℓ', Kℓ' = Kℓ, Vℓ' = Vℓ

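The shared layer has no W_K or W_V of its own. It borrows K and V from the designated primary layer. Note that x is each layer's own input: the shared layer computes Q from its own input, while the borrowed K and V were computed from the primary layer's input.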
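The two formulas can be sketched in plain Python. This is a simplified illustration rather than any library's implementation: it uses a single head, no causal mask, no output projection, and random weights, and the helper names (`matmul`, `attend`, `rand_matrix`) are made up for this example.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, k, v):
    """Single-head scaled dot-product attention, unmasked for brevity."""
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s * scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n_tokens, d_model, d_head = 3, 4, 2

# Primary layer: owns W_Q, W_K, W_V and fills a cache slot with K and V.
x_primary = rand_matrix(n_tokens, d_model, rng)
w_q0, w_k0, w_v0 = (rand_matrix(d_model, d_head, rng) for _ in range(3))
k0 = matmul(x_primary, w_k0)
v0 = matmul(x_primary, w_v0)
out_primary = attend(matmul(x_primary, w_q0), k0, v0)

# Shared layer: owns only W_Q; K and V are read from the primary slot.
x_shared = rand_matrix(n_tokens, d_model, rng)  # stand-in for this layer's input
w_q1 = rand_matrix(d_model, d_head, rng)
out_shared = attend(matmul(x_shared, w_q1), k0, v0)

print(len(out_shared), len(out_shared[0]))  # rows = n_tokens, cols = d_head
```

Both layers run the same `attend` function over the same `k0` and `v0`; only the queries differ, which is why their outputs differ.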

Implementation Hook

In llama.cpp, shared-KV layers reference an earlier layer's KV cache entry rather than allocating their own. The model config specifies which layers share and which are primary. Not every architecture uses shared-KV layers, so this applies only to models whose config defines them.

src/llama-graph.cpp — build_attn()

Performance Hook

Shared-KV layers save in three ways: (1) no K/V projection compute in the shared layer, (2) no additional KV cache memory, (3) fewer weight parameters to load from memory. In a memory-bandwidth-bound decode step, skipping K/V projection and cache writes is a direct speedup.

Check Yourself

Q1 (reasoning)

A shared-KV layer skips the K and V projections. But it still computes Q from its own learned W_Q. Why can it not also skip Q?

Because Q determines what each token is looking for; different layers need different queries to learn different patterns

Because Q is always cheaper to compute than K and V

Because Q is not stored in any cache, so it must be recomputed regardless

Q2 (math)

If a model has 32 layers and half of them are shared-KV layers, how many layers' worth of KV cache are needed?

32

16

8