Case Study: Gemma 4 Attention Path
Full self-attention is powerful but expensive: every token attends to every other token, so cost scales quadratically with sequence length. Gemma 4 reduces this cost by interleaving two types of attention layers through the model:
- Full attention: Every token attends to every other token in the context. Used in a minority of layers.
- Sliding Window Attention (SWA): Each token attends only to a fixed-size local window of nearby tokens. Used in the majority of layers.
SWA layers are much cheaper because the attention matrix is sparse: each row has entries only for tokens within the window, so cost grows with sequence length times window size rather than with sequence length squared. The occasional full-attention layer ensures the model can still capture long-range dependencies.
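To make the cost difference concrete, here is a minimal sketch that counts how many (query, key) pairs each layer type scores. The function names `pairs_full` and `pairs_swa` are hypothetical and not part of llama.cpp. The sketch assumes causal decoding and a window that includes the current token; real implementations may define the window boundary slightly differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: (query, key) pairs scored by a causal full-attention
// layer over n tokens. Query q sees keys 0..q, so the total is n(n+1)/2.
int64_t pairs_full(int64_t n) {
    assert(n >= 0);
    return n * (n + 1) / 2;
}

// Hypothetical helper: pairs scored by a sliding-window layer with window w.
// Assumption: the window counts the current token, so query q sees
// min(q + 1, w) keys.
int64_t pairs_swa(int64_t n, int64_t w) {
    assert(n >= 0 && w > 0);
    int64_t total = 0;
    for (int64_t q = 0; q < n; ++q) {
        total += std::min(q + 1, w);
    }
    return total;
}
```

Under these assumptions, a 4096-token sequence with a 512-token window scores roughly a quarter as many pairs in an SWA layer as in a full-attention layer, and the gap widens as the sequence grows.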
On top of this, Gemma 4 uses shared-KV layers. Only a subset of layers (determined by the model config) compute their own K/V projections. The remaining layers skip K/V projection entirely and reuse the KV cache written by an earlier layer; the exact reuse pattern depends on the model's layer configuration. This reduces both memory usage and compute, because shared layers need no K/V weight matrices and write nothing to the cache.
The builder code handles this by checking has_kv(il) for each layer. If true, the layer projects fresh K and V. If false, it passes nullptr for K/V, telling build_attn() to read from an earlier layer's cache entry. The attention computation itself is the same — only the cache source and the window mask differ.
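The lookup of "which cache entry does this layer read" can be sketched as below. This is an illustration only: the `LayerCfg` struct, `kv_source`, and `example_config` are hypothetical names, and the reuse rule (nearest earlier layer of the same attention type that has its own K/V) is an assumption chosen for the example. The actual pattern comes from the model's layer configuration.

```cpp
#include <cassert>
#include <vector>

// Hypothetical per-layer description (not a llama.cpp type).
struct LayerCfg {
    bool has_kv;  // layer computes its own K/V projections
    bool is_swa;  // layer uses sliding-window attention
};

// Returns the index of the layer whose cache entry layer `il` reads:
// itself if it has its own K/V, otherwise (assumed rule) the nearest
// earlier layer of the same attention type that has K/V, or -1 if none.
int kv_source(const std::vector<LayerCfg> & layers, int il) {
    assert(il >= 0 && il < static_cast<int>(layers.size()));
    if (layers[il].has_kv) {
        return il;
    }
    for (int j = il - 1; j >= 0; --j) {
        if (layers[j].has_kv && layers[j].is_swa == layers[il].is_swa) {
            return j;
        }
    }
    return -1;
}

// Illustrative 6-layer config [SWA, SWA, full, SWA, SWA, full] in which
// only the first three layers own their K/V.
std::vector<LayerCfg> example_config() {
    return {
        {true,  true},  {true,  true},  {true,  false},
        {false, true},  {false, true},  {false, false},
    };
}
```

In this illustrative config, layers 3 and 4 read the cache entry written by layer 1, and layer 5 reads the entry written by layer 2.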
A 6-layer Gemma 4 model with window size 3 and layers [SWA, SWA, full, SWA, SWA, full]:
Shared-KV layers avoid K/V projection compute and cache storage — significant savings when most layers are shared.
The attention math is unchanged — the difference is which keys and values are visible:
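The visibility rule for the 6-layer example (window size 3) can be sketched as a mask row per query position. The names `LayerKind`, `mask_row`, and `example_layers` are hypothetical, and the sketch assumes a causal mask whose window includes the current token.

```cpp
#include <cassert>
#include <string>
#include <vector>

enum class LayerKind { SWA, Full };

// Visibility of keys 0..n-1 for query position q: '1' = visible, '0' = masked.
// Full layers see every earlier-or-equal position; SWA layers see only the
// last `window` positions (assumption: the window includes q itself).
std::string mask_row(LayerKind kind, int q, int n, int window) {
    assert(q >= 0 && q < n && window > 0);
    std::string row(n, '0');
    for (int k = 0; k <= q; ++k) {
        const bool in_window = (q - k) < window;
        if (kind == LayerKind::Full || in_window) {
            row[k] = '1';
        }
    }
    return row;
}

// Layer pattern from the example: [SWA, SWA, full, SWA, SWA, full].
std::vector<LayerKind> example_layers() {
    return {LayerKind::SWA, LayerKind::SWA, LayerKind::Full,
            LayerKind::SWA, LayerKind::SWA, LayerKind::Full};
}
```

For the last of six tokens, an SWA layer yields the row `000111` and a full-attention layer yields `111111`: same attention math, different set of visible keys.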
The Gemma 4 iSWA builder splits each layer's attention path based on whether the layer has its own K/V projections or reuses an earlier layer's KV cache. Here is the branching logic:
// self-attention: branch on whether this layer has its own K/V
if (hparams.has_kv(il)) {
    // KV layer: project K and V from the normed hidden state
    ggml_tensor * Kcur = build_lora_mm(model.layers[il].wk, cur);
    ggml_tensor * Vcur = model.layers[il].wv
        ? build_lora_mm(model.layers[il].wv, cur)
        : Kcur; // if no v_proj, reuse K as V

    // Reshape to [d_head, n_kv_heads, n_tokens]
    Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
    Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv, n_tokens);

    // Gemma 4 normalizes K and V before RoPE (unusual!)
    Kcur = build_norm(Kcur, model.layers[il].attn_k_norm, ...);
    Vcur = ggml_rms_norm(ctx0, Vcur, hparams.f_norm_rms_eps);

    // Apply RoPE to K (Q was already rotated above)
    Kcur = ggml_rope_ext(ctx0, Kcur, inp_pos, freq_factors, ...);

    cur = build_attn(inp_attn, model.layers[il].wo, nullptr,
                     Qcur, Kcur, Vcur, nullptr, nullptr, nullptr,
                     hparams.f_attention_scale, il);
} else {
    // Shared-KV layer: no K/V projections, reuses earlier cache
    cur = build_attn(inp_attn,
                     model.layers[il].wo, nullptr,
                     Qcur, nullptr, nullptr, nullptr, nullptr, nullptr,
                     hparams.f_attention_scale, il);
}
Source: ggml-org/llama.cpp @ 94ca829b — src/models/gemma4-iswa.cpp
The key difference: when has_kv(il) is true, the layer computes its own K and V from the hidden state and writes them to a cache. When false, K and V are nullptr — the build_attn helper reads K/V from a shared cache written by an earlier layer. The Q projection always runs; only K/V are conditionally shared.
The build_attn_inp_kv_iswa() call sets up the KV cache infrastructure that makes shared layers work. SWA layers and full-attention layers use different attention masks (a windowed causal mask vs. a full causal mask), but the build_attn() call itself is identical; the infrastructure handles the difference.
The design saves cost in two ways: (1) attention in SWA layers is O(n × w) instead of O(n²), where n is the sequence length and w is the window size, and (2) shared-KV layers reuse an existing cache entry instead of writing their own, which reduces cache memory and avoids extra cache writes during decode. For long sequences, the memory savings from shared-KV can be the difference between fitting in GPU memory and not.
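A rough back-of-the-envelope sketch of the cache arithmetic follows. The function names and all parameter values are hypothetical. It assumes that only layers with their own K/V allocate cache, that full-attention layers store `n_ctx` cells, and that SWA layers store only `window` cells; a real cache may pad or size these differently.

```cpp
#include <cassert>
#include <cstdint>

// Bytes of KV cache for one layer: K and V tensors, each holding
// n_cells x n_head_kv x d_head elements of bytes_per_elem bytes.
int64_t kv_bytes_per_layer(int64_t n_cells, int64_t n_head_kv,
                           int64_t d_head, int64_t bytes_per_elem) {
    assert(n_cells >= 0 && n_head_kv > 0 && d_head > 0 && bytes_per_elem > 0);
    return 2 * n_cells * n_head_kv * d_head * bytes_per_elem;
}

// Total cache bytes when only n_full_kv full-attention layers and n_swa_kv
// SWA layers own a cache entry. Shared-KV layers contribute nothing.
// Simplifying assumption: SWA layers keep `window` cells, full layers n_ctx.
int64_t kv_bytes_total(int64_t n_ctx, int64_t window,
                       int64_t n_full_kv, int64_t n_swa_kv,
                       int64_t n_head_kv, int64_t d_head,
                       int64_t bytes_per_elem) {
    const int64_t full = kv_bytes_per_layer(n_ctx,  n_head_kv, d_head, bytes_per_elem);
    const int64_t swa  = kv_bytes_per_layer(window, n_head_kv, d_head, bytes_per_elem);
    return n_full_kv * full + n_swa_kv * swa;
}
```

With illustrative values (8192-token context, 512-token window, 4 KV heads of dimension 128, 2-byte elements), a full-attention layer's cache is 16 MiB while an SWA layer's is 1 MiB, and every shared-KV layer removes its entry from the total.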
What is the difference between SWA layers and full-attention layers in Gemma 4?
In the Gemma 4 builder, what does has_kv(il) returning false mean for that layer?
Why does Gemma 4 still include some full-attention layers instead of using SWA everywhere?