L40

Case Study: Gemma 4 Attention Path

20 min

Question

How does Gemma 4 handle attention?

Intuition

Full self-attention is powerful but expensive: every token attends to every other token, so cost scales quadratically with sequence length. Gemma 4 reduces this cost by alternating two types of attention layers through the model:

Full attention: Every token attends to every other token in the context. Used in a minority of layers.
Sliding Window Attention (SWA): Each token attends only to a fixed-size local window of nearby tokens. Used in the majority of layers.

SWA layers are much cheaper because the attention matrix is sparse — each row only has entries for tokens within the window. The occasional full-attention layer ensures the model can still capture long-range dependencies.
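
The two visibility rules can be sketched as a small predicate. This is an illustrative sketch, not llama.cpp code: is_visible and visible_positions are hypothetical names, attention is assumed to be causal, and the window is assumed to include the current token.

```cpp
#include <vector>

// Hypothetical helper (not from llama.cpp): can the query at q_pos see the key at k_pos?
// window <= 0 selects full causal attention; otherwise only the last `window`
// positions up to and including q_pos are visible.
bool is_visible(int q_pos, int k_pos, int window) {
    if (k_pos > q_pos) {
        return false; // causal: never look at future tokens
    }
    if (window <= 0) {
        return true;  // full-attention layer
    }
    return q_pos - k_pos < window; // SWA layer: stay inside the local window
}

// Collect every key position the query at q_pos attends to.
std::vector<int> visible_positions(int q_pos, int window) {
    std::vector<int> out;
    for (int k = 0; k <= q_pos; ++k) {
        if (is_visible(q_pos, k, window)) {
            out.push_back(k);
        }
    }
    return out;
}
```

With a window of 3, visible_positions(7, 3) yields positions 5, 6, and 7, while visible_positions(7, 0) yields all eight positions 0 through 7.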

On top of this, Gemma 4 uses shared-KV layers. Only a subset of layers (determined by the model config) compute their own K/V projections. The remaining layers skip K/V projection entirely and reuse the KV cache written by an earlier layer. The exact reuse pattern depends on the model's layer configuration. This reduces both memory usage and compute: shared layers need no K/V weight matrices and write nothing to the cache.

The builder code handles this by checking has_kv(il) for each layer. If true, the layer projects fresh K and V. If false, it passes nullptr for K/V, telling build_attn() to read from an earlier layer's cache entry. The attention computation itself is the same — only the cache source and the window mask differ.

Toy Example

A 6-layer Gemma 4 model with window size 3 and layers [SWA, SWA, full, SWA, SWA, full]:

sequence: [A, B, C, D, E, F, G, H] (8 tokens)

SWA layer (window=3): token H attends to [F, G, H] only

Full layer: token H attends to [A, B, C, D, E, F, G, H]

If layers 0 and 2 have fresh KV (has_kv=true), layers 1, 3, 4, and 5 reuse earlier caches. Only 2 KV entries are stored instead of 6.

Shared-KV layers avoid K/V projection compute and cache storage — significant savings when most layers are shared.
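
The bookkeeping in this toy example can be sketched as a mapping from each layer to the layer whose cache it reads. Everything here is hypothetical: LayerSpec, kv_source_layers, and count_kv_caches are illustration-only names, and the rule that a shared layer reuses the most recent earlier KV layer of the same attention type is an assumption, since the real pattern comes from the model config.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical layer description, for illustration only.
struct LayerSpec {
    bool is_swa; // true: sliding-window layer, false: full-attention layer
    bool has_kv; // true: projects and caches its own K/V
};

// For each layer, return the index of the layer whose cache it reads.
// Assumption: a shared layer reuses the most recent earlier layer of the
// same attention type that owns its K/V.
std::vector<int> kv_source_layers(const std::vector<LayerSpec> & layers) {
    std::vector<int> src(layers.size(), -1);
    int last_swa  = -1;
    int last_full = -1;
    for (std::size_t il = 0; il < layers.size(); ++il) {
        int & last = layers[il].is_swa ? last_swa : last_full;
        if (layers[il].has_kv) {
            last = static_cast<int>(il); // this layer writes a fresh cache
        }
        assert(last >= 0 && "shared layer has no earlier KV layer of its type");
        src[il] = last;
    }
    return src;
}

// Number of distinct caches that actually get stored.
std::size_t count_kv_caches(const std::vector<int> & src) {
    return std::set<int>(src.begin(), src.end()).size();
}
```

For the layout [SWA, SWA, full, SWA, SWA, full] with has_kv set on layers 0 and 2, this sketch maps the layers to sources [0, 0, 2, 0, 0, 2], which gives two stored caches.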

Shapes

Full attention: Q × K^T → [n_tokens, n_tokens] (all-to-all)

SWA attention: Q × K^T → [n_tokens, window_size] (local only)

Shared SWA KV cache: [window_size, d_k] (one cache for all SWA layers)

Per-layer full KV cache: [n_tokens, d_k] (one cache per full-attention layer that has its own K/V)
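
To make the cache shapes concrete, here is a rough element count for the stored K caches. CacheSpec, cache_rows, and total_k_elems are hypothetical names, and the count ignores heads, batching, and data type, so treat it as a back-of-the-envelope sketch rather than the allocator's real logic.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical description of one stored cache (not a llama.cpp type).
struct CacheSpec {
    bool is_swa; // true: bounded by the window, false: grows with the sequence
};

// Rows in one K (or V) cache, following the shapes listed above.
std::size_t cache_rows(const CacheSpec & c, std::size_t n_tokens, std::size_t window_size) {
    return c.is_swa ? window_size : n_tokens;
}

// Total K elements across all stored caches; V has the same count.
std::size_t total_k_elems(const std::vector<CacheSpec> & caches,
                          std::size_t n_tokens, std::size_t window_size, std::size_t d_k) {
    std::size_t total = 0;
    for (const CacheSpec & c : caches) {
        total += cache_rows(c, n_tokens, window_size) * d_k;
    }
    return total;
}
```

With 8 tokens, a window of 3, and an assumed d_k of 4, one SWA cache plus one full cache holds (3 + 8) × 4 = 44 K elements, while six unshared full caches would hold 6 × 8 × 4 = 192.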

Math

The attention math is unchanged — the difference is which keys and values are visible:

Full: scores = Q ⋅ K^T / √d_k (all positions)

SWA: scores = Q ⋅ K_window^T / √d_k (local window only)

Positions outside the SWA window are masked to −∞ before softmax.
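
The effect of the −∞ mask can be seen in a single-row softmax. This is a minimal sketch with a hypothetical helper name (masked_softmax); it assumes at least one position is visible, which holds whenever a token can attend to itself.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Softmax over one row of attention scores after applying a visibility mask.
// visible[k] == false marks a key outside the window (or in the future).
std::vector<float> masked_softmax(std::vector<float> scores, const std::vector<bool> & visible) {
    assert(scores.size() == visible.size());
    const float neg_inf = -std::numeric_limits<float>::infinity();

    float max_score = neg_inf;
    for (std::size_t k = 0; k < scores.size(); ++k) {
        if (!visible[k]) {
            scores[k] = neg_inf; // masked before softmax
        }
        max_score = std::max(max_score, scores[k]);
    }
    assert(max_score > neg_inf && "at least one position must be visible");

    float sum = 0.0f;
    for (float & s : scores) {
        s = std::exp(s - max_score); // exp(-inf) is exactly 0
        sum += s;
    }
    for (float & s : scores) {
        s /= sum;
    }
    return scores;
}
```

Masked positions come out with weight exactly zero, so the visible keys share the full probability mass. With eight equal scores and only the last three positions visible, each visible key gets weight 1/3.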

The Real Code

The Gemma 4 iSWA builder splits each layer's attention path based on whether the layer has its own K/V projections or reuses an earlier layer's KV cache. Here is the branching logic:

// self-attention: branch on whether this layer has its own K/V
if (hparams.has_kv(il)) {
    // KV layer: project K and V from the normed hidden state
    ggml_tensor * Kcur = build_lora_mm(model.layers[il].wk, cur);
    ggml_tensor * Vcur = model.layers[il].wv
                            ? build_lora_mm(model.layers[il].wv, cur)
                            : Kcur; // if no v_proj, reuse K as V

    // Reshape to [d_head, n_kv_heads, n_tokens]
    Kcur = ggml_reshape_3d(ctx0, Kcur, n_embd_head, n_head_kv, n_tokens);
    Vcur = ggml_reshape_3d(ctx0, Vcur, n_embd_head, n_head_kv, n_tokens);

    // Gemma 4 normalizes K and V before RoPE — unusual!
    Kcur = build_norm(Kcur, model.layers[il].attn_k_norm, ...);
    Vcur = ggml_rms_norm(ctx0, Vcur, hparams.f_norm_rms_eps);

    // Apply RoPE to K (Q was already rotated above)
    Kcur = ggml_rope_ext(ctx0, Kcur, inp_pos, freq_factors, ...);

    cur = build_attn(inp_attn, model.layers[il].wo, nullptr,
            Qcur, Kcur, Vcur, nullptr, nullptr, nullptr,
            hparams.f_attention_scale, il);
} else {
    // Shared-KV layer: no K/V projections, reuses earlier cache
    cur = build_attn(inp_attn,
            model.layers[il].wo, nullptr,
            Qcur, nullptr, nullptr, nullptr, nullptr, nullptr,
            hparams.f_attention_scale, il);
}

Source: ggml-org/llama.cpp @ 94ca829b — src/models/gemma4-iswa.cpp

The key difference: when has_kv(il) is true, the layer computes its own K and V from the hidden state and writes them to a cache. When false, K and V are nullptr — the build_attn helper reads K/V from a shared cache written by an earlier layer. The Q projection always runs; only K/V are conditionally shared.

Implementation Hook

The build_attn_inp_kv_iswa() call sets up the KV cache infrastructure that makes shared layers work. SWA layers and full-attention layers use different attention masks (windowed causal vs. full causal), but the build_attn() call itself is identical; the infrastructure handles the difference.

src/models/gemma4-iswa.cpp — layer loop (L30)

Performance Hook

SWA layers are cheaper in two ways: (1) the attention computation is O(n × w) instead of O(n²), where n is the sequence length and w is the window size, and (2) sharing one KV cache across all SWA layers reduces memory bandwidth during decode. For long sequences, the memory savings from shared KV can decide whether the model fits in GPU memory.
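
The O(n × w) versus O(n²) gap can be checked by counting score entries directly. The function below is a hypothetical illustration that assumes causal attention over a prompt of n tokens and counts (query, key) pairs rather than measuring kernel time.

```cpp
#include <algorithm>
#include <cstddef>

// Number of (query, key) score entries computed for a causal prompt of n tokens.
// window == 0 selects full attention; otherwise each query sees at most `window` keys.
std::size_t score_entries(std::size_t n, std::size_t window) {
    std::size_t total = 0;
    for (std::size_t q = 0; q < n; ++q) {
        const std::size_t seen = q + 1; // keys at or before position q
        total += (window == 0) ? seen : std::min(seen, window);
    }
    return total;
}
```

For the 8-token toy sequence this gives 36 entries for a full layer and 21 for a window of 3. At n = 4096 and an assumed window of 512, the counts are 8,390,656 versus 1,966,336, and the ratio keeps growing with n because the SWA count is linear in sequence length.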

Check Yourself

Q1 (conceptual)

What is the difference between SWA layers and full-attention layers in Gemma 4?

SWA layers use smaller weight matrices than full-attention layers
SWA layers only attend to a local window of nearby tokens, while full-attention layers attend to all tokens
SWA layers skip the attention step entirely and only run FFN

Q2 (conceptual)

In the Gemma 4 builder, what does has_kv(il) returning false mean for that layer?

The layer uses a different attention algorithm
The layer skips K/V projection and reuses an earlier layer's KV cache entry instead of computing its own
The layer has no attention at all; it only runs FFN

Q3 (conceptual)

Why does Gemma 4 still include some full-attention layers instead of using SWA everywhere?

Full-attention layers are needed for the FFN to work correctly
SWA cannot capture long-range dependencies beyond the window, so occasional full-attention layers provide global context
Full-attention layers are faster than SWA layers for short sequences