L26

Gemma 4 Combines Several Variants at Once

Question

What makes Gemma 4 different?

Intuition

A plain dense decoder repeats the same block — attention then FFN — for every layer. Gemma 4 breaks this uniformity by mixing several of the variants you have already learned:

SWA pattern: Some layers use sliding-window attention with a fixed window size. Others use full causal attention. The pattern alternates on a schedule defined in the model config.
Shared-KV: Certain layers reuse K/V from an earlier layer instead of computing their own. This cuts KV cache size and removes K/V projection cost from those layers.
MoE layers: Some layers replace the dense FFN with a Mixture-of-Experts layer. Each token routes to a small subset of experts, keeping per-token compute manageable while the total parameter count is large.
Per-layer config: Each layer can have its own combination: SWA or full attention, shared or fresh K/V, dense FFN or MoE. The model is no longer a uniform stack.

This heterogeneity makes Gemma 4 more complex to implement and reason about, but it allows the architecture to spend resources where they matter most: full attention and fresh K/V for layers that need long-range or diverse representations, local attention and shared K/V for layers where full context and separate projections would be largely redundant.

Why Heterogeneous Layers

A uniform decoder stack treats every layer identically: each layer gets the same attention scope, its own K/V projections, and the same FFN size. This is simple but can be wasteful. Studies of layer-wise behavior suggest that layers play different roles:

Early layers tend to learn local syntactic patterns — word order, phrase boundaries, morphology. They rarely need to attend 100K tokens back. A sliding window is sufficient and much cheaper.
Middle layers learn semantic relationships that sometimes span long distances (coreference, discourse structure) but often do not. An alternating pattern — some full attention, some local — captures both needs without paying the full cost everywhere.
Later layers refine representations toward the specific prediction task. These often benefit from full attention (to access any information the earlier layers gathered) and from MoE FFNs (to specialize the per-token transformation without making the dense FFN enormous).

The key insight: a uniform stack forces you to provision every layer for the worst case. If any layer needs full attention, all layers get it. If any layer needs a large FFN, all layers get it. Heterogeneous design lets each layer be right-sized for its actual role.

How Config Specifies Per-Layer Behavior

In a uniform model, the config is simple: one attention type, one FFN type, applied everywhere. Gemma 4 introduces per-layer config arrays. The model's metadata contains arrays that map each layer index to its specific settings:

// Conceptual config (simplified from actual model metadata)

attn_type: [full, swa, full, swa, full, swa, ...]

kv_shared_layer: [-1, 0, -1, 2, -1, 4, ...] // -1 = fresh, N = share from layer N

ffn_type: [dense, dense, moe, dense, dense, moe, ...]

The model builder reads these arrays and dispatches differently for each layer. This is fundamentally different from the uniform pattern where a single for i in range(n_layers) loop calls the same functions. Now each iteration must check: What attention type? Do I compute K/V or borrow? Dense FFN or MoE?
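The three per-layer questions can be sketched as a small dispatch function. This is an illustrative Python sketch, not llama.cpp's code: the array names follow the conceptual config above, while `plan_layer`, `plan_model`, and the step strings are hypothetical names chosen for this example.

```python
# Hypothetical per-layer arrays, mirroring the conceptual config above.
ATTN_TYPE = ["full", "swa", "full", "swa", "full", "swa"]
KV_SHARED_LAYER = [-1, 0, -1, 2, -1, 4]  # -1 = fresh, N = share from layer N
FFN_TYPE = ["dense", "dense", "moe", "dense", "dense", "moe"]


def plan_layer(i):
    """Return the build steps for layer i as plain strings (illustrative only)."""
    steps = []

    # Do I compute K/V or borrow?
    src = KV_SHARED_LAYER[i]
    if src == -1:
        steps.append("project_qkv")
    else:
        steps.append(f"project_q_reuse_kv_from_{src}")

    # What attention type?
    if ATTN_TYPE[i] == "full":
        steps.append("attention_full_causal")
    elif ATTN_TYPE[i] == "swa":
        steps.append("attention_sliding_window")
    else:
        raise ValueError(f"unknown attention type: {ATTN_TYPE[i]}")

    # Dense FFN or MoE?
    if FFN_TYPE[i] == "dense":
        steps.append("ffn_dense")
    elif FFN_TYPE[i] == "moe":
        steps.append("ffn_moe")
    else:
        raise ValueError(f"unknown FFN type: {FFN_TYPE[i]}")

    return steps


def plan_model():
    """Plan every layer; no layer is assumed to match any other."""
    return [plan_layer(i) for i in range(len(ATTN_TYPE))]


if __name__ == "__main__":
    for i, steps in enumerate(plan_model()):
        print(i, steps)
```

The loop itself is still a single pass over the layers. What changes is that each iteration consults three arrays before deciding what to build.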

For the implementer, this means more branching in the graph-building code. For the performance analyst, it means you cannot profile "a representative layer" — different layers have genuinely different costs and bottlenecks.

The Memory Budget Tradeoff

Heterogeneous design has a concrete memory payoff. Consider a hypothetical 42-layer model:

Uniform design (42 layers, all full attention, all fresh KV):

KV cache entries: 42 layers × n_kv_heads × 2 (K+V)

Gemma 4-style (14 fresh KV, 28 shared; 14 full attn, 28 SWA w=4096):

Fresh KV: 14 layers × n_kv_heads × 2 (K+V)

Shared KV: 0 additional cache (reuse from fresh layers)

SWA layers: cache bounded at w=4096 positions per head (vs growing with context length for full attention)

Together, shared K/V (cutting the number of layers that hold their own cache from 42 to 14, a two-thirds reduction) and SWA (capping the cache of a windowed layer at w positions) substantially shrink the total KV cache footprint. This is what makes long-context serving feasible: at 128K context, a uniform model with 42 full-attention, fresh-KV layers must cache every position in every layer. The heterogeneous design aims for similar quality with a fraction of that memory.
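To put rough numbers on the comparison, here is a minimal back-of-the-envelope calculator. Everything numeric is an assumption for illustration, not a Gemma 4 value: 8 KV heads, a head dimension of 128, a 16-bit cache, and the 14 fresh-KV layers taken to be the full-attention ones. The function name `kv_cache_bytes` is hypothetical.

```python
def kv_cache_bytes(n_full_fresh, n_swa_fresh, n_ctx, window,
                   n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache bytes for one sequence.

    Only layers that compute their own K/V ("fresh") hold cache;
    layers that share K/V add nothing.
    """
    # Bytes stored per cached position in one layer: K and V.
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem
    # A sliding-window layer keeps at most `window` positions.
    swa_tokens = min(n_ctx, window)
    return per_token * (n_full_fresh * n_ctx + n_swa_fresh * swa_tokens)


# Assumed values, for illustration only.
N_CTX = 128 * 1024
WINDOW = 4096
N_KV_HEADS = 8
HEAD_DIM = 128

# 42 layers, all full attention, all fresh K/V.
uniform = kv_cache_bytes(42, 0, N_CTX, WINDOW, N_KV_HEADS, HEAD_DIM)
# SWA alone: 14 full + 28 windowed layers, every layer still owns its K/V.
swa_only = kv_cache_bytes(14, 28, N_CTX, WINDOW, N_KV_HEADS, HEAD_DIM)
# SWA plus sharing: only the 14 full-attention layers own K/V (assumption).
swa_shared = kv_cache_bytes(14, 0, N_CTX, WINDOW, N_KV_HEADS, HEAD_DIM)

GIB = 1024 ** 3
if __name__ == "__main__":
    for name, size in [("uniform", uniform),
                       ("SWA only", swa_only),
                       ("SWA + shared K/V", swa_shared)]:
        print(f"{name:<18} {size / GIB:6.2f} GiB")
```

Under these assumptions the uniform stack needs 21 GiB of cache per sequence and the design with 14 fresh full-attention layers needs 7 GiB. Real figures depend on the actual head counts, cache precision, and which layers hold their own K/V.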

Toy Example

Simplified 6-layer Gemma 4-style config:

Layer 0: full attn, fresh KV, dense FFN

Layer 1: SWA, shared KV (from 0), dense FFN

Layer 2: full attn, fresh KV, MoE FFN

Layer 3: SWA, shared KV (from 2), dense FFN

Layer 4: full attn, fresh KV, dense FFN

Layer 5: SWA, shared KV (from 4), MoE FFN

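Neighboring layers differ in attention type, KV source, or FFN type, and a layer matches another only by coincidence (here, layers 0 and 4). The build loop cannot assume any pattern and must check each layer's config.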
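The toy config can also be written down as data and sanity-checked before anything is built. This is a sketch under stated assumptions: `LayerConfig`, `TOY_LAYERS`, `validate`, and `distinct_configs` are hypothetical names, and the rule that a shared layer must borrow from an earlier layer that computes its own K/V is an assumption drawn from the pattern above, not a documented constraint.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LayerConfig:
    attn: str                # "full" or "swa"
    kv_from: Optional[int]   # None = fresh K/V, otherwise the source layer index
    ffn: str                 # "dense" or "moe"


# The 6-layer toy example from above.
TOY_LAYERS = [
    LayerConfig("full", None, "dense"),
    LayerConfig("swa", 0, "dense"),
    LayerConfig("full", None, "moe"),
    LayerConfig("swa", 2, "dense"),
    LayerConfig("full", None, "dense"),
    LayerConfig("swa", 4, "moe"),
]


def validate(layers):
    """Raise ValueError if any layer's config is inconsistent."""
    for i, cfg in enumerate(layers):
        if cfg.attn not in ("full", "swa"):
            raise ValueError(f"layer {i}: unknown attention type {cfg.attn!r}")
        if cfg.ffn not in ("dense", "moe"):
            raise ValueError(f"layer {i}: unknown FFN type {cfg.ffn!r}")
        if cfg.kv_from is not None:
            # Assumed rule: borrow only from an earlier layer...
            if not 0 <= cfg.kv_from < i:
                raise ValueError(f"layer {i}: KV source must be an earlier layer")
            # ...and only from one that computes its own K/V.
            if layers[cfg.kv_from].kv_from is not None:
                raise ValueError(f"layer {i}: KV source must have fresh K/V")


def distinct_configs(layers):
    """Count distinct (attention, fresh-or-shared, FFN) combinations."""
    return len({(c.attn, c.kv_from is None, c.ffn) for c in layers})


if __name__ == "__main__":
    validate(TOY_LAYERS)
    print(distinct_configs(TOY_LAYERS), "distinct layer shapes")
```

Checking the config up front keeps the build loop simple: by the time it runs, every borrowed K/V is known to point at a layer that has already been built.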

Shapes

Input and output per layer: [n_tokens, d_model] unchanged

What varies per layer:

Attention window: full (n_tokens) or sliding (w)

KV source: own projection or borrowed from another layer

FFN type: dense [d_model → d_ff → d_model] or MoE [routed across n_experts]
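The difference in attention window is easiest to see as a mask. The sketch below builds both variants in pure Python. The helper name `attention_mask` is hypothetical, and the window convention (a query sees itself plus the previous w - 1 positions) is an assumption, since implementations differ on whether the current token counts toward the window.

```python
def attention_mask(n_tokens, window=None):
    """Build a boolean mask: mask[q][k] is True if query q may attend to key k.

    window=None gives full causal attention; an integer gives a sliding window.
    """
    mask = []
    for q in range(n_tokens):
        row = []
        for k in range(n_tokens):
            causal = k <= q                               # never look ahead
            in_window = window is None or q - k < window  # stay within w positions
            row.append(causal and in_window)
        mask.append(row)
    return mask


if __name__ == "__main__":
    for name, m in [("full", attention_mask(5)),
                    ("swa w=2", attention_mask(5, window=2))]:
        print(name)
        for row in m:
            print("".join("x" if allowed else "." for allowed in row))
```

In the full mask, the number of visible keys grows with the query position. In the sliding-window mask it stops growing at w, which is why the cache for such a layer can be capped.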

Math

No new formula. Gemma 4 composes the pieces you already know:

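For each layer $\ell$:

$\text{attn\_type}_{\ell} \in \{\text{full}, \text{SWA}\}$

$\text{kv\_source}_{\ell} \in \{\text{fresh}, \text{shared}(\ell')\}$

$\text{ffn\_type}_{\ell} \in \{\text{dense}, \text{MoE}\}$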

Implementation Hook

In llama.cpp, Gemma 4's model builder loops over layers and checks per-layer config to decide the attention type, KV sharing, and FFN type. This makes it one of the more complex model builders in the codebase, because no layer is assumed to be like any other.

src/models/gemma4-iswa.cpp — layer loop (L30)

Performance Hook

Heterogeneous layers make performance analysis harder. Some layers are cheap (SWA + shared KV + dense FFN). Some are expensive (full attention + fresh KV + MoE). Profiling a Gemma 4 model shows uneven per-layer timing. You cannot assume every layer costs the same, which is a common mistake carried over from reasoning about plain dense decoders.

Next, you will see how these architectures actually execute at inference time: the split between prefill and decode, and why the distinction matters for every optimization decision.

Check Yourself

Q1 (conceptual)

What makes Gemma 4 more complex than a plain dense decoder?

It uses a larger vocabulary
Each layer can have a different combination of attention type, KV source, and FFN type
It uses a different activation function

Q2 (conceptual)

Why can you not assume every layer in Gemma 4 has the same inference cost?

Because different layers have different d_model sizes
Because layers vary in attention scope, KV computation, and whether the FFN is dense or MoE
Because the optimizer treats each layer differently