Gemma 4 Combines Several Variants at Once
A plain dense decoder repeats the same block — attention then FFN — for every layer. Gemma 4 breaks this uniformity by mixing several of the variants you have already learned:
- SWA pattern: Some layers use sliding-window attention with a fixed window size. Others use full causal attention. The pattern alternates on a schedule defined in the model config.
- Shared-KV: Certain layers reuse K/V from an earlier layer instead of computing their own. This cuts KV cache size and removes K/V projection cost from those layers.
- MoE layers: Some layers replace the dense FFN with a Mixture-of-Experts layer. Each token routes to a small subset of experts, keeping per-token compute manageable while the total parameter count is large.
- Per-layer config: Each layer can have its own combination: SWA or full attention, shared or fresh K/V, dense FFN or MoE. The model is no longer a uniform stack.
This heterogeneity makes Gemma 4 more complex to implement and reason about, but it lets the architecture spend resources where they matter most: full attention and fresh K/V for layers that need long-range or diverse representations, local attention and shared K/V for layers where that extra capacity would be largely redundant.
A uniform decoder stack treats every layer identically: each layer gets the same attention scope, its own K/V projections, and the same FFN size. This is simple but wasteful. Research on layer-wise behavior shows that not all layers contribute equally:
- Early layers tend to learn local syntactic patterns — word order, phrase boundaries, morphology. They rarely need to attend 100K tokens back. A sliding window is sufficient and much cheaper.
- Middle layers learn semantic relationships that sometimes span long distances (coreference, discourse structure) but often do not. An alternating pattern — some full attention, some local — captures both needs without paying the full cost everywhere.
- Later layers refine representations toward the specific prediction task. These often benefit from full attention (to access any information the earlier layers gathered) and from MoE FFNs (to specialize the per-token transformation without making the dense FFN enormous).
The key insight: a uniform stack forces you to provision every layer for the worst case. If any layer needs full attention, all layers get it. If any layer needs a large FFN, all layers get it. Heterogeneous design lets each layer be right-sized for its actual role.
In a uniform model, the config is simple: one attention type, one FFN type, applied everywhere. Gemma 4 introduces per-layer config arrays. The model's metadata contains arrays that map each layer index to its specific settings:
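As a rough illustration of what such per-layer arrays might look like, here is a minimal Python sketch. The field names (`attn_type`, `kv_source`, `ffn_type`) and the 4-layer values are hypothetical, chosen only for this example; they are not the actual metadata keys or values used by Gemma 4 or llama.cpp.

```python
from dataclasses import dataclass


@dataclass
class LayerArrays:
    """Hypothetical per-layer config arrays; index i describes layer i."""
    attn_type: list[str]   # "swa" or "full"
    kv_source: list[int]   # layer whose K/V is used; i itself means fresh K/V
    ffn_type: list[str]    # "dense" or "moe"

    def settings(self, i: int) -> dict:
        """Collect the settings for layer i from the parallel arrays."""
        return {
            "attn": self.attn_type[i],
            "shares_kv": self.kv_source[i] != i,
            "kv_source": self.kv_source[i],
            "ffn": self.ffn_type[i],
        }


# Illustrative values only (assumed, not read from a real model file).
cfg = LayerArrays(
    attn_type=["swa", "full", "swa", "full"],
    kv_source=[0, 1, 0, 1],          # layers 2 and 3 borrow from layers 0 and 1
    ffn_type=["dense", "dense", "moe", "moe"],
)

for i in range(len(cfg.attn_type)):
    print(i, cfg.settings(i))
```

Storing one array per setting, rather than one record per layer, mirrors the "arrays that map each layer index to its settings" description; either layout would carry the same information.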
The model builder reads these arrays and dispatches differently for each layer. This is fundamentally different from the uniform pattern, where a single for i in range(n_layers) loop calls the same functions on every iteration. Now each iteration must check: What attention type? Do I compute K/V or borrow? Dense FFN or MoE?
For the implementer, this means more branching in the graph-building code. For the performance analyst, it means you cannot profile "a representative layer" — different layers have genuinely different costs and bottlenecks.
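The per-layer checks described above can be sketched as a small build loop. This is a toy planner under stated assumptions, not llama.cpp's actual builder: the `LAYERS` records, the `build_plan` function, and the op names it emits are all hypothetical and exist only to show where the branching happens.

```python
# Hypothetical per-layer settings; names and values are assumptions.
# kv_from=None means the layer computes fresh K/V.
LAYERS = [
    {"attn": "swa",  "kv_from": None, "ffn": "dense"},
    {"attn": "full", "kv_from": None, "ffn": "moe"},
    {"attn": "swa",  "kv_from": 0,    "ffn": "dense"},
    {"attn": "full", "kv_from": 1,    "ffn": "moe"},
]


def build_plan(layers):
    """Return, for each layer, the list of ops a graph builder might emit."""
    plan = []
    for i, layer in enumerate(layers):
        ops = ["q_proj"]

        # Do I compute K/V or borrow?
        if layer["kv_from"] is None:
            ops += ["k_proj", "v_proj"]
        else:
            if layer["kv_from"] >= i:
                raise ValueError(f"layer {i} must borrow K/V from an earlier layer")
            ops.append(f"reuse_kv(layer={layer['kv_from']})")

        # What attention type?
        if layer["attn"] == "swa":
            ops.append("attn_sliding_window")
        else:
            ops.append("attn_full_causal")

        # Dense FFN or MoE?
        ops.append("ffn_moe" if layer["ffn"] == "moe" else "ffn_dense")
        plan.append(ops)
    return plan


for i, ops in enumerate(build_plan(LAYERS)):
    print(f"layer {i}: {' -> '.join(ops)}")
```

Each layer ends up with a different op list, which is also why a profile of one layer does not generalize to the others.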
Heterogeneous design has a concrete memory payoff. Consider a hypothetical 42-layer model:
The combination of shared K/V (cutting unique KV layers by 2/3) and SWA (bounding cache per layer) dramatically reduces the total KV cache footprint. This is what makes long-context serving feasible: a uniform model with 42 full-attention, fresh-KV layers would need an enormous cache at 128K context. The heterogeneous design achieves similar quality with a fraction of the memory.
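To make the memory argument concrete, here is a back-of-the-envelope calculation in Python. Only the 42 layers, the 2/3 sharing ratio (14 layers with their own K/V), and the 128K context come from the text; the head count, head dimension, fp16 storage, the 1024-token window, and the split of the 14 fresh-KV layers into 12 SWA and 2 full-attention layers are assumptions picked for illustration.

```python
def kv_bytes_per_layer(cached_tokens, n_kv_heads, head_dim, bytes_per_elem):
    """Bytes of K plus V cached by one layer for one sequence."""
    return 2 * cached_tokens * n_kv_heads * head_dim * bytes_per_elem


N_LAYERS = 42            # from the text
CONTEXT = 128 * 1024     # 128K tokens, from the text
WINDOW = 1024            # assumed sliding-window size
N_KV_HEADS = 8           # assumed
HEAD_DIM = 128           # assumed
BYTES_PER_ELEM = 2       # assumed fp16 cache

# 2/3 of layers share K/V, so 14 layers own a cache (split below is assumed).
FRESH_FULL = 2
FRESH_SWA = 12

# Uniform baseline: every layer caches the full context.
uniform = N_LAYERS * kv_bytes_per_layer(CONTEXT, N_KV_HEADS, HEAD_DIM, BYTES_PER_ELEM)

# Heterogeneous: only fresh-KV layers cache, and SWA layers cache at most WINDOW tokens.
full_part = FRESH_FULL * kv_bytes_per_layer(CONTEXT, N_KV_HEADS, HEAD_DIM, BYTES_PER_ELEM)
swa_part = FRESH_SWA * kv_bytes_per_layer(min(WINDOW, CONTEXT), N_KV_HEADS, HEAD_DIM, BYTES_PER_ELEM)
hetero = full_part + swa_part

GIB = 1024 ** 3
print(f"uniform:       {uniform / GIB:.2f} GiB")
print(f"heterogeneous: {hetero / GIB:.2f} GiB")
print(f"reduction:     {uniform / hetero:.1f}x")
```

Under these assumed numbers the uniform stack needs 21 GiB per sequence, while the heterogeneous one needs a little over 1 GiB, almost all of it in the two full-attention layers. Different head counts or window sizes change the totals but not the shape of the result.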
Simplified 6-layer Gemma 4-style config:
No two layers are necessarily configured the same way. The build loop must check each layer's config.
No new formula. Gemma 4 composes the pieces you already know:
In llama.cpp, Gemma 4's model builder loops over layers and checks per-layer config to decide the attention type, KV sharing, and FFN type. This makes it one of the more complex model builders in the codebase, because no layer can be assumed to match any other.
Heterogeneous layers make performance analysis harder. Some layers are cheap (SWA + shared KV + dense FFN). Some are expensive (full attention + fresh KV + MoE). Profiling a Gemma 4 model shows uneven per-layer timing, so you cannot assume every layer costs the same. That assumption is a common mistake carried over from reasoning about plain dense decoders.
Next, you will see how these architectures actually execute at inference time: the split between prefill and decode, and why the distinction matters for every optimization decision.
What makes Gemma 4 more complex than a plain dense decoder?
Why can you not assume every layer in Gemma 4 has the same inference cost?