L41

Case Study: Gemma 4 MoE Layer

Question

How does Gemma 4 route tokens to experts?

Intuition

In a dense FFN, every token passes through every parameter. A Mixture-of-Experts (MoE) layer replaces the single FFN with many parallel "expert" FFNs and a router that decides which experts each token should use.

The process has three stages:

Router scoring — The router is a small linear layer that takes each token's hidden state and produces a score for every expert. These scores are passed through softmax to become probabilities.
Top-k selection — Only the top-k experts with the highest scores are selected for each token (k varies by model — e.g., k=2 in Mixtral, k=8 in Gemma 4). All other experts are skipped for that token.
Weighted aggregation — Each selected expert processes the token independently, producing its own output. The final output is a weighted sum of the expert outputs, where the weights are the router probabilities (renormalized over just the selected experts).

This means each token activates only a fraction of the total parameters. A model can have an enormous total parameter count but a small "active" parameter count per token, which keeps compute costs manageable.

Toy Example

4 experts, top-2 selection for one token:

token hidden state: h = [0.3, −0.7, 1.2, 0.5]

router scores (after softmax): [0.1, 0.6, 0.05, 0.25]

selected: expert 1 (w = 0.6 / 0.85 ≈ 0.71), expert 3 (w = 0.25 / 0.85 ≈ 0.29) (renormalized over the two selected experts)

expert 1(h) = [0.2, 0.4, −0.1, 0.8]

expert 3(h) = [−0.3, 0.1, 0.6, 0.2]

output: 0.71 × expert 1(h) + 0.29 × expert 3(h) ≈ [0.05, 0.31, 0.10, 0.63]

Experts 0 and 2 were never computed for this token. Only 2 of 4 FFNs ran.
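The toy example can be reproduced with a few lines of plain C++. This is only a sketch for checking the arithmetic: the helper names top_k_indices and combine_selected are made up for illustration and do not come from llama.cpp, and the expert outputs are supplied as given vectors rather than computed by real FFNs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: indices of the k largest probabilities, highest first.
std::vector<std::size_t> top_k_indices(const std::vector<float>& probs, std::size_t k) {
    assert(k <= probs.size());
    std::vector<std::size_t> idx(probs.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(k), idx.end(),
                      [&](std::size_t a, std::size_t b) { return probs[a] > probs[b]; });
    idx.resize(k);
    return idx;
}

// Hypothetical helper: weighted sum of the selected experts' outputs.
// Weights are the router probabilities renormalized over the selected experts only.
// expert_out is indexed by expert id; entries for unselected experts are never read.
std::vector<float> combine_selected(const std::vector<float>& probs,
                                    const std::vector<std::size_t>& selected,
                                    const std::vector<std::vector<float>>& expert_out) {
    assert(!selected.empty());
    float total = 0.0f;
    for (std::size_t e : selected) total += probs[e];

    std::vector<float> out(expert_out[selected[0]].size(), 0.0f);
    for (std::size_t e : selected) {
        const float w = probs[e] / total;  // renormalized weight
        for (std::size_t d = 0; d < out.size(); ++d) out[d] += w * expert_out[e][d];
    }
    return out;
}
```

Calling top_k_indices with the probabilities [0.1, 0.6, 0.05, 0.25] and k = 2 selects experts 1 and 3. Passing the two expert outputs from the example to combine_selected gives roughly [0.053, 0.312, 0.106, 0.624] with unrounded weights, which matches the rounded figures above to within 0.01.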

Shapes

Router weight: [d_model, n_experts]

Router output: [n_tokens, n_experts] → softmax → top-k selection

Each expert FFN: [n_tokens_routed, d_model] → [n_tokens_routed, d_model]

Final output: weighted sum → [n_tokens, d_model]

n_tokens_routed varies per expert — different experts may process different numbers of tokens.
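To make the varying n_tokens_routed concrete, here is a minimal sketch that groups token indices by the experts they selected. The function name group_tokens_by_expert is hypothetical, and this gather-style grouping is only an illustration of the shapes; it is not meant to describe how llama.cpp dispatches tokens internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: for each expert, list the tokens routed to it.
// selected_per_token[t] holds the expert ids chosen for token t (k ids per token).
// routed[e].size() plays the role of n_tokens_routed for expert e.
std::vector<std::vector<std::size_t>> group_tokens_by_expert(
        const std::vector<std::vector<std::size_t>>& selected_per_token,
        std::size_t n_experts) {
    std::vector<std::vector<std::size_t>> routed(n_experts);
    for (std::size_t t = 0; t < selected_per_token.size(); ++t) {
        for (std::size_t e : selected_per_token[t]) {
            assert(e < n_experts);
            routed[e].push_back(t);
        }
    }
    return routed;
}
```

With 3 tokens, 4 experts, and top-2 selections {1, 3}, {1, 2}, {0, 1}, expert 1 receives all three tokens while experts 0, 2, and 3 each receive one. The per-expert counts always sum to n_tokens × k.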

Math

Router scoring and weighted aggregation:

scores = softmax(h ⋅ W_router)

selected = top_k(scores, k)

output = ∑_{i ∈ selected} w_i × expert_i(h)

w_i = scores[i] / ∑_{j ∈ selected} scores[j] (renormalized weights)
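A short worked step, added here as a sketch, shows what the renormalization does. Writing z for the raw router logits (a symbol introduced only for this derivation), the full-softmax denominator cancels when the selected scores are divided by their sum:

```latex
\begin{aligned}
z &= h \cdot W_{\text{router}}, \qquad
\text{scores}_i = \frac{e^{z_i}}{\sum_{m=1}^{n_{\text{experts}}} e^{z_m}} \\
w_i &= \frac{\text{scores}_i}{\sum_{j \in \text{selected}} \text{scores}_j}
     = \frac{e^{z_i}}{\sum_{j \in \text{selected}} e^{z_j}}
\end{aligned}
```

So the renormalized weights are the same as a softmax taken over the selected experts' logits alone. Because softmax preserves ordering, top_k over the scores also picks the same experts as top_k over the logits.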

The Real Code

Gemma 4 MoE layers run a shared expert FFN and a routed set of expert FFNs in parallel, then combine both results. Here is the branching logic inside the layer loop:

const bool is_moe_layer = model.layers[il].ffn_gate_inp != nullptr;

if (is_moe_layer) {
    // 1. Shared expert — always runs, same structure as a dense FFN
    ggml_tensor * cur_mlp = build_norm(attn_out, model.layers[il].ffn_norm, ...);
    cur_mlp = build_ffn(cur_mlp,
            model.layers[il].ffn_up, model.layers[il].ffn_gate,
            model.layers[il].ffn_down, LLM_FFN_GELU, LLM_FFN_PAR, il);
    cur_mlp = build_norm(cur_mlp, model.layers[il].ffn_post_norm_1, ...);

    // 2. Router — custom scoring: normalize, scale, then project to [n_expert, n_tokens]
    ggml_tensor * tmp = ggml_rms_norm(ctx0, attn_out, hparams.f_norm_rms_eps);
    tmp = ggml_scale(ctx0, tmp, 1.0f / sqrtf((float) n_embd));
    tmp = ggml_mul(ctx0, tmp, model.layers[il].ffn_gate_inp_s);
    ggml_tensor * logits = build_lora_mm(model.layers[il].ffn_gate_inp, tmp);

    // 3. Routed experts — logits passed in so build_moe_ffn handles top-k + dispatch
    ggml_tensor * cur_moe = build_norm(attn_out, model.layers[il].ffn_pre_norm_2, ...);
    cur_moe = build_moe_ffn(cur_moe,
            model.layers[il].ffn_down_exps,
            n_expert, n_expert_used, LLM_FFN_GELU,
            LLAMA_EXPERT_GATING_FUNC_TYPE_SOFTMAX,
            il, logits,  // ← router output drives expert selection
            model.layers[il].ffn_gate_up_exps, ...);
    cur_moe = build_norm(cur_moe, model.layers[il].ffn_post_norm_2, ...);

    // 4. Combine shared + routed
    cur = ggml_add(ctx0, cur_mlp, cur_moe);
}

Source: ggml-org/llama.cpp @ 94ca829b — src/models/gemma4-iswa.cpp (simplified for clarity)

The shared expert (cur_mlp) is a standard FFN that always runs. The routed experts (cur_moe) are selected per-token by the router. Both results are added together. This "shared + routed" pattern means every token gets a baseline FFN result even if the router makes a poor selection.
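Step 2 of the snippet (normalize, scale, multiply, project) can be written out for a single token in plain C++. This is a sketch under stated assumptions, not the ggml implementation: the function router_logits and its parameter names are hypothetical, scale stands in for ffn_gate_inp_s, w_router stands in for ffn_gate_inp, and the RMS norm is taken to be the unweighted form x / sqrt(mean(x²) + eps).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical single-token version of the router path.
// x:        attention output for one token, length n_embd
// scale:    learned per-channel scale, length n_embd
// w_router: router projection, stored here as [n_expert][n_embd]
std::vector<float> router_logits(const std::vector<float>& x,
                                 const std::vector<float>& scale,
                                 const std::vector<std::vector<float>>& w_router,
                                 float eps) {
    const std::size_t n_embd = x.size();
    assert(n_embd > 0 && scale.size() == n_embd);

    // Unweighted RMS norm: x / sqrt(mean(x^2) + eps)
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += v * v;
    const float inv_rms = 1.0f / std::sqrt(sum_sq / static_cast<float>(n_embd) + eps);

    // Scale by 1/sqrt(n_embd), then apply the per-channel scale
    const float inv_sqrt_d = 1.0f / std::sqrt(static_cast<float>(n_embd));
    std::vector<float> tmp(n_embd);
    for (std::size_t d = 0; d < n_embd; ++d) tmp[d] = x[d] * inv_rms * inv_sqrt_d * scale[d];

    // Project to one logit per expert
    std::vector<float> logits(w_router.size(), 0.0f);
    for (std::size_t e = 0; e < w_router.size(); ++e) {
        assert(w_router[e].size() == n_embd);
        for (std::size_t d = 0; d < n_embd; ++d) logits[e] += w_router[e][d] * tmp[d];
    }
    return logits;
}
```

One thing the sketch makes visible: after the RMS norm the vector has length close to sqrt(n_embd), so the 1/sqrt(n_embd) scale brings it to roughly unit length before the per-channel scale and the projection. The logits therefore depend on the direction of the hidden state rather than its magnitude.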

Implementation Hook

The build_moe_ffn() helper in src/llama-graph.cpp handles the full routing loop: softmax over router logits, top-k selection, running each selected expert's FFN, and weighted combination. In the Gemma 4 code above, the model builder computes the router logits itself and passes them in together with the expert weight tensors; the helper handles the rest of the mechanics.

src/models/gemma4-iswa.cpp — layer loop (L30)

src/llama-graph.cpp — build_moe_ffn()

Performance Hook

MoE decouples total parameter count from active parameter count. A model with 64 experts and top-2 routing uses only 2/64 ≈ 3.1% of its expert parameters per token. This keeps compute (FLOPs) manageable, but all expert weights must still be held in memory. MoE models are often memory-bandwidth bound during decode because the weights are large even though the compute per token is small.
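The gap between what must be resident and what is read per token can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope model only: the struct and function names are hypothetical, it assumes all routed experts are the same size, it covers one MoE layer and one token, and it ignores the shared expert, attention weights, and any caching effects.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical estimate for the routed experts of one MoE layer.
struct MoeFootprint {
    double resident_bytes;        // every expert must be held in memory
    double bytes_read_per_token;  // only the k selected experts are read for one token
    double active_fraction;       // k / n_experts
};

MoeFootprint moe_footprint(std::size_t n_experts, std::size_t k,
                           double params_per_expert, double bytes_per_param) {
    assert(n_experts > 0 && k <= n_experts);
    const double expert_bytes = params_per_expert * bytes_per_param;
    MoeFootprint f;
    f.resident_bytes = static_cast<double>(n_experts) * expert_bytes;
    f.bytes_read_per_token = static_cast<double>(k) * expert_bytes;
    f.active_fraction = static_cast<double>(k) / static_cast<double>(n_experts);
    return f;
}
```

For 64 experts and top-2 routing the active fraction comes out to 0.03125, the 3.1% quoted above. With an illustrative, made-up expert size of 1e6 parameters at 2 bytes each, that is 128 MB resident against 4 MB read per token. When several tokens are processed together they can select different experts, so the amount read per step can be larger than this single-token figure.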

Check Yourself

Q1 (conceptual)

What does the router in a MoE layer do?

It routes tokens to different transformer layers based on their content
It scores each expert for a given token and selects the top-k experts to process that token
It splits the hidden state into chunks and sends each chunk to a different expert

Q2 (conceptual)

If a MoE layer has 8 experts and uses top-2 routing, how are the two expert outputs combined?

The outputs are concatenated to double the hidden dimension
Only the highest-scoring expert output is used; the second is discarded
The outputs are summed using the renormalized router probabilities as weights

Q3 (conceptual)

In build_moe_ffn(), what is the role of the softmax applied to the router output?

It converts raw router logits into a probability distribution over experts, so the top-k weights are meaningful
It normalizes the expert FFN outputs before aggregation
It determines the hidden dimension each expert will use