
MoE Runs Selected Experts and Weights Their Outputs

How are expert outputs combined?

Once the router has selected the top-k experts for a token, each selected expert processes the token independently. Every expert is a small FFN — it takes in the hidden state and produces an output of the same shape.

The final output is not a simple average. It is a weighted sum of the expert outputs, using the router's gating weights renormalized over just the selected experts so they sum to 1. Why renormalize? The original softmax weights sum to 1 across all experts, but only the top-k experts actually run. Without renormalization, the selected weights might sum to 0.6 or 0.9, so the scale of the output would depend on how much weight happened to fall on the experts that were not selected. Renormalizing makes the weighted sum a proper blend of the selected experts' outputs.
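
The renormalization step can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular framework's router: the function names `softmax` and `top_k_renormalized` and the example logits are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_renormalized(gates, k):
    """Keep the k largest gates and rescale them so they sum to 1.

    Returns a dict mapping expert index -> renormalized weight.
    """
    top = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    selected_sum = sum(gates[i] for i in top)
    return {i: gates[i] / selected_sum for i in top}

router_logits = [0.2, 1.5, -0.3, 0.9]   # hypothetical scores for 4 experts
gates = softmax(router_logits)          # sums to 1 across all 4 experts
weights = top_k_renormalized(gates, k=2)

raw_sum = sum(gates[i] for i in weights)  # below 1 before renormalization
print(weights, raw_sum)
```

Before rescaling, the two selected gates sum to less than 1 (`raw_sum`). After rescaling, the values in `weights` sum to 1.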

If Expert A got gating weight 0.8 and Expert B got 0.2, the combined output is 80% Expert A's answer and 20% Expert B's answer. This soft blending lets the model smoothly mix information from multiple specialists.

You might wonder: why blend two experts at all? Why not just pick the single best expert?

Hard selection (top-1 routing) makes the choice discrete: one expert is fully on, the rest are fully off. This creates sparse training signals: non-selected experts' FFN weights receive no task-loss gradient because they contributed nothing to the output. The router still receives gradient through the selected expert's gate, provided the raw gate value scales the output (renormalizing a single selected weight would make it exactly 1 and remove that gradient). Even so, the signal is weaker than with top-2, making it harder to gradually improve routing decisions.

Soft blending (top-k with weighted combination) makes the choice almost discrete but with smooth gradients. Even the weaker expert contributes a little to the output, so its router weight receives gradient. The model can gradually shift routing decisions as it learns which expert handles which kinds of tokens better.
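
One way to see which experts can receive a learning signal is to nudge each expert's output and measure how much the combined output moves. The sketch below uses a finite difference in place of real backpropagation. The helper names `combine` and `sensitivity` and all numbers are illustrative assumptions.

```python
def combine(weights, expert_outputs):
    """Weighted sum over the selected experts only.

    weights: dict expert index -> normalized gate weight
    expert_outputs: list of output vectors, one per expert
    """
    d = len(expert_outputs[0])
    out = [0.0] * d
    for i, w in weights.items():
        for j in range(d):
            out[j] += w * expert_outputs[i][j]
    return out

# Three experts exist; top-2 routing selected experts 2 and 0.
expert_outputs = [[0.3, -0.1], [0.5, 0.5], [1.0, 0.5]]
weights = {2: 0.8, 0: 0.2}

def sensitivity(expert_index, eps=1e-3):
    """Change in output[0] per unit nudge to one expert's first coordinate."""
    base = combine(weights, expert_outputs)
    nudged = [list(v) for v in expert_outputs]
    nudged[expert_index][0] += eps
    return (combine(weights, nudged)[0] - base[0]) / eps

for i in range(3):
    print(i, sensitivity(i))
```

The sensitivity for each selected expert is approximately its gate weight (about 0.8 and 0.2), so even the weaker expert influences the output. The unselected expert 1 has sensitivity exactly 0, which is why its FFN weights receive no task-loss gradient.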

Top-2 is a widely used choice: enough blending for smooth gradients, few enough experts to keep compute low. Some architectures use top-1, which is cheaper per token but gives the router a weaker learning signal.

MoE aggregation has a distinctive memory access pattern that matters for performance. In a dense FFN, every token uses the same weight matrices — the hardware loads them once and reuses them across all tokens in the batch. In MoE, different tokens may use different experts. If token 1 selects experts A and B, and token 2 selects experts C and D, the hardware must load four different sets of weights.

In the worst case (no overlap between tokens' expert selections), weight traffic scales with n_experts_used, the number of distinct experts the batch touches, whereas a dense layer loads one set of FFN weights no matter how many tokens are in the batch. In practice, popular experts are shared across tokens, which helps. But MoE layers are generally more memory-intensive than dense layers of the same per-token compute, which is why MoE models are especially sensitive to memory bandwidth during decode.
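
A rough way to reason about this is to count how many distinct experts a batch touches, since each one means another set of weights to read. The sketch below is a counting exercise only. The helper name `distinct_experts` and the expert labels are hypothetical, and it ignores caching and kernel-level details.

```python
def distinct_experts(selections):
    """Number of distinct experts whose weights must be read for a batch.

    selections: one tuple of expert labels per token.
    """
    return len({expert for token in selections for expert in token})

# Token 1 picks A and B, token 2 picks C and D: no overlap.
no_overlap = [("A", "B"), ("C", "D")]
# Both tokens pick A and B: full overlap.
full_overlap = [("A", "B"), ("A", "B")]

print(distinct_experts(no_overlap))    # 4 sets of expert weights
print(distinct_experts(full_overlap))  # 2 sets of expert weights
```

Both batches do the same amount of per-token compute (two experts per token), but the first has to read twice as many expert weight sets.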

Top-2 experts selected. Raw gating weights: Expert 2 = 0.74, Expert 0 = 0.18.

renormalize: sum = 0.74 + 0.18 = 0.92
w₂ = 0.74 / 0.92 = 0.804    w₀ = 0.18 / 0.92 = 0.196

expert outputs (d_model = 3):
Expert 2: [1.0,  0.5, -0.2]
Expert 0: [0.3, -0.1,  0.8]

output = 0.804 × [1.0, 0.5, -0.2] + 0.196 × [0.3, -0.1, 0.8]
       = [0.804+0.059, 0.402-0.020, -0.161+0.157]
       = [0.863, 0.382, -0.004]

Each expert input: [d_model]
Each expert output: [d_model]
Normalized weights: [k]   (sum to 1)
Combined output: [d_model]   same shape as if a single FFN ran

w'ᵢ = gateᵢ / ∑ⱼ gateⱼ   // renormalize over selected experts (j ranges over the top-k)
output = ∑ᵢ w'ᵢ · expertᵢ(x)   // weighted sum of expert outputs
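
The worked example above can be checked with a short script. It uses the gate values and expert outputs from the example. The variable names are illustrative. Because the script keeps full precision while the hand calculation rounds the weights to three decimals first, the results agree only to within about 0.001.

```python
# Raw gating weights and expert outputs from the worked example.
gates = {2: 0.74, 0: 0.18}
expert_outputs = {
    2: [1.0, 0.5, -0.2],
    0: [0.3, -0.1, 0.8],
}
d_model = 3

# Renormalize over the selected experts only.
total = sum(gates.values())
weights = {i: g / total for i, g in gates.items()}

# Weighted sum of expert outputs, one coordinate at a time.
output = [
    sum(weights[i] * expert_outputs[i][j] for i in weights)
    for j in range(d_model)
]

print(weights)
print(output)
```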

In llama.cpp, build_moe_ffn() handles the full pipeline: router scoring, top-k selection, dispatching tokens to their selected experts, running each expert FFN, and combining the weighted outputs. The aggregation is a scatter-gather pattern.

Each expert is a smaller FFN than the dense alternative, so individual expert compute is cheap. But different tokens may select different experts, creating irregular memory access patterns. Efficient MoE implementations batch tokens per expert to maximize GPU utilization. Load imbalance — when most tokens pick the same expert — is a real problem that auxiliary loss terms try to prevent.
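
The idea of batching tokens per expert can be sketched as a gather step (bucket tokens by expert), one call per expert, and a scatter step (add weighted results back to each token's slot). This is a toy illustration in plain Python and is not how `build_moe_ffn()` is written. The function `moe_forward`, the toy experts, and the routing table are all assumptions made for the example.

```python
from collections import defaultdict

def moe_forward(tokens, routing, experts):
    """Run each expert once on the tokens routed to it, then combine.

    tokens: list of input vectors
    routing: per token, a list of (expert_id, normalized_weight) pairs
    experts: dict expert_id -> function mapping a batch of vectors to a batch
    Returns (outputs, load), where load counts tokens per expert.
    """
    # Gather: bucket token positions by the expert they selected.
    buckets = defaultdict(list)
    for pos, choices in enumerate(routing):
        for expert_id, weight in choices:
            buckets[expert_id].append((pos, weight))

    outputs = [[0.0] * len(t) for t in tokens]
    for expert_id, entries in buckets.items():
        batch = [tokens[pos] for pos, _ in entries]
        results = experts[expert_id](batch)  # one batched call per expert
        # Scatter: add each weighted result into its token's output slot.
        for (pos, weight), result in zip(entries, results):
            for j, v in enumerate(result):
                outputs[pos][j] += weight * v

    load = {e: len(entries) for e, entries in buckets.items()}
    return outputs, load

# Toy "experts" standing in for FFNs.
experts = {
    0: lambda batch: [[2.0 * v for v in x] for x in batch],
    1: lambda batch: [[-v for v in x] for x in batch],
    2: lambda batch: [[v + 1.0 for v in x] for x in batch],
}
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
routing = [
    [(0, 0.75), (1, 0.25)],
    [(0, 0.5), (2, 0.5)],
    [(0, 0.6), (1, 0.4)],
]

outputs, load = moe_forward(tokens, routing, experts)
print(outputs)
print(load)
```

In this toy batch every token selects expert 0, so `load` is skewed toward it. That skew is the kind of imbalance that auxiliary loss terms are meant to discourage.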

Check Yourself
Q1

Two experts are selected with raw gating weights 0.6 and 0.2. After renormalization, what are the weights?

Q2

If Expert A outputs [2, 4] with weight 0.75 and Expert B outputs [6, 0] with weight 0.25, what is the combined output?