
The MoE Router Scores Experts Per Token

How does the router choose experts?

A Mixture-of-Experts (MoE) layer replaces the single FFN with several smaller FFNs called experts. Not every expert runs for every token. A small network called the router decides which experts each token should use.

The router is a simple linear projection: it takes the token's hidden state and produces one score (a router logit) per expert. These logits measure how relevant each expert is for this particular token.

The logits are then passed through softmax (or sigmoid, depending on the architecture) to get gating weights. In the softmax variant, which this course focuses on, the weights over all experts sum to 1. Why not just pick the top-k logits directly? Because the gating weights serve double duty: they select which experts to run, and they determine how much each expert's output contributes to the final result. Raw logits are unbounded and can be negative, so they make poor mixing coefficients; softmax turns them into positive weights on a common scale. After top-k selection the k chosen weights sum to slightly less than 1, so many implementations renormalize them over the selected experts. The combined output is then a proper weighted average rather than an arbitrarily scaled sum. Other routing variants exist (sigmoid gating, expert choice), but the softmax top-k pattern is the one you will encounter in the case studies.
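To make the "double duty" concrete, here is a minimal sketch in plain Python. The four toy experts (simple scalar functions), the helper name moe_combine, and the gate values are made up for illustration; real experts are FFNs operating on vectors, and whether the selected weights are renormalized depends on the model.

```python
# Hypothetical toy experts: each maps a scalar to a scalar.
experts = [
    lambda x: 2.0 * x,   # expert 0
    lambda x: x + 1.0,   # expert 1
    lambda x: -x,        # expert 2
    lambda x: x * x,     # expert 3
]

def moe_combine(x, gates, k):
    """Use the gate weights twice: to select experts and to mix their outputs."""
    # Duty 1, selection: indices of the k largest gate weights.
    chosen = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    # Renormalize over the chosen experts so the mix is a weighted average.
    # (Assumption: some models skip this and use the raw gate weights.)
    norm = sum(gates[i] for i in chosen)
    weights = {i: gates[i] / norm for i in chosen}
    # Duty 2, mixing: only the chosen experts are ever called.
    output = sum(w * experts[i](x) for i, w in weights.items())
    return chosen, weights, output

gates = [0.2, 0.1, 0.6, 0.1]   # already-softmaxed weights, sum to 1
chosen, weights, output = moe_combine(3.0, gates, k=2)
print(chosen)   # [2, 0]
print(output)   # approximately -0.75 (0.75 * -3.0 + 0.25 * 6.0)
```

Experts 1 and 3 never run in this call, which is where the compute saving comes from.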

Each token may pick a different set of experts. This is what makes MoE "sparse": the model has many total parameters (all the experts), but each token only activates a small subset. A model with 64 experts and top-2 routing uses only about 3% of its expert parameters per token — so it can be very large in parameter count while keeping per-token compute manageable.

4 experts, top-2 selection. Router logits for one token:

logits:  [2.1,  0.3,  3.5,  -0.8]
softmax: [0.19, 0.03, 0.77, 0.01]   (sum ≈ 1.0)
top-2:   Expert 2 (weight 0.77), Expert 0 (weight 0.19)   (experts are numbered from 0)

Experts 1 and 3 are not activated for this token. They do zero work.
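The numbers in this example can be checked with a few lines of standard-library Python. The final renormalization step is an optional extra that some models apply, not something this example requires.

```python
import math

logits = [2.1, 0.3, 3.5, -0.8]

# Softmax: exponentiate each logit, then divide by the sum.
exps = [math.exp(v) for v in logits]
total = sum(exps)
gates = [e / total for e in exps]
print([round(g, 2) for g in gates])  # [0.19, 0.03, 0.77, 0.01]

# Top-2: the two experts with the largest gate weights.
top2 = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:2]
print(top2)  # [2, 0]

# The two selected weights cover most, but not all, of the probability mass.
selected_mass = sum(gates[i] for i in top2)
print(round(selected_mass, 2))  # 0.96

# Optional renormalization over the selected experts (model-dependent).
renormalized = [gates[i] / selected_mass for i in top2]
print([round(w, 2) for w in renormalized])  # [0.8, 0.2]
```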

In an ideal MoE, different tokens go to different experts — the workload is spread evenly. But without explicit encouragement, training can collapse into a few "popular" experts that handle most tokens while others are rarely used. This is called expert collapse.

Why does this happen? The router is trained to maximize prediction quality. If expert 2 happens to become slightly better early in training, the router sends more tokens to it, which gives it more gradient updates, which makes it even better — a positive feedback loop. Meanwhile, unused experts stagnate.

Training addresses this with a load-balancing loss: an auxiliary objective that penalizes the model when the token distribution across experts is too uneven. This is a training concern (not an inference concern), but you should know about it because it explains why real MoE models work well instead of degenerating — and why you sometimes see a "balance_loss" term in training configs.
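As an illustration only, here is one common formulation of such a loss, in the style used by Switch Transformer: the number of experts times the sum over experts of (fraction of routing slots the expert received) times (mean gate weight the router gave it). The lesson does not say which formulation any particular model uses, so treat the function below, its name, and the toy inputs as assumptions. The scaling coefficient that a training config would apply is omitted.

```python
def balance_loss(gates_per_token, top_k):
    """Sketch of an auxiliary loss: n_experts * sum_i(f_i * p_i).

    gates_per_token: one list of gate weights (length n_experts) per token.
    f_i: fraction of routing slots assigned to expert i.
    p_i: mean gate weight the router gave expert i.
    """
    n_tokens = len(gates_per_token)
    n_experts = len(gates_per_token[0])
    counts = [0] * n_experts
    mean_gate = [0.0] * n_experts
    for gates in gates_per_token:
        chosen = sorted(range(n_experts), key=lambda i: gates[i], reverse=True)[:top_k]
        for i in chosen:
            counts[i] += 1
        for i, g in enumerate(gates):
            mean_gate[i] += g / n_tokens
    frac = [c / (n_tokens * top_k) for c in counts]
    return n_experts * sum(f * p for f, p in zip(frac, mean_gate))

# Four tokens, four experts, top-1 routing.
balanced = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1],
            [0.1, 0.1, 0.7, 0.1], [0.1, 0.1, 0.1, 0.7]]
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4   # every token prefers expert 0

print(balance_loss(balanced, top_k=1))    # about 1.0, the value at perfect balance
print(balance_loss(collapsed, top_k=1))   # about 2.8, penalized for collapse
```

The loss is smallest when tokens are spread evenly and grows as routing concentrates on a few experts, which is the pressure that counteracts the feedback loop described above.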

MoE lets you scale the model's total knowledge (parameter count) independently from its per-token cost (active compute). A dense 7B-parameter model activates all 7B parameters for every token. An MoE model with 64 experts of 1B each has 64B expert parameters in total but activates only about 2B of them per token with top-2 routing, plus the shared non-expert layers. The per-token compute is similar to that of a small dense model, but the total parameter count, and with it the capacity to specialize, is much larger.
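The arithmetic behind these figures is simple enough to sketch. The function name and the shared_params argument are hypothetical; the lesson gives no size for the shared non-expert layers, so it defaults to 0 here.

```python
def moe_param_counts(n_experts, params_per_expert, top_k, shared_params=0):
    """Return (total, active-per-token) parameter counts for a simplified MoE."""
    total = n_experts * params_per_expert + shared_params
    active = top_k * params_per_expert + shared_params
    return total, active

B = 1_000_000_000  # one billion

total, active = moe_param_counts(n_experts=64, params_per_expert=1 * B, top_k=2)
print(total // B, active // B)   # 64 2
print(f"{active / total:.1%}")   # 3.1%
```

Passing a nonzero shared_params raises both numbers by the same amount, so the active fraction of the whole model is somewhat higher than the active fraction of the expert parameters alone.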

The catch: all 64B parameters must fit in memory, even though only 2B are computed per token. MoE models are memory-heavy and bandwidth-heavy (loading expert weights), even if they are not compute-heavy. This is why MoE models are often memory-bandwidth-bound during decode — a bottleneck you will see in detail in the inference module, amplified here by the large total parameter count.
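A rough back-of-the-envelope version of this trade-off, as a sketch under stated assumptions: 16-bit weights (2 bytes per parameter), a made-up memory bandwidth of 1000 GB/s, a single sequence being decoded, and only expert weights counted (shared layers and the KV cache are ignored).

```python
B = 1_000_000_000    # one billion parameters
GB = 1_000_000_000   # decimal gigabyte, in bytes

bytes_per_param = 2  # assumption: 16-bit weights

# Every expert must be resident in memory, even if rarely used.
resident_bytes = 64 * B * bytes_per_param
# Only the top-2 experts' weights are read for a given token.
read_per_token_bytes = 2 * B * bytes_per_param

print(resident_bytes / GB)         # 128.0
print(read_per_token_bytes / GB)   # 4.0

# Upper bound on decode speed if reading expert weights were the only cost.
bandwidth = 1000 * GB              # hypothetical bytes per second
max_tokens_per_s = bandwidth / read_per_token_bytes
print(max_tokens_per_s)            # 250.0
```

The resident footprint follows the total parameter count, while the per-token read follows the active count, which is why the memory requirement is large even though the compute is small.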

Hidden state per token: [d_model]
Router weight matrix: [d_model, n_experts]
Router logits per token: [n_experts]
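After gating: top-k indices and their normalized weights.

logits   = x · W_router        // [d_model] · [d_model, n_experts] → [n_experts]
gates    = softmax(logits)     // or sigmoid(logits); shape [n_experts]
selected = top_k(gates, k)     // k (index, weight) pairs
weights  = selected.weights / sum(selected.weights)   // renormalize over the k chosen experts (model-dependent)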
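The pseudocode above can be turned into a small runnable sketch. This version uses plain Python lists so the shapes are explicit; the function name route and the toy values for x and w_router are invented for illustration, and the renormalization of the selected weights is an assumption that varies by model.

```python
import math

def route(x, w_router, k):
    """x: [d_model]; w_router: [d_model][n_experts]. Returns (indices, weights)."""
    d_model, n_experts = len(w_router), len(w_router[0])
    assert len(x) == d_model
    # Linear projection: logits[e] = sum over d of x[d] * W[d][e] -> [n_experts]
    logits = [sum(x[d] * w_router[d][e] for d in range(d_model))
              for e in range(n_experts)]
    # Softmax, shifted by the max logit for numerical stability.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    gates = [v / total for v in exps]
    # Top-k selection, then renormalize the selected weights (assumption).
    indices = sorted(range(n_experts), key=lambda e: gates[e], reverse=True)[:k]
    norm = sum(gates[e] for e in indices)
    weights = [gates[e] / norm for e in indices]
    return indices, weights

x = [1.0, -2.0, 0.5]                   # d_model = 3
w_router = [[0.2, -0.1, 0.4, 0.0],     # shape [3, 4]: 4 experts
            [0.1, 0.3, -0.5, 0.2],
            [-0.3, 0.6, 0.1, 0.7]]
indices, weights = route(x, w_router, k=2)
print(indices)   # [2, 3]
```

A production implementation would do the projection as one batched matrix multiply over all tokens; the per-element loops here are only meant to expose the shapes.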

In llama.cpp, the router is a linear layer whose output dimension equals the number of experts. The gating function (softmax or sigmoid) and top-k value are model hyperparameters. You will see this in build_moe_ffn().

The router itself is cheap — a single matrix-vector multiply. The cost that matters is what happens next: only k of n experts run. If a model has 64 experts and top-2 routing, each token uses 2/64 = about 3.1% of the expert parameters. The total parameter count is large, but the active compute per token is small.

Check Yourself
Q1 (math)

Given router logits [1.0, 4.0, 0.5, 3.0] and top-2 selection, which experts are chosen?

Q2 (shape)

What is the shape of the router output for a single token if the model has 16 experts?