The FFN Transforms Each Token Independently
After attention has mixed information between tokens, the feed-forward network (FFN) processes each token independently. No token looks at any other token here — it is a per-position transformation.
The FFN first expands the hidden state to a larger dimension (typically 4× wider), applies a nonlinear activation function, then projects back to the original dimension. Why is the nonlinearity critical? Without it, stacking two linear projections (expand then contract) is mathematically equivalent to a single linear projection, so the wider layer would add no expressive power. The activation function breaks this equivalence and lets the model represent complex, nonlinear feature interactions.
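The collapse argument can be checked numerically. The following is a minimal NumPy sketch, not code from any particular model: the sizes, the random weights, and the row-vector convention (x @ W) are assumptions chosen for illustration, and SiLU is used as one example of an activation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 4, 8                        # illustrative sizes (assumption)

x = rng.normal(size=(d_model,))             # one token's hidden state
W_up = rng.normal(size=(d_model, d_ff))     # expand
W_down = rng.normal(size=(d_ff, d_model))   # contract

# Without an activation, expand-then-contract is a single d_model x d_model map.
W_merged = W_up @ W_down
linear_two_step = (x @ W_up) @ W_down
linear_one_step = x @ W_merged
collapses = np.allclose(linear_two_step, linear_one_step)

def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

# With an activation in between, the result is no longer the merged linear map.
with_activation = silu(x @ W_up) @ W_down
differs = not np.allclose(with_activation, linear_one_step)

print(collapses, differs)
```

Any nonlinear activation would break the equivalence in the same way; SiLU is used here only because it appears later in the gated variant.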
Modern models often use a gated variant (SwiGLU), where two parallel projections are computed in the expanded space — one acts as a learned gate that controls how much of the other passes through. The gate selectively amplifies or suppresses features, giving the FFN finer-grained control. This is why you may see the FFN described as having three weight matrices instead of two.
Figure: the FFN applied to one token with d_model=4, d_ff=8 (expansion factor 2×). The same operation runs independently on every token, with no cross-token interaction.
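The per-position behaviour can also be seen in a small sketch with the same sizes (d_model=4, d_ff=8). This assumes a plain, non-gated FFN with SiLU and random weights; the function name ffn and the three-token input are hypothetical, chosen only for the demonstration.

```python
import numpy as np

def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def ffn(x, W_up, W_down):
    """Plain (non-gated) FFN: expand, activate, project back."""
    return silu(x @ W_up) @ W_down

rng = np.random.default_rng(0)
d_model, d_ff, n_tokens = 4, 8, 3
W_up = rng.normal(size=(d_model, d_ff))
W_down = rng.normal(size=(d_ff, d_model))
X = rng.normal(size=(n_tokens, d_model))     # one row per token

# Applying the FFN to all tokens at once matches applying it token by token.
batched = ffn(X, W_up, W_down)
one_by_one = np.stack([ffn(tok, W_up, W_down) for tok in X])

# Changing token 2 leaves the outputs for tokens 0 and 1 untouched.
X_changed = X.copy()
X_changed[2] += 10.0
changed = ffn(X_changed, W_up, W_down)

print(np.allclose(batched, one_by_one), np.allclose(changed[:2], batched[:2]))
```

The batched call is just a convenience: the weights are shared, but each output row depends only on its own input row.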
The expand-then-contract pattern is not arbitrary. It creates a wide intermediate representation where the activation function can do useful work. Consider what would happen without the expansion: the model would apply a nonlinearity to a d_model-dimensional vector, where every dimension is already heavily used to encode information. There would be little room for the activation to selectively amplify or suppress features.
By projecting to a wider space (d_ff = 4× d_model is typical), the model gives the activation function many more dimensions to work with. Some of those dimensions will be strongly activated (high values after SiLU), others will be near zero (suppressed). The down-projection then selects and recombines the surviving features back into d_model dimensions. The effect is a learned, nonlinear feature-mixing operation.
In the gated variant, two parallel projections are computed from the same input: x · W_gate and x · W_up. The gate projection passes through SiLU (a smooth activation), then multiplies elementwise with the up projection. Because SiLU(z) = z · sigmoid(z) is not bounded to [0, 1], the gate acts as a learned soft scaling over the expanded features rather than a strict mask:
- Where the gate output is large and positive, the up-projection feature passes through, scaled by the gate value.
- Where the gate output is near 0, the feature is suppressed.
This gives the model finer-grained control than a simple activate-everything approach. The cost is one extra projection matrix (W_gate), which is why gated FFNs have 3 weight matrices instead of 2. To keep the parameter count comparable, gated FFNs typically shrink d_ff to roughly two-thirds of the non-gated width. Empirically, SwiGLU has been reported to outperform the older ReLU-based FFN at the same parameter count.
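A minimal NumPy sketch of the gated computation described above. The function name swiglu_ffn, the sizes, and the random weights are assumptions for illustration, and bias terms are omitted; this is not taken from any specific implementation.

```python
import numpy as np

def silu(z):
    """SiLU activation: z * sigmoid(z). Its output is not bounded to [0, 1]."""
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, W_gate, W_up, W_down):
    """Gated FFN: the SiLU-activated gate scales the up projection elementwise."""
    gate = silu(x @ W_gate)       # shape (..., d_ff)
    up = x @ W_up                 # shape (..., d_ff)
    return (gate * up) @ W_down   # back to (..., d_model)

rng = np.random.default_rng(0)
d_model, d_ff = 4, 8
W_gate = rng.normal(size=(d_model, d_ff))
W_up = rng.normal(size=(d_model, d_ff))
W_down = rng.normal(size=(d_ff, d_model))

x = rng.normal(size=(d_model,))
y = swiglu_ffn(x, W_gate, W_up, W_down)

# Gate behaviour at a few pre-activation values: close to 0 for strongly
# negative inputs, exactly 0 at 0, and close to the input for large positives.
gate_samples = silu(np.array([-10.0, 0.0, 10.0]))

# Three weight matrices versus two for the plain FFN at the same d_ff.
params_gated = W_gate.size + W_up.size + W_down.size
params_plain = W_up.size + W_down.size
print(y.shape, gate_samples, params_gated, params_plain)
```

Keeping d_ff fixed, the gated version carries 50% more FFN parameters, which is the reason for shrinking d_ff when comparing at equal parameter count.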
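Standard FFN: FFN(x) = act(x · W_up) · W_down
Gated variant (SwiGLU): FFN(x) = (SiLU(x · W_gate) ⊙ (x · W_up)) · W_down
Here act is the activation function, ⊙ is elementwise multiplication, W_down is the down-projection, and bias terms are omitted.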
In llama.cpp, build_ffn() in src/llama-graph.cpp handles the FFN computation. It supports different FFN types (plain, gated, SwiGLU). The gate and up projections often run in parallel since they are independent matrix multiplies on the same input.
The FFN is often the most parameter-heavy part of a layer. With d_ff = 4 × d_model, the two (or three) weight matrices hold about 8 × d_model² (or 12 × d_model²) parameters, compared with about 4 × d_model² for the Q, K, V and output projections of standard multi-head attention. FFN cost scales with n_tokens × d_model × d_ff, which is linear in sequence length, unlike attention's quadratic scaling. For short sequences, FFN dominates compute; for long sequences, attention dominates.
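The parameter and compute comparison can be made concrete with a rough back-of-the-envelope sketch. The helper names below are hypothetical, and the estimates rest on simplifying assumptions: standard multi-head attention with four d_model × d_model projections, no biases, 2 FLOPs per multiply-add, and no accounting for softmax, normalization, or causal masking.

```python
def ffn_params(d_model, d_ff, gated=False):
    """Weights in one FFN block: 2 matrices, or 3 when gated."""
    return (3 if gated else 2) * d_model * d_ff

def attn_params(d_model):
    """Q, K, V and output projections, each d_model x d_model (assumption)."""
    return 4 * d_model * d_model

def ffn_flops(n_tokens, d_model, d_ff, gated=False):
    """One matmul per weight matrix, 2 FLOPs per multiply-add."""
    return 2 * n_tokens * d_model * d_ff * (3 if gated else 2)

def attn_flops(n_tokens, d_model):
    """Projections (linear in n_tokens) plus QK^T and scores @ V (quadratic)."""
    projections = 2 * n_tokens * d_model * d_model * 4
    mixing = 2 * n_tokens * n_tokens * d_model * 2
    return projections + mixing

d_model, d_ff = 512, 2048   # illustrative sizes with d_ff = 4 * d_model

print("FFN params:", ffn_params(d_model, d_ff), "attention params:", attn_params(d_model))
for n in (128, 1024, 8192):
    print(n, ffn_flops(n, d_model, d_ff), attn_flops(n, d_model))
```

Under these assumptions the two costs are equal at n_tokens = 2 × d_model for a plain FFN with d_ff = 4 × d_model; real models shift that crossover depending on details such as gating, grouped-query attention, and masking.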
What is the key difference between what attention does and what the FFN does?
If d_model = 256 and d_ff = 1024, what is the shape of W_up?