L19

The FFN Transforms Each Token Independently

15 min

Question

What does the FFN do?

Intuition

After attention has mixed information between tokens, the feed-forward network (FFN) processes each token independently. No token looks at any other token here — it is a per-position transformation.

The FFN first expands the hidden state to a larger dimension (typically 4× wider), applies a nonlinear activation function, then projects back to the original dimension. Why is the nonlinearity critical? Without it, stacking two linear projections (expand then contract) is mathematically equivalent to a single linear projection, because the two weight matrices multiply into one [d_model, d_model] matrix. The extra width would add no expressive power. The activation function breaks this equivalence and lets the model represent complex, nonlinear feature interactions.
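The collapse of two linear layers into one can be checked numerically. The following is a minimal sketch with random weights; the sizes, the seed, and the variable names are illustrative assumptions, not values from any real model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 4, 8  # toy sizes, chosen only for illustration

W_up = rng.standard_normal((d_model, d_ff))
W_down = rng.standard_normal((d_ff, d_model))
x = rng.standard_normal((3, d_model))  # three tokens

# Without an activation, expand-then-contract is a single linear map.
two_step = (x @ W_up) @ W_down
W_merged = W_up @ W_down               # one [d_model, d_model] matrix
one_step = x @ W_merged
linear_collapses = np.allclose(two_step, one_step)

def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

# With an activation in between, the merged matrix no longer reproduces the result.
with_activation = silu(x @ W_up) @ W_down
nonlinear_differs = not np.allclose(with_activation, one_step)

print(linear_collapses, nonlinear_differs)
```

Both flags should come out true: the purely linear stack matches the merged matrix, and the version with SiLU does not.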

Modern models often use a gated variant (SwiGLU), where two parallel projections are computed in the expanded space — one acts as a learned gate that controls how much of the other passes through. The gate selectively amplifies or suppresses features, giving the FFN finer-grained control. This is why you may see the FFN described as having three weight matrices instead of two.

Toy Example

One token, d_model=4, d_ff=8 (expansion factor 2×):

input: x = [0.5, -1.0, 0.3, 0.8] (4 dims)

up-project: x × W_up → [_, _, _, _, _, _, _, _] (8 dims)

activate: SiLU(expanded) (nonlinearity, still 8 dims)

down-project: activated × W_down → [_, _, _, _] (back to 4 dims)

output: [0.7, -0.4, 0.1, 1.2] (same shape, different values)

Same operation runs independently on every token. No cross-token interaction.
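The toy example can be traced in code. This sketch uses random weights (an assumption, since the lesson does not give W_up or W_down), so the output values will not match the illustrative numbers above; only the shapes and the per-token independence carry over.

```python
import numpy as np

def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def ffn(x, W_up, W_down):
    """Non-gated FFN: up-project, activate, down-project."""
    return silu(x @ W_up) @ W_down

rng = np.random.default_rng(42)
d_model, d_ff = 4, 8
W_up = rng.standard_normal((d_model, d_ff)) * 0.5    # placeholder weights
W_down = rng.standard_normal((d_ff, d_model)) * 0.5  # placeholder weights

x = np.array([0.5, -1.0, 0.3, 0.8])   # the token from the toy example
expanded = x @ W_up                   # 8 dims
activated = silu(expanded)            # still 8 dims
out = activated @ W_down              # back to 4 dims

# Independence check: run two tokens together, then change the second one.
tokens = np.stack([x, rng.standard_normal(d_model)])
batch_out = ffn(tokens, W_up, W_down)

tokens_changed = tokens.copy()
tokens_changed[1] = 0.0
changed_out = ffn(tokens_changed, W_up, W_down)

# The first token's output is unaffected by what the second token contains.
first_token_unchanged = np.allclose(batch_out[0], changed_out[0])
```

Because each row of the input only ever multiplies the shared weight matrices, editing one token cannot change another token's output.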

Why Expand, Then Contract?

The expand-then-contract pattern is not arbitrary. It creates a wide intermediate representation where the activation function can do useful work. Consider what would happen without the expansion: the model would apply a nonlinearity to a d_model-dimensional vector, where every dimension is already heavily used to encode information. There would be little room for the activation to selectively amplify or suppress features.

By projecting to a wider space (d_ff = 4× d_model is typical), the model gives the activation function many more dimensions to work with. Some of those dimensions will be strongly activated (high values after SiLU), others will be near zero (suppressed). The down-projection then selects and recombines the surviving features back into d_model dimensions. The effect is a learned, nonlinear feature-mixing operation.

What the Gate Does (SwiGLU)

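In the gated variant, two parallel projections are computed from the same input: x · W_gate and x · W_up. The gate projection passes through SiLU (a smooth activation, SiLU(z) = z · sigmoid(z)), then multiplies elementwise with the up projection. This means the gate learns a soft mask over the expanded features. Unlike a sigmoid gate, SiLU is not limited to the range [0, 1], so the mask scales features rather than merely switching them on or off:

Where the gate output is large and positive, the up-projection feature passes through, scaled by the gate value.
Where the gate output is near 0 (gate pre-activation near zero or strongly negative), the feature is suppressed.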

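This gives the model finer-grained control than a simple activate-everything approach. The cost is one extra projection matrix (W_gate), which is why gated FFNs have 3 weight matrices instead of 2. To keep the parameter count comparable, gated FFNs commonly shrink d_ff (to roughly 2/3 of the non-gated width). In published comparisons, SwiGLU has been reported to outperform the older ReLU-based FFN at the same parameter count.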
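To make the gating behaviour concrete, here is a small sketch with made-up pre-activation values for four expanded features. The numbers are assumptions chosen to show the different regimes of SiLU and do not come from a trained model.

```python
import numpy as np

def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

# Hypothetical values of x · W_gate for four expanded features.
gate_pre = np.array([-8.0, 0.0, 1.0, 4.0])
# Hypothetical values of x · W_up for the same four features.
up = np.array([2.0, 2.0, 2.0, 2.0])

gate = silu(gate_pre)   # strongly negative input -> ~0, large positive input -> ~input
gated = gate * up       # elementwise product: the gating step

for g_in, g_out, result in zip(gate_pre, gate, gated):
    print(f"gate_pre={g_in:+.1f}  gate={g_out:+.4f}  gated feature={result:+.4f}")
```

The same up-projection value ends up almost zero, exactly zero, partly passed, or amplified depending only on the gate. Note that the gate can exceed 1 and can be slightly negative, so "mask" is a loose description.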

Shapes

Input: [n_tokens, d_model]

W_up: [d_model, d_ff] (expand)

After up-project + activation: [n_tokens, d_ff]

W_down: [d_ff, d_model] (contract)

Output: [n_tokens, d_model] (same shape as input)

Gated variant adds W_gate: [d_model, d_ff] for the gating projection.

Math

Standard FFN:

FFN(x) = activation(x ⋅ W_up) ⋅ W_down

Gated variant (SwiGLU):

FFN(x) = (SiLU(x ⋅ W_gate) ⊙ (x ⋅ W_up)) ⋅ W_down

⊙ = elementwise multiply (the gating mechanism)
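Both formulas translate almost directly into NumPy. The sketch below is for illustration only: the function names, sizes, and random weights are assumptions, and biases are omitted, as in the formulas above.

```python
import numpy as np

def silu(z):
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def ffn_standard(x, W_up, W_down, activation=silu):
    """FFN(x) = activation(x · W_up) · W_down"""
    return activation(x @ W_up) @ W_down

def ffn_swiglu(x, W_gate, W_up, W_down):
    """FFN(x) = (SiLU(x · W_gate) ⊙ (x · W_up)) · W_down"""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 5, 16, 64   # illustrative sizes

x = rng.standard_normal((n_tokens, d_model))
W_gate = rng.standard_normal((d_model, d_ff)) * 0.1
W_up = rng.standard_normal((d_model, d_ff)) * 0.1
W_down = rng.standard_normal((d_ff, d_model)) * 0.1

y_std = ffn_standard(x, W_up, W_down)
y_glu = ffn_swiglu(x, W_gate, W_up, W_down)

print(y_std.shape, y_glu.shape)  # both [n_tokens, d_model]
```

In NumPy, `@` is the matrix product (the ⋅ in the formulas) and `*` is the elementwise product (the ⊙).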

Implementation Hook

In llama.cpp, build_ffn() in src/llama-graph.cpp handles the FFN computation. It supports different FFN configurations: with or without a gate projection, and with different activation functions (SiLU gives SwiGLU when a gate is present). The gate and up projections can run in parallel since they are independent matrix multiplies on the same input.

src/llama-graph.cpp — build_ffn()

Performance Hook

The FFN is often the most parameter-heavy part of a layer. With d_ff = 4 × d_model, the two weight matrices hold 8 × d_model² parameters, about twice the 4 × d_model² of standard multi-head attention (Q, K, V, and output projections). A third, gated matrix widens the gap unless d_ff is reduced. FFN cost scales with n_tokens × d_model × d_ff, which is linear in sequence length, unlike attention's quadratic scaling. For short sequences, FFN dominates compute; for long sequences, attention dominates.
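A back-of-the-envelope calculation shows where these claims come from. The sketch below assumes standard multi-head attention with four square projection matrices, no biases, no grouped-query attention, and 2 FLOPs per multiply-add. The helper names and the d_model value are illustrative assumptions.

```python
def ffn_params(d_model, d_ff, gated=False):
    """Weights in the FFN: W_up and W_down, plus W_gate if gated."""
    n_matrices = 3 if gated else 2
    return n_matrices * d_model * d_ff

def attention_params(d_model):
    """Assumes square Q, K, V, and output projections (plain multi-head attention)."""
    return 4 * d_model * d_model

def ffn_flops(n_tokens, d_model, d_ff, gated=False):
    """Each weight is used once per token: 2 FLOPs per multiply-add."""
    return 2 * n_tokens * ffn_params(d_model, d_ff, gated)

def attention_score_flops(n_tokens, d_model):
    """Quadratic part only: Q·K^T plus scores·V, summed over all heads."""
    return 2 * 2 * n_tokens * n_tokens * d_model

d_model = 4096            # illustrative size
d_ff = 4 * d_model

param_ratio = ffn_params(d_model, d_ff) / attention_params(d_model)

for n_tokens in (128, 4096, 65536):
    ffn = ffn_flops(n_tokens, d_model, d_ff)
    attn = attention_score_flops(n_tokens, d_model)
    heavier = "FFN" if ffn > attn else "attention scores"
    print(f"n_tokens={n_tokens}: heavier part is {heavier}")
```

Under these assumptions the non-gated FFN costs 16 × n_tokens × d_model² FLOPs and the quadratic attention term costs 4 × n_tokens² × d_model, so the crossover sits near n_tokens = 4 × d_model. Real models shift this point through grouped-query attention, causal masking, and kernel efficiency.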

Check Yourself

Q1 (conceptual)

What is the key difference between what attention does and what the FFN does?

Attention processes tokens independently; FFN mixes information between tokens
Attention mixes information between tokens; FFN transforms each token independently
They do the same thing but with different weight matrices

Q2 (shape)

If d_model = 256 and d_ff = 1024, what is the shape of W_up?

[256, 256]
[256, 1024]
[1024, 1024]