Putting the Whole Decoder Block Together
You now know every piece. A decoder block is just these pieces in a fixed order:
1. RMSNorm the hidden states
2. Multi-head causal attention (Q/K/V projections, RoPE, masked scores, softmax, weighted V sum, output projection)
3. Add residual (attention output + original input)
4. RMSNorm the result
5. FFN (expand, activate, contract; applied per token)
6. Add residual (FFN output + input to step 4)
That is the entire block. Token mixing happens in step 2. Per-token transformation happens in step 5. The norms stabilize scale. The residuals preserve information. A model with N layers repeats this block N times.
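The six steps above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: RoPE is left out for brevity, the FFN uses a plain ReLU with two matrices rather than a gated variant, and `init_params` plus the parameter names (`g1`, `wq`, `w_up`, and so on) are hypothetical helpers invented for this example.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Scale each token vector to unit root-mean-square, then apply a learned gain.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * gain

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(x, wq, wk, wv, wo, n_heads):
    # x: (seq, d_model). RoPE is omitted here (assumption for brevity).
    seq, d_model = x.shape
    d_head = d_model // n_heads
    # Project, then split into heads: (n_heads, seq, d_head).
    q = (x @ wq).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    k = (x @ wk).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    v = (x @ wv).reshape(seq, n_heads, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (n_heads, seq, seq)
    # Causal mask: position i may not read from positions j > i.
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    mixed = softmax(scores) @ v                          # (n_heads, seq, d_head)
    # Merge heads back to (seq, d_model), then apply the output projection.
    return mixed.transpose(1, 0, 2).reshape(seq, d_model) @ wo

def ffn(x, w_up, w_down):
    # Expand, activate (ReLU as a stand-in), contract. Acts on each token row independently.
    return np.maximum(x @ w_up, 0.0) @ w_down

def decoder_block(h, p, n_heads):
    a = causal_attention(rms_norm(h, p["g1"]),           # steps 1-2
                         p["wq"], p["wk"], p["wv"], p["wo"], n_heads)
    h = h + a                                            # step 3
    f = ffn(rms_norm(h, p["g2"]), p["w_up"], p["w_down"])  # steps 4-5
    return h + f                                         # step 6

def init_params(d_model, d_ff, rng, scale=0.1):
    # Hypothetical helper: small random weights for demonstration only.
    w = lambda rows, cols: rng.standard_normal((rows, cols)) * scale
    return {
        "g1": np.ones(d_model), "g2": np.ones(d_model),
        "wq": w(d_model, d_model), "wk": w(d_model, d_model),
        "wv": w(d_model, d_model), "wo": w(d_model, d_model),
        "w_up": w(d_model, d_ff), "w_down": w(d_ff, d_model),
    }

rng = np.random.default_rng(0)
params = init_params(d_model=8, d_ff=16, rng=rng)
h_in = rng.standard_normal((4, 8))       # 4 tokens, d_model = 8
h_out = decoder_block(h_in, params, n_heads=2)
print(h_out.shape)                       # same shape as the input
```

The block maps a `(seq, d_model)` array to another `(seq, d_model)` array, which is what allows N copies to be stacked. Because of the causal mask, changing a later token leaves the outputs at earlier positions unchanged.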
The ordering of operations inside a block is not arbitrary. Each design choice has a specific reason:
Pre-norm, not post-norm
The RMSNorm comes before each sub-layer (attention, FFN), not after. This is called pre-norm, and it matters for training stability. In the post-norm variant (used in the original Transformer paper), the norm is applied after the residual add, so it sits on the main path and every gradient must flow through every norm layer during backpropagation, which can cause vanishing or unstable gradients in deep networks. Pre-norm places the norm inside the residual branch, so the residual connection provides a clean gradient path regardless of what happens inside the norm and sub-layer. Most modern LLMs use pre-norm.
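The structural difference between the two variants fits in a few lines. The sketch below is illustrative only: `pre_norm_step`, `post_norm_step`, and `zero_sublayer` are names made up for this example, the norm has no learned gain, and it shows only the forward pass. With a sub-layer that outputs zeros, pre-norm returns its input untouched, while post-norm still rescales it, because the norm sits on the main path.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm without a learned gain, to keep the comparison minimal.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def pre_norm_step(x, sublayer):
    # Norm is inside the residual branch; x itself passes through unchanged.
    return x + sublayer(rms_norm(x))

def post_norm_step(x, sublayer):
    # Norm is applied after the add, so everything passes through it.
    return rms_norm(x + sublayer(x))

def zero_sublayer(t):
    # Stand-in for a sub-layer that contributes nothing.
    return np.zeros_like(t)

x = np.array([[3.0, -4.0, 12.0, 0.5]])
pre_out = pre_norm_step(x, zero_sublayer)    # identical to x
post_out = post_norm_step(x, zero_sublayer)  # x rescaled to unit RMS
```

The same identity path that carries `x` forward unchanged is what carries gradients backward unchanged in the pre-norm layout.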
Attention before FFN
Attention is the token-mixing step: it lets each position read from other positions. The FFN is the per-token step: it transforms each position independently. The ordering matters because the FFN's per-token transformation is more useful when it operates on tokens that have already been enriched with cross-position information from attention. If FFN came first, it would transform tokens in isolation, then attention would mix those independently-transformed tokens — a less powerful composition.
Two residual connections, not one
Each sub-layer (attention, FFN) gets its own residual connection. Why not one residual around the whole block? Because separate residuals give each sub-layer a direct path to the input. If attention produces something unhelpful, the residual ensures the FFN still sees the original information. This makes the network more robust and easier to train — each sub-layer only needs to learn the delta to add.
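Understanding where the parameters live helps you reason about model size and cost. Take a typical block with d_model = 4096, n_heads = 32, d_head = 128, and d_ff = 14336 (a common ratio of 3.5× d_model). With full multi-head attention, the four attention projections (Q, K, V, output) hold 4 × 4096 × 4096 ≈ 67M weights. Assuming a gated FFN (three d_model × d_ff matrices), the FFN holds 3 × 4096 × 14336 ≈ 176M. The two RMSNorm gain vectors add only 8,192. That comes to roughly 243M parameters per block.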
The FFN accounts for about 72% of each block's parameters. (Later, you will see variants like GQA that reduce K/V heads and shrink the attention share further.) A 32-layer model has 32 × 243M ≈ 7.8B parameters in its blocks alone, which is most of the model's total parameter count (embedding and output head add relatively little).
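The 243M and 72% figures can be reproduced with a short calculation. This sketch assumes full multi-head attention (n_heads × d_head = d_model), a gated FFN with three weight matrices, and no bias terms; those assumptions are the ones that match the numbers quoted above. `block_params` is a hypothetical helper name.

```python
def block_params(d_model, n_heads, d_head, d_ff):
    """Count weights in one decoder block (no biases assumed)."""
    attn_width = n_heads * d_head
    attention = 4 * d_model * attn_width   # Q, K, V, and output projections
    ffn = 3 * d_model * d_ff               # gate, up, and down matrices (gated FFN assumed)
    norms = 2 * d_model                    # one gain vector per RMSNorm
    return {"attention": attention, "ffn": ffn, "norms": norms,
            "total": attention + ffn + norms}

counts = block_params(d_model=4096, n_heads=32, d_head=128, d_ff=14336)
ffn_share = counts["ffn"] / counts["total"]
all_blocks = 32 * counts["total"]

print(counts)
print(f"FFN share: {ffn_share:.1%}")
print(f"32 blocks: {all_blocks / 1e9:.2f}B parameters")
```

Changing `d_ff` or the number of FFN matrices in this function is a quick way to see how sensitive the total is to the FFN design.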
Each layer refines the hidden states slightly. But what does "refine" mean across 32 passes?
Empirical studies of transformer internals suggest a rough progression; the layer ranges below assume a 32-layer model. Early layers (1-8) tend to learn local patterns: which tokens are adjacent, basic syntactic structure, common phrases. The residual stream after these layers carries mostly surface-level features. Middle layers (9-24) build semantic representations: subject-verb agreement across distances, entity tracking, basic reasoning chains. The residual stream becomes increasingly abstract. Late layers (25-32) sharpen predictions: they specialize the representation toward the specific next-token prediction task, often recovering surface details needed for generation.
This matters for performance reasoning: not all layers are equally important for a given task. Some pruning and distillation techniques exploit this by removing or simplifying middle layers that contribute less to specific downstream tasks. But for general-purpose generation, every layer contributes.
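Full block flow for hidden states h: h → RMSNorm → attention → add to h (residual) → RMSNorm → FFN → add to the post-attention h (residual) → output h.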
The complete block, written compactly:
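A sketch of that compact form, reconstructed from the six steps listed earlier. The symbols h' (the hidden states after the attention residual) and h_out (the block output) are labels chosen here for illustration.

```latex
\begin{aligned}
h' &= h + \mathrm{Attention}\bigl(\mathrm{RMSNorm}(h)\bigr) \\
h_{\text{out}} &= h' + \mathrm{FFN}\bigl(\mathrm{RMSNorm}(h')\bigr)
\end{aligned}
```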
Two lines. That is the entire decoder block. Everything you learned in this module fills in the details of these two lines.
In llama.cpp, each model architecture (e.g. src/models/gemma.cpp) implements a build_graph() function that loops over layers. Inside each iteration, it calls build_norm(), build_attn(), adds the residual, calls build_norm() again, build_ffn(), and adds the second residual. The pattern matches the pseudocode above almost line for line.
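The full block cost is dominated by matrix multiplies against the weights: three for the Q/K/V projections, one for the attention output projection, and two or three for the FFN (three when the FFN is gated). The attention score and weighted-V products are also matrix multiplies, but their cost grows with context length rather than with parameter count. The norms, residual adds, softmax, and masking are comparatively cheap. For a model with N layers, every operation in this block runs N times per token generated.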
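A rough count of multiply-accumulate operations (MACs) for one generated token in one block makes the comparison concrete. This is a back-of-the-envelope sketch, not a profiler: it assumes full multi-head attention with n_heads × d_head = d_model, a KV cache so that only the new token's projections are computed, and a context length of 1024 picked purely for illustration. `per_token_macs` is a hypothetical helper name.

```python
def per_token_macs(d_model, d_ff, context_len, gated_ffn=True):
    """Approximate multiply-accumulates for one new token in one block."""
    # Each weight matmul on a single token vector costs d_in * d_out MACs.
    attn_proj = 4 * d_model * d_model                      # Q, K, V, output
    ffn = (3 if gated_ffn else 2) * d_model * d_ff
    # Scores against cached keys plus the weighted sum over cached values:
    # context_len * d_model MACs each, summed across all heads.
    attention_mix = 2 * context_len * d_model
    return {"weight_matmuls": attn_proj + ffn, "attention_mix": attention_mix}

costs = per_token_macs(d_model=4096, d_ff=14336, context_len=1024)
print(costs)
# Norms and residual adds touch each of the d_model values only a few times,
# so they are on the order of thousands of operations, not hundreds of millions.
```

Under these assumptions the weight matmuls outweigh the context-dependent attention work by a wide margin at short contexts, and the gap narrows as the context grows.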
Trace the shapes through one complete decoder block. The model has 10 tokens, d_model=512, n_heads=8 (so d_head=64), and d_ff=2048.
What is the correct order of operations inside a single decoder block (pre-norm architecture)?
In the decoder block, which step is the token-mixing step (where tokens exchange information)?