L20

Putting the Whole Decoder Block Together

12 min

Question

What does the full decoder block look like?

Intuition

You now know every piece. A decoder block is just these pieces in a fixed order:

1. RMSNorm the hidden states
2. Multi-head causal attention (Q/K/V projections, RoPE, masked scores, softmax, weighted V sum, output projection)
3. Add residual (attention output + original input)
4. RMSNorm the result
5. FFN (expand, activate, contract — per token)
6. Add residual (FFN output + input to step 4)

That is the entire block. Token mixing happens in step 2. Per-token transformation happens in step 5. The norms stabilize scale. The residuals preserve information. A model with N layers repeats this block N times.

Why This Specific Order

The ordering of operations inside a block is not arbitrary. Each design choice has a specific reason:

Pre-norm, not post-norm

The RMSNorm comes before each sub-layer (attention, FFN), not after. This is called pre-norm, and it matters for training stability. In the post-norm variant (used in the original Transformer paper), the norm is applied after the residual add, so it sits on the main path and gradients must pass through a norm layer at every block during backpropagation, which can make deep networks unstable to train. Pre-norm places the norm inside the residual branch, so the skip connection carries gradients straight through, regardless of what happens inside the norm and sub-layer. Most modern LLMs use pre-norm.
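The difference in placement can be sketched in a few lines. This is a minimal illustration, assuming NumPy and using RMSNorm in both variants so that only the position of the norm differs (the original post-norm Transformer used LayerNorm). The names pre_norm_step and post_norm_step are hypothetical, not taken from any library.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Rescale each token vector to roughly unit root-mean-square, then apply a learned gain.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * weight

def pre_norm_step(h, sublayer, weight):
    # The norm sits inside the branch; the skip path adds h back untouched.
    return h + sublayer(rms_norm(h, weight))

def post_norm_step(h, sublayer, weight):
    # The norm wraps the sum, so the skip path is rescaled too.
    return rms_norm(h + sublayer(h), weight)

h = np.full((2, 4), 3.0)                  # two tokens, every feature equal to 3
weight = np.ones(4)
silent = lambda x: np.zeros_like(x)       # a sub-layer that contributes nothing

pre_out = pre_norm_step(h, silent, weight)    # skip path passes h through unchanged
post_out = post_norm_step(h, silent, weight)  # h is rescaled even though the sub-layer is silent
```

With a silent sub-layer, the pre-norm step returns its input exactly, while the post-norm step still rescales it. That is the sense in which pre-norm keeps the residual path clean.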

Attention before FFN

Attention is the token-mixing step: it lets each position read from other positions. The FFN is the per-token step: it transforms each position independently. The ordering matters because the FFN's per-token transformation is more useful when it operates on tokens that have already been enriched with cross-position information from attention. If FFN came first, it would transform tokens in isolation, then attention would mix those independently-transformed tokens — a less powerful composition.
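The split between token mixing and per-token transformation can be checked numerically. The sketch below assumes NumPy, a toy ReLU FFN, and a stripped-down single-head causal attention with no learned projections; all names and sizes are illustrative. It edits token 0 and checks which later outputs move.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_ff = 4, 8, 16
w_up = rng.normal(size=(d_model, d_ff))
w_down = rng.normal(size=(d_ff, d_model))

def ffn(x):
    # Each row (token) is transformed independently of every other row.
    return np.maximum(x @ w_up, 0.0) @ w_down

def causal_self_attention(x):
    # Toy attention: queries, keys and values are all x itself.
    scores = x @ x.T / np.sqrt(x.shape[-1])
    future = np.triu(np.ones((len(x), len(x)), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)          # hide later positions
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x

x = rng.normal(size=(n_tokens, d_model))
x_edit = x.copy()
x_edit[0] += 1.0                                        # change only the first token

# FFN outputs for tokens 1..3 should not notice the edit to token 0.
ffn_unchanged = np.allclose(ffn(x)[1:], ffn(x_edit)[1:])
# Attention outputs for tokens 1..3 read from token 0, so they should move.
attn_changed = not np.allclose(causal_self_attention(x)[1:],
                               causal_self_attention(x_edit)[1:])
```

Both flags are expected to be true: only attention lets information from token 0 reach later positions.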

Two residual connections, not one

Each sub-layer (attention, FFN) gets its own residual connection. Why not one residual around the whole block? Because separate residuals give each sub-layer a direct path to the input. If attention produces something unhelpful, the residual ensures the FFN still sees the original information. This makes the network more robust and easier to train — each sub-layer only needs to learn the delta to add.

Parameter Count Per Block

Understanding where the parameters live helps you reason about model size and cost. For a typical block with d_model = 4096, n_heads = 32, d_head = 128, and d_ff = 14336 (a common ratio of ~3.5× d_model):

Attention sub-layer:

W_Q: 4096 × 4096 = 16.8M params

W_K: 4096 × 4096 = 16.8M params

W_V: 4096 × 4096 = 16.8M params

W_O: 4096 × 4096 = 16.8M params

Subtotal: 67.1M params (4 × d_model²)

FFN sub-layer (SwiGLU):

W_gate: 4096 × 14336 = 58.7M params

W_up: 4096 × 14336 = 58.7M params

W_down: 14336 × 4096 = 58.7M params

Subtotal: 176.2M params (3 × d_model × d_ff)

RMSNorm: 2 × 4096 = 8,192 params (negligible)

Total per block: ~243M params

The FFN accounts for about 72% of each block's parameters. (Later, you will see variants like GQA that reduce K/V heads and shrink the attention share further.) A 32-layer model has 32 × 243M ≈ 7.8B parameters in its blocks alone, which is most of the model's total parameter count (embedding and output head add relatively little).
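The arithmetic above can be reproduced with a small helper. This is a sketch under the same assumptions as the table: full multi-head attention with four d_model × d_model matrices, a SwiGLU FFN with three matrices, two RMSNorm gain vectors, and no biases. The function name block_params is hypothetical.

```python
def block_params(d_model, d_ff):
    # Four square projections: W_Q, W_K, W_V, W_O.
    attention = 4 * d_model * d_model
    # SwiGLU uses three matrices: W_gate, W_up, W_down.
    ffn = 3 * d_model * d_ff
    # One gain vector per RMSNorm, two norms per block.
    norms = 2 * d_model
    return {
        "attention": attention,
        "ffn": ffn,
        "norms": norms,
        "total": attention + ffn + norms,
    }

counts = block_params(d_model=4096, d_ff=14336)
ffn_share = counts["ffn"] / counts["total"]     # fraction of the block held by the FFN
all_blocks = 32 * counts["total"]               # parameters in 32 stacked blocks

print(f"per block: {counts['total'] / 1e6:.1f}M, FFN share: {ffn_share:.1%}")
print(f"32 blocks: {all_blocks / 1e9:.2f}B")
```

Changing d_ff or swapping in fewer K/V heads (which this helper does not model) shifts the attention and FFN shares accordingly.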

What 32 Layers Accumulate

Each layer refines the hidden states slightly. But what does "refine" mean across 32 passes?

Empirical studies of transformer internals suggest a rough progression; the layer ranges here are approximate and given for a 32-layer model. Early layers (roughly 1-8) tend to learn local patterns: which tokens are adjacent, basic syntactic structure, common phrases. The residual stream after these layers carries mostly surface-level features. Middle layers (roughly 9-24) build semantic representations: subject-verb agreement across distances, entity tracking, basic reasoning chains. The residual stream becomes increasingly abstract. Late layers (roughly 25-32) sharpen predictions: they specialize the representation toward the specific next-token prediction task, often recovering surface details needed for generation.

This matters for performance reasoning: not all layers are equally important for a given task. Some pruning and distillation techniques exploit this by removing or simplifying middle layers that contribute less to specific downstream tasks. But for general-purpose generation, every layer contributes.

Toy Example

Full block flow for hidden states h:

input: h [n_tokens, d_model]

h_norm = RMSNorm(h)

attn_out = multi_head_attention(h_norm)

h = h + attn_out // residual

h_norm = RMSNorm(h)

ffn_out = FFN(h_norm)

h = h + ffn_out // residual

output: h [n_tokens, d_model] (same shape)
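The pseudocode above can be turned into a small runnable sketch. It assumes NumPy, random weights, a SwiGLU FFN, and no biases, and it leaves out RoPE and the KV cache for brevity, so it illustrates the data flow rather than a production block. The parameter names in the dictionary are hypothetical.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def causal_attention(x, p, n_heads):
    n_tokens, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # [n_tokens, d_model] -> [n_heads, n_tokens, d_head]
        return t.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ p["wq"]), split(x @ p["wk"]), split(x @ p["wv"])
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # [n_heads, n_tokens, n_tokens]
    future = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)            # causal mask
    out = softmax(scores) @ v                             # [n_heads, n_tokens, d_head]
    out = out.transpose(1, 0, 2).reshape(n_tokens, d_model)  # concatenate heads
    return out @ p["wo"]

def swiglu_ffn(x, p):
    gate = x @ p["w_gate"]
    silu = gate / (1.0 + np.exp(-gate))                   # SiLU activation
    return (silu * (x @ p["w_up"])) @ p["w_down"]

def decoder_block(h, p, n_heads):
    h = h + causal_attention(rms_norm(h, p["norm1"]), p, n_heads)  # token mixing + residual
    h = h + swiglu_ffn(rms_norm(h, p["norm2"]), p)                 # per-token transform + residual
    return h

def init_params(d_model, d_ff, rng, scale=0.02):
    def w(rows, cols):
        return rng.normal(scale=scale, size=(rows, cols))
    return {
        "norm1": np.ones(d_model), "norm2": np.ones(d_model),
        "wq": w(d_model, d_model), "wk": w(d_model, d_model),
        "wv": w(d_model, d_model), "wo": w(d_model, d_model),
        "w_gate": w(d_model, d_ff), "w_up": w(d_model, d_ff), "w_down": w(d_ff, d_model),
    }

rng = np.random.default_rng(0)
params = init_params(d_model=16, d_ff=48, rng=rng)
h_in = rng.normal(size=(5, 16))
h_out = decoder_block(h_in, params, n_heads=4)
print(h_in.shape, h_out.shape)   # input and output shapes match
```

Because the output has the same shape as the input, stacking N layers is just calling decoder_block N times with N different parameter sets.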

Shapes

Block input: [n_tokens, d_model]

Block output: [n_tokens, d_model]

The block as a whole preserves this shape. Internally, attention builds a score tensor of shape [n_heads, n_tokens, n_tokens] (one [n_tokens, n_tokens] matrix per head) and the FFN expands to [n_tokens, d_ff], but both are projected back, so the shape at the block boundary is always [n_tokens, d_model].

Math

The complete block, written compactly:

h = h + attention(RMSNorm(h)) // token mixing + residual

h = h + FFN(RMSNorm(h)) // per-token transform + residual

Two lines. That is the entire decoder block. Everything you learned in this module fills in the details of these two lines.
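The same two lines can be written with distinct symbols for the intermediate state, which makes explicit that the second residual adds to the input of the second norm. The notation here (h', the layer index ℓ, and a_ℓ, f_ℓ for the sub-layer outputs) is introduced for illustration and is not used elsewhere in this module.

```latex
\begin{aligned}
h'              &= h  + \operatorname{Attention}\big(\operatorname{RMSNorm}_1(h)\big) \\
h_{\text{out}}  &= h' + \operatorname{FFN}\big(\operatorname{RMSNorm}_2(h')\big)
\end{aligned}
\qquad\Longrightarrow\qquad
h_N = h_0 + \sum_{\ell=1}^{N} \big(a_\ell + f_\ell\big)
```

Here a_ℓ and f_ℓ are the attention and FFN outputs of layer ℓ. Unrolled over N layers, the final hidden state is the initial embedding plus the sum of every sub-layer's contribution, which is why each sub-layer only has to learn a delta.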

Implementation Hook

In llama.cpp, each model architecture (e.g. src/models/gemma.cpp) implements a build_graph() function that loops over layers. Inside each iteration, it calls build_norm(), build_attn(), adds the residual, calls build_norm() again, build_ffn(), and adds the second residual. The pattern matches the pseudocode above almost line for line.

src/models/gemma.cpp — layer loop (L22)

Performance Hook

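The full block cost is dominated by matrix multiplies: three for the Q/K/V projections, one for the attention output projection, and two or three for the FFN (three with a gated FFN such as SwiGLU). The attention score and weighted-sum products (Q·Kᵀ and weights·V) are also matrix multiplies; their cost grows with context length and becomes significant for long sequences. The norms, residual adds, softmax, and masking are comparatively cheap. For a model with N layers, every operation in this block runs N times per token generated.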
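A back-of-the-envelope estimate makes the matmul breakdown concrete. The sketch below assumes the usual convention of 2 FLOPs per multiply-add, full multi-head attention (no GQA), a SwiGLU FFN, and a single token being generated against n_ctx cached positions. The function name and the example context length of 2048 are illustrative assumptions.

```python
def block_matmul_flops_per_token(d_model, d_ff, n_ctx):
    # A vector-matrix product with a [rows, cols] weight costs about 2 * rows * cols FLOPs.
    projections = 2 * 4 * d_model * d_model      # W_Q, W_K, W_V, W_O
    ffn = 2 * 3 * d_model * d_ff                 # W_gate, W_up, W_down
    # Scores (q . k for n_ctx positions) and the weighted V sum, summed over all heads.
    attention_context = 2 * 2 * n_ctx * d_model
    total = projections + ffn + attention_context
    return {
        "projections": projections,
        "ffn": ffn,
        "attention_context": attention_context,
        "total": total,
    }

flops = block_matmul_flops_per_token(d_model=4096, d_ff=14336, n_ctx=2048)
for name, value in flops.items():
    print(f"{name:>18}: {value / 1e6:8.1f} MFLOPs")
```

Under these assumptions the weight matmuls dominate at this context length, while the context-dependent term grows linearly with n_ctx and catches up for long sequences.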

Shape Trace: Full Block

Trace the shapes through one complete decoder block. The model has 10 tokens, d_model=512, n_heads=8 (so d_head=64), and d_ff=2048.

One Decoder Block (10 tokens, d_model=512)

Block input: [10, 512]

→ After RMSNorm: [10, 512]

→ Q per head (8 heads, d_head=64): [10, 64]

→ Attention scores per head (Q·Kᵀ): [10, 10]

→ Attention output (all heads concatenated + W_O): [10, 512]

→ After first residual add: [10, 512]

→ After second RMSNorm: [10, 512]

→ FFN up-projection (d_ff=2048): [10, 2048]

→ FFN down-projection (back to d_model): [10, 512]

→ Block output (after second residual): [10, 512]
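The trace for this configuration can also be checked mechanically with plain shape arithmetic. This is a small sketch in pure Python; matmul_shape and trace_block_shapes are hypothetical helpers, and the per-head weight slice of shape [d_model, d_head] is an assumption about how W_Q is split across heads.

```python
def matmul_shape(a, b):
    # Shape of a @ b for two 2-D shapes, checking that the inner dimensions agree.
    if a[1] != b[0]:
        raise ValueError(f"inner dimensions differ: {a} @ {b}")
    return (a[0], b[1])

def trace_block_shapes(n_tokens, d_model, n_heads, d_ff):
    d_head = d_model // n_heads
    x = (n_tokens, d_model)
    trace = {"block input": x, "after RMSNorm": x}          # norms keep the shape

    q_head = matmul_shape(x, (d_model, d_head))              # one head's slice of W_Q
    scores = matmul_shape(q_head, (d_head, n_tokens))        # Q times K transposed
    head_out = matmul_shape(scores, (n_tokens, d_head))      # weights times V
    concat = (head_out[0], n_heads * head_out[1])            # heads side by side

    trace["Q per head"] = q_head
    trace["scores per head"] = scores
    trace["attention output"] = matmul_shape(concat, (d_model, d_model))  # W_O
    trace["after first residual"] = x
    trace["after second RMSNorm"] = x
    trace["FFN up-projection"] = matmul_shape(x, (d_model, d_ff))
    trace["FFN down-projection"] = matmul_shape(trace["FFN up-projection"], (d_ff, d_model))
    trace["block output"] = x
    return trace

trace = trace_block_shapes(n_tokens=10, d_model=512, n_heads=8, d_ff=2048)
for step, shape in trace.items():
    print(f"{step:>22}: {list(shape)}")
```

Only the per-head tensors and the FFN up-projection leave the [n_tokens, d_model] shape; every other step returns to it.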

Check Yourself

Q1 (conceptual)

What is the correct order of operations inside a single decoder block (pre-norm architecture)?

A. Attention → FFN → RMSNorm → Residual

B. RMSNorm → Attention → Add Residual → RMSNorm → FFN → Add Residual

C. FFN → RMSNorm → Attention → RMSNorm → Residual

Q2 (shape)

In the decoder block, which step is the token-mixing step (where tokens exchange information)?

A. RMSNorm: it normalizes across all tokens

B. Multi-head causal attention: tokens read from other positions via Q·Kᵀ scores and weighted V sums

C. FFN: it processes the full sequence through its weight matrices