Case Study: Reading a Dense Gemma Block

What does a dense Gemma block look like in real code?

You have learned every concept individually: RMSNorm, attention, residual connections, FFN. Now it is time to see them assembled in one place. The Gemma model builder in llama.cpp constructs a single transformer block in a straightforward sequence, and every line maps to a concept you already know.

A dense Gemma block follows the pre-norm transformer pattern exactly:

  1. RMSNorm the hidden states (attention pre-norm)
  2. Self-attention — Q, K, V projections, scaled dot-product, output projection
  3. Residual add — add attention output back to the input
  4. RMSNorm the result (FFN pre-norm)
  5. Feed-forward network — gate projection, up projection, activation, down projection
  6. Residual add — add FFN output back to the post-attention residual

That is the entire block. No surprises, no hidden steps. The builder code reads almost like pseudocode because each call corresponds directly to the mathematical operation it implements.
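Step 2 is the only step that mixes information across tokens, so it may help to see the scaled dot-product at its core in isolation. The following is a minimal single-head sketch for one query token. It assumes plain C++ vectors in place of ggml tensors and omits the Q, K, V, and output projections, RoPE, and masking. The names Vec, dot, and attend are illustrative and do not appear in llama.cpp.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Dot product of two equal-length vectors.
float dot(const Vec & a, const Vec & b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

// Single-head scaled dot-product attention for ONE query over n keys/values:
//   out = sum_j softmax_j(q . k_j / sqrt(d)) * v_j
// Scaling the scores by 1/sqrt(d) is equivalent to pre-scaling q by the same factor.
Vec attend(const Vec & q, const std::vector<Vec> & keys, const std::vector<Vec> & values) {
    assert(!q.empty() && !keys.empty() && keys.size() == values.size());
    const float scale = 1.0f / std::sqrt(static_cast<float>(q.size()));

    // Raw scores, one per key.
    Vec weights(keys.size());
    for (std::size_t j = 0; j < keys.size(); ++j) weights[j] = dot(q, keys[j]) * scale;

    // Numerically stable softmax: subtract the max before exponentiating.
    const float max_score = *std::max_element(weights.begin(), weights.end());
    float denom = 0.0f;
    for (float & w : weights) {
        w = std::exp(w - max_score);
        denom += w;
    }

    // Weighted sum of the value vectors.
    Vec out(values[0].size(), 0.0f);
    for (std::size_t j = 0; j < values.size(); ++j) {
        assert(values[j].size() == out.size());
        const float p = weights[j] / denom;
        for (std::size_t i = 0; i < out.size(); ++i) out[i] += p * values[j][i];
    }
    return out;
}
```

With a single key the softmax weight is 1 and the output equals that key's value vector. With several identical keys the output is the plain average of their values.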

Tracing one token's hidden state through a single dense Gemma block:

input: x = [0.5, −1.2, 0.8, 2.1]
phase 1: x_norm = RMSNorm(x)
phase 2: attn_out = self_attention(x_norm)
phase 3: r1 = x + attn_out   (residual)
phase 4: r1_norm = RMSNorm(r1)
phase 5: ffn_out = FFN(r1_norm)
phase 6: output = r1 + ffn_out   (residual)

Two sub-blocks (attention, FFN), each wrapped in norm-then-residual. This is all a dense block does.
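To put numbers on phase 1 of the trace, the sketch below applies RMSNorm to a single hidden-state vector. It assumes unit weights and eps = 1e-6, whereas a real checkpoint multiplies by a learned per-channel weight. The name rms_norm and its signature are illustrative, not llama.cpp's API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative RMSNorm with unit weights (an assumption; real models apply
// a learned per-channel scale after the normalization).
//   out[i] = x[i] / sqrt(mean(x^2) + eps)
std::vector<float> rms_norm(const std::vector<float> & x, float eps = 1e-6f) {
    assert(!x.empty());
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += v * v;
    const float scale = 1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);

    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = x[i] * scale;
    return out;
}
```

For x = [0.5, −1.2, 0.8, 2.1] the mean square is 6.74 / 4 = 1.685, so the RMS is about 1.298 and x_norm comes out near [0.385, −0.924, 0.616, 1.618]. The signs and relative sizes of the components are unchanged; only the overall scale is fixed.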

Input: [n_tokens, d_model]
After attention pre-norm: [n_tokens, d_model]
After attention + residual: [n_tokens, d_model]
After FFN pre-norm: [n_tokens, d_model]
After FFN + residual: [n_tokens, d_model]
Shape is preserved at every stage. Only the values change.

The complete dense block in one formula chain:

h' = x + attention(RMSNorm(x))
h'' = h' + FFN(RMSNorm(h'))
Two residual sub-blocks. Norm before each transform. That is the entire layer.
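The formula chain can be sketched as a short function that takes the two transforms as opaque callbacks. This is a toy under stated assumptions: plain nested vectors stand in for ggml tensors, RMSNorm uses unit weights, and the names Matrix, SubLayer, rms_norm_rows, residual_add, and dense_block are hypothetical. It is meant to show the wiring and the shape preservation, not how llama.cpp builds its graph.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Matrix   = std::vector<std::vector<float>>;        // [n_tokens][d_model]
using SubLayer = std::function<Matrix(const Matrix &)>;  // stand-in for attention or FFN

// RMSNorm applied independently to each token row (unit weights assumed).
Matrix rms_norm_rows(const Matrix & h, float eps = 1e-6f) {
    Matrix out = h;
    for (auto & row : out) {
        assert(!row.empty());
        float sum_sq = 0.0f;
        for (float v : row) sum_sq += v * v;
        const float scale = 1.0f / std::sqrt(sum_sq / static_cast<float>(row.size()) + eps);
        for (float & v : row) v *= scale;
    }
    return out;
}

// Element-wise residual add; both operands must have the same shape.
Matrix residual_add(const Matrix & a, const Matrix & b) {
    assert(a.size() == b.size());
    Matrix out = a;
    for (std::size_t t = 0; t < a.size(); ++t) {
        assert(a[t].size() == b[t].size());
        for (std::size_t i = 0; i < a[t].size(); ++i) out[t][i] += b[t][i];
    }
    return out;
}

// h'  = x  + attention(RMSNorm(x))
// h'' = h' + FFN(RMSNorm(h'))
Matrix dense_block(const Matrix & x, const SubLayer & attention, const SubLayer & ffn) {
    const Matrix h1 = residual_add(x,  attention(rms_norm_rows(x)));
    const Matrix h2 = residual_add(h1, ffn(rms_norm_rows(h1)));
    return h2;
}
```

Passing a sub-layer that returns all zeros leaves the input unchanged, which is a quick way to see that the residual path carries x through untouched. The asserts in residual_add also enforce the shape rule: each sub-layer must return a [n_tokens, d_model] result or the add fails.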

Here is the actual layer loop from src/models/gemma.cpp in llama.cpp. This is not pseudocode — it is the builder that constructs the computation graph for one dense Gemma block, repeated for every layer. Read it line by line and match each call to the six phases above.

for (int il = 0; il < n_layer; ++il) {

    // Phase 1: attention pre-norm
    cur = build_norm(inpL,
            model.layers[il].attn_norm, NULL,
            LLM_NORM_RMS, il);

    // Phase 2: self-attention (Q, K, V projections → RoPE → attention → output projection)
    {
        ggml_tensor * Qcur = build_lora_mm(model.layers[il].wq, cur);
        ggml_tensor * Kcur = build_lora_mm(model.layers[il].wk, cur);
        ggml_tensor * Vcur = build_lora_mm(model.layers[il].wv, cur);

        // ... reshape to heads, apply RoPE, scale queries ...

        cur = build_attn(inp_attn,
                model.layers[il].wo, NULL,
                Qcur, Kcur, Vcur, nullptr, nullptr, nullptr, 1.0f, il);
    }

    // Phase 3: residual add (attention output + input)
    ggml_tensor * sa_out = ggml_add(ctx0, cur, inpL);

    // Phase 4: FFN pre-norm
    cur = build_norm(sa_out,
            model.layers[il].ffn_norm, NULL,
            LLM_NORM_RMS, il);

    // Phase 5: feed-forward network (gate, up, down projections)
    {
        cur = build_ffn(cur,
                model.layers[il].ffn_up,   NULL, NULL,
                model.layers[il].ffn_gate, NULL, NULL,
                model.layers[il].ffn_down, NULL, NULL,
                NULL,
                LLM_FFN_GELU, LLM_FFN_PAR, il);
    }

    // Phase 6: residual add (FFN output + post-attention residual)
    cur = ggml_add(ctx0, cur, sa_out);

    inpL = cur;  // output becomes next layer's input
}

Source: ggml-org/llama.cpp @ 94ca829b, src/models/gemma.cpp. The actual code may have changed since this snapshot.

Notice: there is no magic. build_norm applies RMSNorm. build_lora_mm runs a matrix multiply (here, the Q, K, and V projections). build_attn runs the full attention mechanism. build_ffn runs the gated feed-forward network. ggml_add is the residual connection. The six phases listed earlier map directly onto these function calls.
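The LLM_FFN_GELU and LLM_FFN_PAR arguments in the excerpt suggest a GELU-gated FFN in which the gate and up projections both read the same normalized input. Assuming that reading is right, the arithmetic behind phase 5 can be sketched in plain C++ as follows. The tanh form of GELU is an assumption, biases are omitted, and Vec, Mat, gelu, matvec, and gated_ffn are illustrative names rather than llama.cpp functions.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>;  // row-major: Mat[out_dim][in_dim]

// tanh approximation of GELU (assumed; the exact variant depends on the backend).
float gelu(float v) {
    const float k = 0.7978845608f;  // sqrt(2 / pi)
    return 0.5f * v * (1.0f + std::tanh(k * (v + 0.044715f * v * v * v)));
}

// y = W * x
Vec matvec(const Mat & w, const Vec & x) {
    Vec y(w.size(), 0.0f);
    for (std::size_t i = 0; i < w.size(); ++i) {
        assert(w[i].size() == x.size());
        for (std::size_t j = 0; j < x.size(); ++j) y[i] += w[i][j] * x[j];
    }
    return y;
}

// out = W_down * ( gelu(W_gate * x) * (W_up * x) ), with * taken element-wise.
// The gate and up projections run "in parallel" on the same input x.
Vec gated_ffn(const Mat & w_gate, const Mat & w_up, const Mat & w_down, const Vec & x) {
    const Vec g = matvec(w_gate, x);  // [d_ff]
    const Vec u = matvec(w_up, x);    // [d_ff]
    assert(g.size() == u.size());

    Vec h(g.size());
    for (std::size_t i = 0; i < g.size(); ++i) h[i] = gelu(g[i]) * u[i];

    return matvec(w_down, h);         // back to [d_model]
}
```

The gate decides how much of each up-projected channel passes through, and the down projection returns the result to d_model so that the residual add in phase 6 lines up.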

The code above is from the Gemma model builder. Each model file in src/models/ constructs its architecture by calling shared helpers from src/llama-graph.cpp. The helpers (build_norm, build_attn, build_ffn) are reused across all model architectures — what changes between models is which helpers are called and in what order.

In a dense block, every token is processed by every weight in the block. The cost of the attention phase grows with sequence length, because each query token interacts with every token it attends to, while the cost of the FFN phase is fixed per token regardless of sequence length. During prefill with long contexts, attention compute dominates. During single-token decode, the FFN projections account for most of the per-token FLOPs because there is only one query token, though attention still contributes memory traffic from reading the KV cache.
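A back-of-the-envelope count makes the trade-off concrete. The sketch below estimates FLOPs for one query token in one dense block, under simplifying assumptions that may not match a given Gemma variant: plain multi-head attention with n_heads × head_dim = d_model, a gated FFN with three d_model × d_ff matrices, one multiply-add counted as 2 FLOPs, and norms, softmax, and RoPE ignored. BlockFlops and per_token_flops are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Rough per-token FLOP counts for one dense block (multiply-add = 2 FLOPs).
struct BlockFlops {
    std::int64_t attn_proj;  // Q, K, V, and output projections
    std::int64_t attn_mix;   // QK^T scores plus the weighted sum over V
    std::int64_t ffn;        // gate, up, and down projections
};

// n_ctx is the number of keys the query token attends to.
BlockFlops per_token_flops(std::int64_t d_model, std::int64_t d_ff, std::int64_t n_ctx) {
    assert(d_model > 0 && d_ff > 0 && n_ctx > 0);
    BlockFlops f;
    f.attn_proj = 2 * 4 * d_model * d_model;  // four d_model x d_model matmuls
    f.attn_mix  = 2 * 2 * n_ctx * d_model;    // grows linearly with context
    f.ffn       = 2 * 3 * d_model * d_ff;     // independent of context
    return f;
}
```

For hypothetical sizes d_model = 2048 and d_ff = 16384 (not taken from any specific checkpoint), the FFN term is about 201 MFLOPs per token, the projections about 34 MFLOPs, and the attention mix about 8 MFLOPs at n_ctx = 1024. Under these assumptions the attention mix overtakes the FFN only once n_ctx exceeds 1.5 × d_ff, or 24576 tokens here.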

Check Yourself
Q1 (conceptual)

What are the six phases of a dense Gemma block, in order?

Q2 (conceptual)

Why does the builder call build_norm() twice per layer?

Q3 (conceptual)

What is the residual connection doing in each sub-block?