L39

Case Study: Reading a Dense Gemma Block

18 min

Question

What does a dense Gemma block look like in real code?

Intuition

You have learned every concept individually: RMSNorm, attention, residual connections, FFN. Now it is time to see them assembled in one place. The Gemma model builder in llama.cpp constructs a single transformer block in a straightforward sequence, and every line maps to a concept you already know.

A dense Gemma block follows the pre-norm transformer pattern exactly:

1. RMSNorm the hidden states (attention pre-norm)
2. Self-attention: Q, K, V projections, scaled dot-product, output projection
3. Residual add: add attention output back to the input
4. RMSNorm the result (FFN pre-norm)
5. Feed-forward network: gate projection, up projection, activation, down projection
6. Residual add: add FFN output back to the post-attention residual

That is the entire block. No surprises, no hidden steps. The builder code reads almost like pseudocode because each call corresponds directly to the mathematical operation it implements.

Toy Example

Tracing one token's hidden state through a single dense Gemma block:

input: x = [0.5, −1.2, 0.8, 2.1]

phase 1: x_norm = RMSNorm(x)

phase 2: attn_out = self_attention(x_norm)

phase 3: r1 = x + attn_out (residual)

phase 4: r1_norm = RMSNorm(r1)

phase 5: ffn_out = FFN(r1_norm)

phase 6: output = r1 + ffn_out (residual)

Two sub-blocks (attention, FFN), each preceded by an RMSNorm and followed by a residual add. This is all a dense block does.
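To make the trace concrete, the sketch below runs the six phases on the toy vector in plain C++. It is a minimal illustration, not llama.cpp code: rms_norm omits the learned gain, eps = 1e-6 is an assumed value, and the attention and FFN sub-blocks are hypothetical stand-ins supplied by the caller.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Vec = std::vector<float>;

// RMSNorm without a learned gain: x / sqrt(mean(x^2) + eps).
// eps = 1e-6 is an assumed value for illustration.
Vec rms_norm(const Vec & x, float eps = 1e-6f) {
    assert(!x.empty());
    float sum_sq = 0.0f;
    for (float v : x) sum_sq += v * v;
    const float inv_rms = 1.0f / std::sqrt(sum_sq / static_cast<float>(x.size()) + eps);
    Vec out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) out[i] = x[i] * inv_rms;
    return out;
}

// Element-wise residual add; both operands must have the same length.
Vec add(const Vec & a, const Vec & b) {
    assert(a.size() == b.size());
    Vec out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
    return out;
}

// Hypothetical stand-in type for a sub-block (attention or FFN).
using SubBlock = std::function<Vec(const Vec &)>;

// The six phases of a pre-norm block for a single token.
Vec dense_block(const Vec & x, const SubBlock & attn, const SubBlock & ffn) {
    Vec x_norm   = rms_norm(x);       // phase 1: attention pre-norm
    Vec attn_out = attn(x_norm);      // phase 2: self-attention (stand-in)
    Vec r1       = add(x, attn_out);  // phase 3: residual add
    Vec r1_norm  = rms_norm(r1);      // phase 4: FFN pre-norm
    Vec ffn_out  = ffn(r1_norm);      // phase 5: feed-forward (stand-in)
    return add(r1, ffn_out);          // phase 6: residual add
}
```

For x = [0.5, −1.2, 0.8, 2.1], the mean of squares is 1.685, so phase 1 gives x_norm ≈ [0.385, −0.924, 0.616, 1.618]. If both stand-ins return zeros, dense_block returns x unchanged, which shows that the residual path carries the input through and each sub-block only contributes a change on top of it.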

Shapes

Input: [n_tokens, d_model]

After attention pre-norm: [n_tokens, d_model]

After attention + residual: [n_tokens, d_model]

After FFN pre-norm: [n_tokens, d_model]

After FFN + residual: [n_tokens, d_model]

Shape is preserved at every stage. Only the values change.
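One way to see why the shape cannot change at block boundaries is to track shapes only, ignoring values. The sketch below does that under stated assumptions: the Shape struct and helper names are hypothetical, attention is treated as a black box that returns [n_tokens, d_model] after its output projection, and d_ff stands for the FFN hidden width, which the text above does not name.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical shape record: [rows, cols] = [n_tokens, width].
struct Shape {
    std::size_t rows;
    std::size_t cols;
};

// [rows, in] times a weight of shape [in, out] gives [rows, out].
Shape project(Shape x, std::size_t in, std::size_t out) {
    assert(x.cols == in);
    return {x.rows, out};
}

// A residual add is element-wise, so both operands must match exactly.
Shape residual_add(Shape a, Shape b) {
    assert(a.rows == b.rows && a.cols == b.cols);
    return a;
}

// Gated FFN: widen to d_ff twice (gate and up), multiply element-wise,
// then project back down to d_model.
Shape ffn_shape(Shape x, std::size_t d_model, std::size_t d_ff) {
    Shape gate = project(x, d_model, d_ff);
    Shape up   = project(x, d_model, d_ff);
    assert(gate.rows == up.rows && gate.cols == up.cols);
    return project(up, d_ff, d_model);
}

// Shapes through one block; attention is assumed to return its input shape.
Shape block_shape(Shape x, std::size_t d_model, std::size_t d_ff) {
    Shape attn_out = x;                          // after output projection
    Shape r1       = residual_add(x, attn_out);  // [n_tokens, d_model]
    Shape ffn_out  = ffn_shape(r1, d_model, d_ff);
    return residual_add(r1, ffn_out);            // [n_tokens, d_model]
}
```

Inside the FFN the width temporarily becomes d_ff, but the two residual adds force every sub-block to return to [n_tokens, d_model], which is what lets identical blocks be stacked.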

Math

The complete dense block in one formula chain:

h' = x + attention(RMSNorm(x))

h'' = h' + FFN(RMSNorm(h'))

Two residual sub-blocks. Norm before each transform. That is the entire layer.
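For reference, the two building blocks inside that chain can be written out for a single token vector. This is a sketch using the standard definitions, with notation introduced here as an assumption: d is d_model, g is the learned gain vector (the attn_norm or ffn_norm tensor), ε is a small constant, and the FFN uses the gated GELU form that the LLM_FFN_GELU and LLM_FFN_PAR flags in the builder code indicate.

```latex
\begin{aligned}
\operatorname{RMSNorm}(x)_i &= g_i \,\frac{x_i}{\sqrt{\tfrac{1}{d}\sum_{j=1}^{d} x_j^{2} + \varepsilon}} \\[4pt]
\operatorname{FFN}(u) &= W_{\text{down}}\,\bigl(\operatorname{GELU}(W_{\text{gate}}\,u) \odot (W_{\text{up}}\,u)\bigr) \\[4pt]
h' &= x + \operatorname{attention}(\operatorname{RMSNorm}(x)) \\
h'' &= h' + \operatorname{FFN}(\operatorname{RMSNorm}(h'))
\end{aligned}
```

Here ⊙ is the element-wise product, and the two RMSNorm calls use different gain vectors, one per sub-block.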

The Real Code

Here is the layer loop from src/models/gemma.cpp in llama.cpp, lightly abridged: the reshape, RoPE, and query-scaling steps are collapsed into a single comment. This is not pseudocode. It is the builder that constructs the computation graph for one dense Gemma block, repeated for every layer. Read it line by line and match each call to the six phases above.

for (int il = 0; il < n_layer; ++il) {

    // Phase 1: attention pre-norm
    cur = build_norm(inpL,
            model.layers[il].attn_norm, NULL,
            LLM_NORM_RMS, il);

    // Phase 2: self-attention (Q, K, V projections → RoPE → attention → output projection)
    {
        ggml_tensor * Qcur = build_lora_mm(model.layers[il].wq, cur);
        ggml_tensor * Kcur = build_lora_mm(model.layers[il].wk, cur);
        ggml_tensor * Vcur = build_lora_mm(model.layers[il].wv, cur);

        // ... reshape to heads, apply RoPE, scale queries ...

        cur = build_attn(inp_attn,
                model.layers[il].wo, NULL,
                Qcur, Kcur, Vcur, nullptr, nullptr, nullptr, 1.0f, il);
    }

    // Phase 3: residual add (attention output + input)
    ggml_tensor * sa_out = ggml_add(ctx0, cur, inpL);

    // Phase 4: FFN pre-norm
    cur = build_norm(sa_out,
            model.layers[il].ffn_norm, NULL,
            LLM_NORM_RMS, il);

    // Phase 5: feed-forward network (gate, up, down projections)
    {
        cur = build_ffn(cur,
                model.layers[il].ffn_up,   NULL, NULL,
                model.layers[il].ffn_gate, NULL, NULL,
                model.layers[il].ffn_down, NULL, NULL,
                NULL,
                LLM_FFN_GELU, LLM_FFN_PAR, il);
    }

    // Phase 6: residual add (FFN output + post-attention residual)
    cur = ggml_add(ctx0, cur, sa_out);

    inpL = cur;  // output becomes next layer's input
}

Source: ggml-org/llama.cpp @ 94ca829b — src/models/gemma.cpp. The actual code may have changed since this snapshot.

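Notice: there is no magic. build_norm applies RMSNorm, selected by LLM_NORM_RMS. build_lora_mm multiplies its input by a weight matrix (plus any loaded LoRA adapter, as the name suggests), here producing the Q, K, and V projections. build_attn runs the attention mechanism and applies the output projection wo. build_ffn runs the gated feed-forward network, with LLM_FFN_GELU selecting the activation and LLM_FFN_PAR the parallel gate. ggml_add is the residual connection. The six phases listed in the Intuition section map directly onto these function calls.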
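To show what the single build_ffn call stands for, here is a minimal single-token sketch of a gated GELU feed-forward network in plain C++. It is not the llama.cpp implementation: the Mat type, matvec, and gated_ffn are hypothetical names, the tanh approximation of GELU is an assumption, and biases, LoRA, batching, and quantization are all left out.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;
using Mat = std::vector<Vec>;  // hypothetical layout: Mat[out][in]

// y = W x for a dense row-major matrix.
Vec matvec(const Mat & w, const Vec & x) {
    Vec y(w.size(), 0.0f);
    for (std::size_t r = 0; r < w.size(); ++r) {
        assert(w[r].size() == x.size());
        for (std::size_t c = 0; c < x.size(); ++c) y[r] += w[r][c] * x[c];
    }
    return y;
}

// GELU, tanh approximation (assumed variant for this sketch).
float gelu(float v) {
    const float k = 0.7978845608f;  // sqrt(2 / pi)
    return 0.5f * v * (1.0f + std::tanh(k * (v + 0.044715f * v * v * v)));
}

// Parallel gate: down( gelu(gate(x)) * up(x) ), element-wise product.
Vec gated_ffn(const Mat & w_gate, const Mat & w_up, const Mat & w_down, const Vec & x) {
    Vec gate = matvec(w_gate, x);  // [d_ff]
    Vec up   = matvec(w_up, x);    // [d_ff]
    assert(gate.size() == up.size());
    Vec hidden(gate.size());
    for (std::size_t i = 0; i < hidden.size(); ++i) hidden[i] = gelu(gate[i]) * up[i];
    return matvec(w_down, hidden); // back to [d_model]
}
```

The gate and up projections both read the same normalized input, which is what "parallel" refers to. The activation is applied only on the gate path, and the down projection restores the model width so the result can be added to the residual.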

Implementation Hook

The code above is from the Gemma model builder. Each model file in src/models/ constructs its architecture by calling shared helpers from src/llama-graph.cpp. The helpers (build_norm, build_attn, build_ffn) are reused across model architectures. What changes between models is which helpers are called, in what order, and with which options, such as the LLM_NORM_RMS and LLM_FFN_GELU flags seen here.

src/models/gemma.cpp — layer loop (L22)

Performance Hook

In a dense block, every token passes through every parameter. The cost of the attention phase grows with sequence length, because each token interacts with all the tokens it attends to, while the cost of the FFN phase is fixed per token regardless of sequence length. During prefill with long contexts, attention compute dominates. During single-token decode at short to moderate context lengths, the FFN projections account for most of the per-token FLOPs because there is only one query token. Attention still contributes memory traffic from reading the KV cache, and its share grows as the context gets longer.
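A rough FLOP count makes the trade-off concrete. The sketch below is a back-of-envelope estimate under stated assumptions, not a profile of llama.cpp: a [1, in] by [in, out] multiply is counted as 2 * in * out FLOPs, d_attn (heads times head size) is a hypothetical parameter, K and V are assumed to be as wide as Q, and softmax, norms, and RoPE are ignored.

```cpp
#include <cassert>
#include <cstdint>

using u64 = std::uint64_t;

// One token through a weight of shape [in, out]: about 2 * in * out FLOPs.
u64 matmul_flops(u64 in, u64 out) {
    return 2 * in * out;
}

// Gated FFN: gate and up (d_model -> d_ff), then down (d_ff -> d_model).
u64 ffn_flops_per_token(u64 d_model, u64 d_ff) {
    assert(d_model > 0 && d_ff > 0);
    return 2 * matmul_flops(d_model, d_ff) + matmul_flops(d_ff, d_model);
}

// Q, K, V projections (d_model -> d_attn) plus output projection back.
u64 attn_proj_flops_per_token(u64 d_model, u64 d_attn) {
    return 3 * matmul_flops(d_model, d_attn) + matmul_flops(d_attn, d_model);
}

// Context-dependent part for one query token attending to n_ctx keys:
// scores against every key, then the weighted sum over every value.
u64 attn_mix_flops_per_token(u64 d_attn, u64 n_ctx) {
    return 2 * matmul_flops(n_ctx, d_attn);
}
```

With hypothetical sizes d_model = d_attn = 2048 and d_ff = 16384, the FFN costs about 201 million FLOPs per token, and the context-dependent attention term only matches it at n_ctx = 24576. Below that length the fixed weight multiplies dominate a decode step. Above it, attention does.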

Check Yourself

Q1 (conceptual)

What are the six phases of a dense Gemma block, in order?

Attention, norm, residual, FFN, norm, residual
Norm, attention, residual, norm, FFN, residual
Norm, FFN, residual, norm, attention, residual

Q2 (conceptual)

Why does the builder call build_norm() twice per layer?

One normalizes the attention weights, the other normalizes the FFN weights
One is the attention pre-norm, the other is the FFN pre-norm; each sub-block gets its own normalization
The second call is a post-norm applied after the entire block

Q3 (conceptual)

What is the residual connection doing in each sub-block?

Replacing the input with the transformed output
Concatenating the input and output to double the dimension
Adding the sub-block output back to its input, so the layer only needs to learn the change