L14

What a Transformer Layer Does

8 min

Question

What does a transformer layer do?

Intuition

So far, you know how text becomes token IDs, how token IDs become embedding vectors, how projections transform vectors, how logits score the vocabulary, and how softmax produces probabilities. The missing piece is the model itself — the part that transforms initial embeddings into the rich hidden state that the output head reads.

That model is built from layers. A "layer" (also called a "block") is a fixed sequence of operations that takes in a set of hidden states — one vector per token (the [n_tokens, d_model] matrix from M2) — and produces updated hidden states of the same shape. The model repeats this layer many times: a 32-layer model applies the same pattern 32 times, each with its own learned weights. The output of layer 1 becomes the input to layer 2, and so on.

Inside each layer, two things happen:

Attention: Tokens look at each other. Information moves between positions. This is token mixing.
FFN (feed-forward network): Each token is transformed independently. No cross-token communication. This is per-token transformation.

That is the entire structure. Every layer does attention then FFN, with normalization and residual connections around each.
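The structure described above can be sketched in plain Python. This is a toy under stated assumptions, not how any particular model implements it: the normalization is placed before each sublayer (one common arrangement), attention is single-head with no output projection and no causal mask, the weights are random, and every helper name (layer, make_layer_weights, and so on) is made up for illustration.

```python
import math
import random

def matvec(W, x):
    # W has shape [out_dim][in_dim]; x has shape [in_dim].
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rms_norm(x, eps=1e-6):
    # RMS normalization without a learned gain (a simplification).
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(h, w):
    # Token mixing: each output row blends value vectors from every position.
    q = [matvec(w["Wq"], x) for x in h]
    k = [matvec(w["Wk"], x) for x in h]
    v = [matvec(w["Wv"], x) for x in h]
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(wt * vj[c] for wt, vj in zip(weights, v)) for c in range(d)])
    return out

def silu(x):
    return x / (1.0 + math.exp(-x))

def ffn(h, w):
    # Per-token transformation: each row is processed without seeing the others.
    return [matvec(w["W2"], [silu(a) for a in matvec(w["W1"], x)]) for x in h]

def add(h, delta):
    # Residual connection: add the sublayer output back onto its input.
    return [[a + b for a, b in zip(x, dx)] for x, dx in zip(h, delta)]

def layer(h, w):
    h = add(h, attention([rms_norm(x) for x in h], w))  # phase 1: mix
    h = add(h, ffn([rms_norm(x) for x in h], w))        # phase 2: transform
    return h

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def make_layer_weights(rng, d_model, d_ff):
    return {
        "Wq": random_matrix(rng, d_model, d_model),
        "Wk": random_matrix(rng, d_model, d_model),
        "Wv": random_matrix(rng, d_model, d_model),
        "W1": random_matrix(rng, d_ff, d_model),
        "W2": random_matrix(rng, d_model, d_ff),
    }

rng = random.Random(0)
n_tokens, d_model, d_ff, n_layers = 3, 4, 8, 32

# Same pattern in every layer, but a separate set of weights for each one.
weights = [make_layer_weights(rng, d_model, d_ff) for _ in range(n_layers)]

h = random_matrix(rng, n_tokens, d_model)  # stand-in for the embedding output
for w in weights:
    h = layer(h, w)  # the output of one layer is the input to the next

print(len(h), len(h[0]))  # 3 4
```

The loop at the bottom is the whole model body in miniature: the list of weight dictionaries plays the role of the per-layer learned weights, and the hidden-state matrix keeps its [n_tokens, d_model] shape from the first layer to the last.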

Why Two Phases and Not One?

Attention and FFN serve complementary roles, and neither can replace the other:

Attention alone is not enough. Attention routes information between token positions. While the softmax makes the routing data-dependent and nonlinear, the final weighted sum of V vectors is linear in V — it blends existing representations without creating new per-token features. The FFN adds per-token nonlinear capacity (via activation functions like SiLU) that creates genuinely new features from the blended input.
FFN alone is not enough. The FFN processes each token independently — it has no way to compare one token to another. Without attention, the model would treat each word in isolation, as if it had no context. Attention provides the cross-token communication.

Together, they form a complete unit: attention decides what information to gather from the sequence, and the FFN decides what to do with it. This alternating pattern — mix then transform, mix then transform — is what builds increasingly rich representations over 32+ layers.
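One way to see the difference between the two phases is to change a single token and check which output rows move. The sketch below is an illustration under simplifying assumptions: attention uses Q = K = V = h with no learned projections and no causal mask, the FFN weights W1 and W2 are arbitrary hand-picked numbers, and changed_rows is a hypothetical helper.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(v - m) for v in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(h):
    # Simplified self-attention with Q = K = V = h (no learned projections).
    d = len(h[0])
    out = []
    for q in h:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in h]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, h)) for c in range(d)])
    return out

def silu(x):
    return x / (1.0 + math.exp(-x))

W1 = [[0.5, -0.3], [0.1, 0.8], [-0.6, 0.2]]  # [d_ff=3][d_model=2], arbitrary values
W2 = [[0.4, -0.7, 0.2], [0.3, 0.5, -0.1]]    # [d_model=2][d_ff=3], arbitrary values

def ffn(h):
    out = []
    for x in h:  # each token is handled on its own
        hidden = [silu(sum(w * xi for w, xi in zip(row, x))) for row in W1]
        out.append([sum(w * a for w, a in zip(row, hidden)) for row in W2])
    return out

def changed_rows(before, after, tol=1e-12):
    # Indices of token positions whose output vector differs.
    return [i for i, (a, b) in enumerate(zip(before, after))
            if any(abs(x - y) > tol for x, y in zip(a, b))]

h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
h_perturbed = [row[:] for row in h]
h_perturbed[0] = [2.0, -1.0]  # change only the first token

ffn_changed = changed_rows(ffn(h), ffn(h_perturbed))
attn_changed = changed_rows(attention(h), attention(h_perturbed))

print("FFN rows changed:", ffn_changed)         # [0]
print("attention rows changed:", attn_changed)  # [0, 1, 2]
```

Changing one token moves only that token's FFN output, while every attention output row shifts, because each row is a weighted blend that includes the changed token.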

What 32 Layers of This Actually Do

As a rough picture (the boundaries are not sharp, and the details vary from model to model): the early layers tend to build basic features, such as which words are nearby, simple syntactic patterns, and local context. The middle layers tend to assemble these into more abstract features, such as sentence-level meaning, entity tracking, and argument structure. The deep layers refine the final prediction: what token is most likely next, given everything the model has gathered.

Each layer reads the full hidden-state matrix produced by the previous layer and writes a new full hidden-state matrix. The shape never changes — it is always [n_tokens, d_model] in and [n_tokens, d_model] out. What changes is the content of those vectors: the information they encode becomes richer and more prediction-relevant with each successive layer.

The remaining pages in this module will open the attention and FFN boxes. By the end of M4, you will know exactly what happens inside both operations, including the math, shapes, and code.

Toy Example

3 tokens passing through one layer:

input: h₀ = [token₁_vec, token₂_vec, token₃_vec]

attention — tokens mix information

FFN — each token transformed independently

output: h₁ = [updated₁_vec, updated₂_vec, updated₃_vec]

Same number of tokens, same vector dimensions. The values changed, the shape did not.

Shapes

Layer input: [n_tokens, d_model]

Layer output: [n_tokens, d_model]

Shape is preserved. Content changes.

Math

No new formulas. The key concept is the two-phase structure, shown here in simplified form with the normalization and residual connections left out:

h' = attention(h) // tokens interact

h'' = FFN(h') // each token independently
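With the normalization and residual connections written in, one common arrangement looks like the following. This assumes normalization is applied before each sublayer; some architectures place it differently, and the symbols Norm_1 and Norm_2 are notation introduced here only for illustration.

```latex
\begin{aligned}
h'  &= h  + \mathrm{Attention}\big(\mathrm{Norm}_1(h)\big) \\
h'' &= h' + \mathrm{FFN}\big(\mathrm{Norm}_2(h')\big)
\end{aligned}
```

Both lines add the sublayer output onto its own input, which is why the shape [n_tokens, d_model] cannot change from layer input to layer output.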

Implementation Hook

In llama.cpp, each model builder (e.g. src/models/gemma.cpp) loops over layers, calling build_attn() and build_ffn() in sequence. You will see this structure explicitly in the case studies.

src/llama-graph.cpp — build_attn() (L1990), build_ffn() (L1046)

Performance Hook

Attention and FFN have different compute profiles. FFN cost grows with model width and is proportional to the number of tokens being processed. Attention cost also grows with the number of tokens, but its token-to-token interactions add a term that grows quadratically with sequence length. Understanding which one dominates requires knowing the shapes, which you now do.
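A back-of-the-envelope estimate makes the comparison concrete. The sketch below counts only matrix-multiply FLOPs (one multiply-add = 2 FLOPs) for processing a whole sequence at once. It assumes standard attention with four [d_model, d_model] projections, a two-matrix FFN, and hypothetical sizes d_model = 4096 and d_ff = 4 * d_model; softmax, normalization, masking, and caching are ignored, so treat the numbers as rough.

```python
def attention_flops(n_tokens, d_model):
    # Q, K, V, and output projections: four [d_model, d_model] matmuls per token.
    projections = 4 * 2 * n_tokens * d_model * d_model
    # Scores (Q @ K^T) and the weighted sum over V: both scale with n_tokens^2.
    token_mixing = 2 * 2 * n_tokens * n_tokens * d_model
    return projections + token_mixing

def ffn_flops(n_tokens, d_model, d_ff):
    # Up-projection and down-projection; a gated FFN would add a third matmul.
    return 2 * 2 * n_tokens * d_model * d_ff

d_model = 4096      # hypothetical model width
d_ff = 4 * d_model  # hypothetical FFN width

for n_tokens in (128, 2048, 32768):
    a = attention_flops(n_tokens, d_model)
    f = ffn_flops(n_tokens, d_model, d_ff)
    print(f"n_tokens={n_tokens:>6}  attention/FFN = {a / f:.2f}")
```

Under these particular counts the FFN is the larger cost for short sequences, and the quadratic attention term overtakes it once n_tokens exceeds 2 * d_model. Real models shift that crossover depending on their exact shapes.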

Check Yourself

Q1 (conceptual)

Which operation allows tokens to exchange information with each other?

FFN: it processes all tokens together
Attention: it computes interactions between token positions
Both: they are interchangeable

Q2 (shape)

If the input to a layer has shape [8, 512], what is the output shape?

[8, 512]
[8, 1024]
[1, 512]