
Attention Output Is a Token-Mixing Operation

What does the attention output actually produce?

The attention output for each token is a weighted mix of value vectors from the positions it attended to. The weights came from the Q-K score computation. The values carry the actual content.

This is where cross-token information actually moves. The score computation (Q ⋅ Kᵀ) decides where to look. The output computation (weights × V) determines what information arrives. These are two distinct steps, and they should not be conflated.

After the weighted sum of V, the result passes through one more linear projection: the output projection W_O. Why is this needed? Each head's weighted sum lives in d_head-dimensional space, and concatenating the heads gives n_heads × d_head dimensions. The residual stream expects d_model dimensions. W_O maps the concatenated multi-head output back to d_model and, in doing so, learns how to combine the information from different heads. In many models n_heads × d_head already equals d_model, so the mixing role matters more than the change of size. Without W_O, the heads' outputs would only be stacked side by side, not integrated.
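The concatenate-then-project step can be sketched in plain Python. This is a minimal illustration, not code from any library: the helper names `matmul` and `concat_heads`, the two head outputs, and the W_O values are all made up for the example.

```python
def matmul(a, b):
    """Multiply matrix a [n, k] by matrix b [k, m], both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def concat_heads(heads):
    """Join per-head outputs [n_tokens, d_head] into [n_tokens, n_heads * d_head]."""
    return [sum(rows, []) for rows in zip(*heads)]

# Two heads, one token, d_head = 2 (illustrative values only).
head_a = [[1.0, 2.0]]
head_b = [[3.0, 4.0]]
concat = concat_heads([head_a, head_b])   # [[1.0, 2.0, 3.0, 4.0]]

# Hypothetical W_O with shape [n_heads * d_head, d_model] = [4, 2].
# Each output column reads from both heads, which is how the heads get combined.
W_O = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
    [0.0, 1.0],
]
projected = matmul(concat, W_O)           # [[4.0, 6.0]]
print(concat, projected)
```

With this W_O, each output dimension is the sum of one component from head A and one from head B. A trained W_O learns its own combination.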

The projected result is what gets added to the residual stream. This is the complete attention sub-layer: project to Q/K/V → compute scores → scale by √d_head → mask → softmax → weighted V sum → project with W_O → add residual.
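The whole pipeline fits in a short single-head sketch. It assumes a causal mask (each token sees only itself and earlier positions) and leaves out normalization layers and multiple heads. The function name `attention_sublayer` and the identity weight matrices are illustrative choices, not taken from any real model.

```python
import math

def matmul(a, b):
    """Multiply matrix a [n, k] by matrix b [k, m], both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_sublayer(x, W_Q, W_K, W_V, W_O):
    """Single-head causal attention on x [n_tokens, d_model]; returns (x + attention output, weights)."""
    q, k, v = matmul(x, W_Q), matmul(x, W_K), matmul(x, W_V)    # project to Q/K/V
    d_head = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)                                     # compute scores: where to look
    scores = [[s / math.sqrt(d_head) for s in row] for row in scores]
    # Assumed causal mask: token i may not attend to positions j > i.
    scores = [[s if j <= i else -math.inf for j, s in enumerate(row)]
              for i, row in enumerate(scores)]
    weights = [softmax(row) for row in scores]                  # softmax
    mixed = matmul(weights, v)                                  # weighted V sum: what arrives
    projected = matmul(mixed, W_O)                              # project with W_O
    out = [[a + b for a, b in zip(x_row, p_row)]                # add residual
           for x_row, p_row in zip(x, projected)]
    return out, weights

# Toy input: 2 tokens, d_model = d_head = 2, identity projections (illustrative only).
I2 = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0]]
out, weights = attention_sublayer(x, I2, I2, I2, I2)
print(weights[0])  # [1.0, 0.0]: the first token can only attend to itself
print(out[0])      # [2.0, 0.0]: its own value vector added to its residual
```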

Token 3 has attention weights [0.1, 0.6, 0.3] over 3 positions:

V₁ = [1.0, 0.0]   V₂ = [0.0, 1.0]   V₃ = [0.5, 0.5]
output₃ = 0.1 × [1.0, 0.0] + 0.6 × [0.0, 1.0] + 0.3 × [0.5, 0.5]
          = [0.10, 0.00] + [0.00, 0.60] + [0.15, 0.15]
          = [0.25, 0.75]

Token 3's output is dominated by V₂ (weight 0.6) — it pulled most of its information from position 2.
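The arithmetic above can be checked with a few lines of Python. The helper name `weighted_sum` is just a label chosen for this sketch.

```python
def weighted_sum(weights, values):
    """Return sum_j weights[j] * values[j] for a list of equal-length value vectors."""
    d = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

weights = [0.1, 0.6, 0.3]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

output_3 = weighted_sum(weights, V)
print([round(x, 2) for x in output_3])  # [0.25, 0.75]
```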

After the weighted sum, each token's output vector is a mix of value vectors from the positions it attended to. What does this mix actually represent?

Consider token 3 with weights [0.1, 0.6, 0.3]. Its output is 10% V₁ + 60% V₂ + 30% V₃. If V₂ encodes "subject of the sentence," then token 3's output carries mostly that information. The attention mechanism has routed the subject information to token 3's representation, even though token 3 might be a verb or preposition. This is how context gets built: tokens absorb information from the positions that matter.

The shape of the attention weight distribution determines how information is routed:

  • Peaked: weights like [0.01, 0.95, 0.04] — the token focuses almost entirely on one position. The output is essentially a copy of that position's V vector. This happens when one position is overwhelmingly relevant (e.g., a pronoun attending to its antecedent).
  • Uniform: weights like [0.33, 0.33, 0.34] — the token spreads attention evenly. The output is close to a plain average of all V vectors. This can happen in early layers, in a model that has not yet learned fine-grained attention patterns, or when all positions are equally relevant.
  • Sparse: weights like [0.0, 0.0, 0.5, 0.0, 0.5] — the token focuses on a few specific positions. This is a common pattern in well-trained models: each head learns to attend to specific grammatical or semantic relationships.
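To see what each pattern does to the output, the sketch below applies the three example weight vectors to some value vectors. The V vectors are invented for illustration (the first three reuse the worked example), and `weighted_sum` is a helper name chosen here.

```python
def weighted_sum(weights, values):
    """Return sum_j weights[j] * values[j] for a list of equal-length value vectors."""
    d = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(d)]

V3 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]    # three positions
V5 = V3 + [[0.2, 0.8], [1.0, 1.0]]           # five positions, for the sparse case

peaked = weighted_sum([0.01, 0.95, 0.04], V3)
uniform = weighted_sum([0.33, 0.33, 0.34], V3)
sparse = weighted_sum([0.0, 0.0, 0.5, 0.0, 0.5], V5)

print([round(x, 2) for x in peaked])   # [0.03, 0.97], nearly a copy of V₂ = [0.0, 1.0]
print([round(x, 2) for x in uniform])  # [0.5, 0.5], the average of the three V vectors
print([round(x, 2) for x in sparse])   # [0.75, 0.75], halfway between positions 3 and 5
```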

Shapes at each step:

Attention weights per head: [n_tokens, n_tokens]   (from softmax)
V per head: [n_tokens, d_head]
Weighted sum per head: weights × V → [n_tokens, d_head]
Concatenate all n_heads outputs: [n_tokens, n_heads × d_head]
Output projection W_O: [n_heads × d_head, d_model]
Final attention output: [n_tokens, d_model]

In formula form:

head_i = softmax(Q_i ⋅ K_iᵀ / √d_head) ⋅ V_i   // [n_tokens, d_head], one per head
concat = [head_1, …, head_n_heads]   // [n_tokens, n_heads × d_head]
output = concat ⋅ W_O   // [n_tokens, d_model], projected back to d_model
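The shape bookkeeping can be traced with random data. This is a sketch under assumed sizes (5 tokens, 4 heads, d_head = 8, d_model = 32) that skips masking and the Q/K/V projections. All values are random and all helper names are invented for the example.

```python
import math
import random

def matmul(a, b):
    """Multiply matrix a [n, k] by matrix b [k, m], both given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def shape(m):
    return (len(m), len(m[0]))

n_tokens, n_heads, d_head, d_model = 5, 4, 8, 32   # assumed sizes
rng = random.Random(0)

def rand(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

head_outputs = []
for _ in range(n_heads):
    q, k, v = rand(n_tokens, d_head), rand(n_tokens, d_head), rand(n_tokens, d_head)
    scores = matmul(q, [list(col) for col in zip(*k)])              # [n_tokens, n_tokens]
    weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
    head_outputs.append(matmul(weights, v))                         # [n_tokens, d_head]

concat = [sum(rows, []) for rows in zip(*head_outputs)]             # [n_tokens, n_heads * d_head]
W_O = rand(n_heads * d_head, d_model)                               # [n_heads * d_head, d_model]
output = matmul(concat, W_O)                                        # [n_tokens, d_model]

print(shape(head_outputs[0]), shape(concat), shape(W_O), shape(output))
# (5, 8) (5, 32) (32, 32) (5, 32)
```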

In llama.cpp, build_attn() in src/llama-graph.cpp computes the weighted sum of V and applies the output projection (W_O). The result is then added to the residual stream. This is the final step of the attention sub-layer.

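The weights × V multiply is [n_tokens, n_tokens] × [n_tokens, d_head], costing n_tokens² × d_head multiply-adds per head. The output projection W_O is an additional matrix multiply, [n_tokens, n_heads × d_head] × [n_heads × d_head, d_model], costing n_tokens × (n_heads × d_head) × d_model multiply-adds. The first cost grows quadratically with sequence length and the second only linearly, so the weighted sum takes a larger share of the work as sequences get longer.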
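A quick count makes the trade-off concrete. The configuration below (32 heads, d_head = 128, d_model = 4096) is a hypothetical example, not the parameters of any specific model, and the function name `attention_output_cost` is invented for this sketch. It counts multiply-adds only.

```python
def attention_output_cost(n_tokens, n_heads, d_head, d_model):
    """Return (weighted-sum cost over all heads, output-projection cost) in multiply-adds."""
    weighted_sum = n_heads * n_tokens * n_tokens * d_head
    output_proj = n_tokens * (n_heads * d_head) * d_model
    return weighted_sum, output_proj

short_ws, short_op = attention_output_cost(n_tokens=1024, n_heads=32, d_head=128, d_model=4096)
long_ws, long_op = attention_output_cost(n_tokens=4096, n_heads=32, d_head=128, d_model=4096)

print(short_ws, short_op)  # 4294967296 17179869184
print(long_ws, long_op)    # 68719476736 68719476736
```

When n_heads × d_head equals d_model, the two costs are equal at n_tokens = d_model. Below that length the W_O projection is the larger term, and above it the weighted sum is.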

Check Yourself
Q1 (conceptual)

What is the difference between the score computation and the value mixing in attention?

Q2 (math)

If a token has attention weights [0.0, 0.0, 1.0] over 3 positions, what is its attention output?