L18

Attention Output Is a Token-Mixing Operation

12 min

Question

What does the attention output actually produce?

Intuition

The attention output for each token is a weighted mix of the value vectors at the positions it attended to. The weights come from the Q-K score computation, after softmax. The values carry the actual content.

This is where cross-token information actually moves. The score computation (Q ⋅ Kᵀ) decides where to look. The output computation (weights × V) determines what information arrives. These are two distinct steps, and it is important not to conflate them.

After the weighted sum of V, the result passes through one more linear projection: the output projection W_O. Why is this needed? Each head's weighted sum lives in d_head-dimensional space, and concatenating the heads gives n_heads × d_head dimensions. The residual stream expects d_model dimensions. W_O maps the concatenated multi-head output back to d_model (in many models n_heads × d_head already equals d_model, so W_O is square), and in doing so it learns how to combine the information from different heads. Without W_O, the head outputs would just be stacked side by side in separate slices of the vector, not integrated within the attention sub-layer.

The projected result is what gets added to the residual stream. This is the complete attention sub-layer: project to Q/K/V → compute scores → mask → softmax → weighted V sum → project with W_O → add residual.
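The pipeline above can be sketched end to end in a few lines. This is a minimal single-head illustration in plain Python, not how any particular library implements it: the helper names (matmul, softmax, attention_sublayer) and the identity weight matrices are assumptions chosen for readability, and the normalization that usually precedes attention is left out.

```python
import math

def matmul(a, b):
    """Multiply matrices stored as lists of rows: [n, k] x [k, m] -> [n, m]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_sublayer(x, w_q, w_k, w_v, w_o):
    """Single-head causal attention, then W_O, then the residual add."""
    n = len(x)
    # 1. Project to Q / K / V.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(q[0])
    # 2. Scores: Q . K^T -> [n, n].
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = []
    for i in range(n):
        # 3. Scale, and mask future positions (j > i) with -inf.
        row = [scores[i][j] / math.sqrt(d_head) if j <= i else -math.inf
               for j in range(n)]
        # 4. Softmax turns the masked scores into weights that sum to 1.
        weights.append(softmax(row))
    # 5. Weighted V sum: this is where information moves between tokens.
    mixed = matmul(weights, v)
    # 6. Output projection W_O back to d_model.
    projected = matmul(mixed, w_o)
    # 7. Add to the residual stream.
    out = [[a + b for a, b in zip(x_row, p_row)] for x_row, p_row in zip(x, projected)]
    return out, weights

# Hypothetical toy input: 3 tokens, d_model = d_head = 2, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
x = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
out, weights = attention_sublayer(x, identity, identity, identity, identity)
print(weights[0])  # [1.0, 0.0, 0.0]: the first token can only attend to itself
print(out[0])      # [2.0, 0.0]: its own value vector added to its residual
```

With the causal mask, the first token's weights collapse onto itself, so its output is its own value vector passed through W_O and added back to its input.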

Toy Example

Token 3 has attention weights [0.1, 0.6, 0.3] over 3 positions:

V₁ = [1.0, 0.0] V₂ = [0.0, 1.0] V₃ = [0.5, 0.5]

output₃ = 0.1 × [1.0, 0.0] + 0.6 × [0.0, 1.0] + 0.3 × [0.5, 0.5]

= [0.10, 0.00] + [0.00, 0.60] + [0.15, 0.15]

= [0.25, 0.75]

Token 3's output is dominated by V₂ (weight 0.6) — it pulled most of its information from position 2.
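The same arithmetic can be checked in a few lines of plain Python. The function name weighted_sum is just an illustrative helper; the weights and value vectors are the ones from the example above.

```python
def weighted_sum(weights, values):
    """Mix value vectors: sum over j of weights[j] * values[j]."""
    dim = len(values[0])
    return [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]

weights = [0.1, 0.6, 0.3]   # token 3's attention weights
values = [[1.0, 0.0],       # V1
          [0.0, 1.0],       # V2
          [0.5, 0.5]]       # V3

output_3 = weighted_sum(weights, values)
print([round(x, 2) for x in output_3])  # [0.25, 0.75]
```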

What the Output "Contains"

After the weighted sum, each token's output vector is a mix of value vectors from the positions it attended to. What does this mix actually represent?

Consider token 3 with weights [0.1, 0.6, 0.3]. Its output is 10% V₁ + 60% V₂ + 30% V₃. If V₂ encodes "subject of the sentence," then token 3's output carries mostly that information. The attention mechanism has routed the subject information to token 3's representation — even though token 3 might be a verb or preposition. This is how context gets built: tokens absorb information from the positions that matter.

Uniform vs Peaked Attention

The shape of the attention weights determines how information is routed:

Peaked: weights like [0.01, 0.95, 0.04]. The token focuses almost entirely on one position, so the output is essentially a copy of that position's V vector. This happens when one position is overwhelmingly relevant (e.g., a pronoun attending to its antecedent).
Uniform: weights like [0.33, 0.33, 0.34]. The token spreads attention evenly, so the output is close to an average of all V vectors. This can happen when a head has not learned a fine-grained attention pattern, or when all positions are equally relevant.
Sparse: weights like [0.0, 0.0, 0.5, 0.0, 0.5]. The token focuses on a few specific positions. This pattern is common in trained models, where a head often learns to attend to specific grammatical or semantic relationships.
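One way to make these three shapes concrete is to compare the outputs they produce and to measure how spread out each weight vector is. The sketch below reuses the toy value vectors from earlier; using Shannon entropy as the spread measure is an illustrative choice here, not something the attention mechanism itself computes.

```python
import math

def weighted_sum(weights, values):
    """Mix value vectors: sum over j of weights[j] * values[j]."""
    dim = len(values[0])
    return [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]

def entropy(weights):
    """Shannon entropy in nats: 0 for one-hot weights, log(n) for uniform."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)

values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # V1, V2, V3 from the toy example
patterns = {
    "peaked": [0.01, 0.95, 0.04],
    "uniform": [0.33, 0.33, 0.34],
}

for name, w in patterns.items():
    out = weighted_sum(w, values)
    print(name, [round(x, 2) for x in out], "entropy:", round(entropy(w), 3))

# The sparse pattern spans five positions, so only its spread is reported.
sparse = [0.0, 0.0, 0.5, 0.0, 0.5]
print("sparse entropy:", round(entropy(sparse), 3))
```

The peaked weights return almost exactly V₂ ([0.03, 0.97]), while the uniform weights return roughly the mean of the three vectors. Entropy rises from peaked to sparse to uniform.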

Shapes

Attention weights per head: [n_tokens, n_tokens] (from softmax)

V per head: [n_tokens, d_head]

Weighted sum per head: weights × V → [n_tokens, d_head]

Concatenate all n_heads outputs: [n_tokens, n_heads × d_head]

Output projection W_O: [n_heads × d_head, d_model]

Final attention output: [n_tokens, d_model]
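The shape bookkeeping above can be traced with a small sketch. All sizes, per-head values, and the W_O entries below are hypothetical placeholders, chosen only so the shapes are easy to follow.

```python
def matmul(a, b):
    """Multiply matrices stored as lists of rows: [n, k] x [k, m] -> [n, m]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def shape(m):
    return (len(m), len(m[0]))

# Hypothetical sizes for illustration.
n_tokens, n_heads, d_head, d_model = 3, 2, 2, 4

# Per-head weighted sums (weights x V), each [n_tokens, d_head].
head_outputs = [
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],   # head 0
    [[2.0, 2.0], [0.0, 0.0], [1.0, 3.0]],   # head 1
]

# Concatenate along the feature axis: [n_tokens, n_heads * d_head].
concat = [sum((head[t] for head in head_outputs), []) for t in range(n_tokens)]

# W_O: [n_heads * d_head, d_model], filled with arbitrary placeholder values.
w_o = [[0.1 * (r + c) for c in range(d_model)] for r in range(n_heads * d_head)]

final = matmul(concat, w_o)
print(shape(head_outputs[0]))  # (3, 2)
print(shape(concat))           # (3, 4)
print(shape(w_o))              # (4, 4)
print(shape(final))            # (3, 4)
```

Each row of W_O corresponds to one feature of one head, so every output dimension can draw on all heads at once.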

Math

attn_out_i = softmax(Q_i ⋅ K_iᵀ / √d_head) ⋅ V_i // per head i: [n_tokens, d_head]

output = concat(attn_out_1, …, attn_out_n_heads) ⋅ W_O // project back to d_model: [n_tokens, d_model]

Implementation Hook

In llama.cpp, build_attn() in src/llama-graph.cpp computes the weighted sum of V and applies the output projection (W_O). The result is then added to the residual stream. This is the final step of the attention sub-layer.

src/llama-graph.cpp — build_attn()

Performance Hook

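The weights × V multiply is [n_tokens, n_tokens] × [n_tokens, d_head], costing n_tokens² × d_head multiply-adds per head. The output projection W_O is an additional matrix multiply, [n_tokens, n_heads × d_head] × [n_heads × d_head, d_model], costing n_tokens × n_heads × d_head × d_model multiply-adds. The value mixing grows quadratically with sequence length while the projection grows linearly, so the mixing step dominates at long contexts.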
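A rough operation count makes the scaling visible. The function below is an illustrative sketch that counts multiply-adds only, and the model sizes passed to it are assumed values, not taken from any specific model.

```python
def attention_output_cost(n_tokens, n_heads, d_head, d_model):
    """Multiply-add counts for the weights x V step and the W_O projection."""
    # weights x V: [n_tokens, n_tokens] x [n_tokens, d_head], once per head.
    value_mixing = n_heads * n_tokens * n_tokens * d_head
    # W_O: [n_tokens, n_heads * d_head] x [n_heads * d_head, d_model].
    output_projection = n_tokens * (n_heads * d_head) * d_model
    return {"value_mixing": value_mixing, "output_projection": output_projection}

# Assumed sizes for illustration: 32 heads of width 128, d_model = 4096.
for n in (128, 4096, 32768):
    cost = attention_output_cost(n, n_heads=32, d_head=128, d_model=4096)
    ratio = cost["value_mixing"] / cost["output_projection"]
    print(n, cost, "mixing / projection =", ratio)
```

When n_heads × d_head equals d_model, the ratio of the two costs is n_tokens / d_model, so the two steps cost the same at n_tokens = d_model and value mixing dominates beyond that.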

Check Yourself

Q1 (conceptual)

What is the difference between the score computation and the value mixing in attention?

They are the same operation: Q ⋅ Kᵀ produces the output directly
Scores (Q ⋅ Kᵀ) determine WHERE to attend; value mixing (weights × V) determines WHAT information is gathered
Score computation uses V; value mixing uses Q and K

Q2 (math)

If a token has attention weights [0.0, 0.0, 1.0] over 3 positions, what is its attention output?

An average of all three V vectors
Exactly V₃, the value vector from position 3
Zero: the first two weights are zero, so nothing passes through