Attention Output Is a Token-Mixing Operation
The attention output for each token is a weighted mix of value vectors from the positions it attended to. The weights came from the Q-K score computation. The values carry the actual content.
This is where cross-token information actually moves. The score computation (Q ⋅ Kᵀ) decided where to look. The output computation (weights × V) determines what information arrives. These are two distinct steps — and it is important not to conflate them.
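The separation between the two steps can be sketched in a few lines of plain Python. This is a minimal illustration for a single query token and a single head, not a reference implementation: the toy vectors `q`, `K`, and `V` are made up, and the division by sqrt(d_head) is the standard scaled dot-product convention, assumed here rather than taken from the text above.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical toy data: one query and three key/value pairs, d_head = 2.
q = [1.0, 0.0]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

# Step 1: where to look. Only q and K are involved.
d_head = len(q)
scores = [dot(q, k) / math.sqrt(d_head) for k in K]
weights = softmax(scores)

# Step 2: what arrives. Only the weights and V are involved.
output = [sum(w * v[i] for w, v in zip(weights, V)) for i in range(len(V[0]))]
```

Swapping in a different `V` changes `output` but leaves `weights` untouched, which is the sense in which the two steps are distinct.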
After the weighted sum of V, the result passes through one more linear projection: the output projection W_O. Why is this needed? Each head's weighted sum lives in d_head-dimensional space, and concatenating h heads gives h×d_head dimensions. The residual stream expects d_model dimensions. W_O maps the concatenated multi-head output back to d_model (in many architectures h×d_head already equals d_model, so the sizes match and W_O's main job is mixing), and in doing so it learns how to combine the information from different heads. Without W_O, the heads' outputs would just be stacked side by side, each confined to its own slice of the vector, rather than integrated.
The projected result is what gets added to the residual stream. This is the complete attention sub-layer: project to Q/K/V → compute scores → mask → softmax → weighted V sum → project with W_O → add residual.
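The last two stages of that pipeline (project with W_O, add residual) can be sketched for a single token as follows. All numbers are hypothetical, the helper `matvec` is introduced only for this illustration, and W_O is stored as [d_model, h×d_head] acting on a column vector, which is one possible layout and not necessarily the one a given library uses.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Hypothetical per-head outputs for one token: h = 2 heads, d_head = 2.
head_outputs = [[1.0, 2.0], [3.0, 4.0]]

# Concatenate the heads into one vector of length h * d_head = 4.
concat = [x for head in head_outputs for x in head]

# Hypothetical W_O with shape [d_model, h * d_head] = [4, 4].
# Rows 0 and 3 read from both heads, so those output dimensions blend them.
W_O = [
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.5, 0.0, 0.5],
]

projected = matvec(W_O, concat)

# Residual connection: add the projected result to the token's incoming vector.
residual_in = [0.1, 0.2, 0.3, 0.4]
residual_out = [r + p for r, p in zip(residual_in, projected)]
```

Here `projected[0]` averages the first dimension of head 1 with the first dimension of head 2, a small example of the cross-head mixing that plain concatenation cannot do.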
Token 3 has attention weights [0.1, 0.6, 0.3] over 3 positions:
output₃ = 0.1·V₁ + 0.6·V₂ + 0.3·V₃
Token 3's output is dominated by V₂ (weight 0.6) — it pulled most of its information from position 2.
After the weighted sum, each token's output vector is a mix of value vectors from the positions it attended to. What does this mix actually represent?
Consider token 3 with weights [0.1, 0.6, 0.3]. Its output is 10% V₁ + 60% V₂ + 30% V₃. If V₂ encodes "subject of the sentence," then token 3's output carries mostly that information. The attention mechanism has routed the subject information to token 3's representation, even though token 3 might be a verb or preposition. This is how context gets built: tokens absorb information from the positions that matter.
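To make the 10% / 60% / 30% mix concrete, here is a small sketch with made-up two-dimensional value vectors. The weights come from the example above; the contents of `V` and the helper name `weighted_sum` are assumptions for illustration only.

```python
def weighted_sum(weights, values):
    """Return sum_i weights[i] * values[i], computed dimension by dimension."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

weights = [0.1, 0.6, 0.3]

# Hypothetical value vectors for positions 1, 2 and 3.
V = [
    [1.0, 0.0],  # V1
    [0.0, 1.0],  # V2
    [1.0, 1.0],  # V3
]

output = weighted_sum(weights, V)  # approximately [0.4, 0.9]
```

The second component is large because V₂, the vector with the highest weight, contributes most of it.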
The shape of the attention weight distribution determines how information is routed:
- Peaked: weights like [0.01, 0.95, 0.04] — the token focuses almost entirely on one position. The output is essentially a copy of that position's V vector. This happens when one position is overwhelmingly relevant (e.g., a pronoun attending to its antecedent).
- Uniform: weights like [0.33, 0.33, 0.34]. The token spreads attention evenly, so the output is close to a plain average of all V vectors. This can happen early in training, before the model has learned fine-grained attention patterns, or when all positions are equally relevant.
- Sparse: weights like [0.0, 0.0, 0.5, 0.0, 0.5]. The token focuses on a few specific positions. This is a common pattern in well-trained models, where individual heads learn to attend to specific grammatical or semantic relationships.
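The three patterns above can be compared side by side with a short sketch. The weight lists are the ones from the bullets; the value vectors in `V` and `V5` are hypothetical, chosen only so the results are easy to read.

```python
def weighted_sum(weights, values):
    """Return sum_i weights[i] * values[i], computed dimension by dimension."""
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

# Hypothetical value vectors for three positions.
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Peaked: the result lands close to V[1], the position with weight 0.95.
peaked = weighted_sum([0.01, 0.95, 0.04], V)

# Uniform: the result lands close to the mean of all rows of V.
uniform = weighted_sum([0.33, 0.33, 0.34], V)
mean_v = [sum(v[i] for v in V) / len(V) for i in range(2)]

# Sparse: five positions, only two of which contribute.
V5 = V + [[2.0, 0.0], [0.0, 2.0]]
sparse = weighted_sum([0.0, 0.0, 0.5, 0.0, 0.5], V5)  # midpoint of V5[2] and V5[4]
```

In the sparse case the rows with zero weight have no effect at all on the output, however large their values are.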
In llama.cpp, build_attn() in src/llama-graph.cpp computes the weighted sum of V and applies the output projection (W_O). The result is then added to the residual stream. This is the final step of the attention sub-layer.
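The weights × V multiply is [n_tokens, n_tokens] × [n_tokens, d_head] per head, costing n_tokens² × d_head multiply-adds per head, or n_tokens² × h × d_head across all h heads. The output projection W_O is an additional [n_tokens, h×d_head] × [h×d_head, d_model] matrix multiply, costing n_tokens × h × d_head × d_model multiply-adds. The first term grows quadratically with sequence length and the second only linearly, so weights × V dominates for long sequences while W_O dominates for short ones.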
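A rough way to see how the two costs compare is to count multiply-adds directly. The function below is a back-of-the-envelope sketch, assuming the whole sequence is processed at once and ignoring masking and memory traffic; the function name and the configuration values are hypothetical.

```python
def attention_output_cost(n_tokens, n_heads, d_head, d_model):
    """Multiply-add counts for the weights x V step and the W_O projection."""
    # Per head: [n_tokens, n_tokens] x [n_tokens, d_head].
    weights_times_v = n_heads * n_tokens * n_tokens * d_head
    # [n_tokens, n_heads * d_head] x [n_heads * d_head, d_model].
    output_projection = n_tokens * (n_heads * d_head) * d_model
    return weights_times_v, output_projection

# Hypothetical configuration with n_heads * d_head == d_model.
wv, wo = attention_output_cost(n_tokens=2048, n_heads=32, d_head=128, d_model=4096)

# The same configuration with a sequence twice as long.
wv_long, wo_long = attention_output_cost(n_tokens=4096, n_heads=32, d_head=128, d_model=4096)
```

With these numbers the projection costs twice as much as weights × V at 2048 tokens, and the two are equal at 4096 tokens, the point where n_tokens matches d_model.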
What is the difference between the score computation and the value mixing in attention?
If a token has attention weights [0.0, 0.0, 1.0] over 3 positions, what is its attention output?