Causal Attention Lets Tokens Read the Past
Each token's query asks: "which earlier tokens are relevant to me?" The answer comes from comparing the query against every key. The matrix product Q ⋅ Kᵀ computes the dot product of every query with every key, giving one score for each pair of positions: a matrix of "how much should I attend here?"
But in language generation, a token must not look at future tokens — that would be cheating. A causal mask sets all future-position scores to negative infinity before softmax, guaranteeing they get zero attention weight.
After masking, softmax converts the remaining scores into weights that sum to 1. Each token then computes a weighted sum of the value vectors, pulling information from the positions it attended to. This is the core mechanism for cross-token communication.
The full sequence: score → mask → softmax → weighted sum of V.
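The four steps can be sketched in plain Python. This is a minimal single-head illustration under stated assumptions, not how any production library implements attention: the names `softmax` and `causal_attention` are hypothetical, Q, K, and V are lists of row vectors, and the scores are divided by √d_head as discussed later in this lesson.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    # Subtract the row maximum for numerical stability; exp(-inf) evaluates to 0.0.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def causal_attention(Q, K, V):
    """Single-head causal attention. Q, K, V are lists of row vectors."""
    n, d_head = len(Q), len(Q[0])
    scale = math.sqrt(d_head)
    # 1. score: scaled dot product of every query with every key
    scores = [[sum(q * k for q, k in zip(Q[i], K[j])) / scale
               for j in range(n)] for i in range(n)]
    # 2. mask: future positions (j > i) are set to -inf
    masked = [[s if j <= i else NEG_INF for j, s in enumerate(row)]
              for i, row in enumerate(scores)]
    # 3. softmax: each row becomes weights that sum to 1
    weights = [softmax(row) for row in masked]
    # 4. weighted sum of value vectors
    d_v = len(V[0])
    out = [[sum(weights[i][j] * V[j][c] for j in range(n))
            for c in range(d_v)] for i in range(n)]
    return weights, out

# Toy input: 3 tokens, d_head = 2 (values chosen only for illustration)
X = [[1, 0], [0, 1], [1, 1]]
weights, out = causal_attention(X, X, X)
for row in weights:
    print([round(w, 3) for w in row])
```

Every entry above the diagonal of `weights` is exactly zero, and the first token's output equals its own value vector because it has nothing earlier to attend to.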
Watch how scores are computed, masked, and converted to attention weights. See which values get mixed for each token.
3 tokens: scaled scores, masking, and softmax.

Scaled scores after the causal mask (rows are queries, columns are keys):

| | The | cat | sat |
|---|---|---|---|
| The | 1.10 | -∞ | -∞ |
| cat | 0.75 | 1.47 | -∞ |
| sat | 1.08 | 1.39 | 1.28 |

Attention weights after softmax:

| | The | cat | sat |
|---|---|---|---|
| The | 1.000 | 0.000 | 0.000 |
| cat | 0.327 | 0.673 | 0.000 |
| sat | 0.279 | 0.380 | 0.341 |
Token 1 can only attend to itself. Token 2 attends mostly to itself (0.673). Token 3 spreads attention across all three positions.
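The weights in the table can be checked by applying a row-wise softmax to the masked scores. The sketch below copies the scores from the table; the helper name `softmax` is an assumption made for illustration.

```python
import math

NEG_INF = float("-inf")

def softmax(row):
    # exp(-inf) is 0.0, so masked positions get exactly zero weight.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Masked, scaled scores from the table (rows: The, cat, sat)
masked_scores = [
    [1.10, NEG_INF, NEG_INF],
    [0.75, 1.47, NEG_INF],
    [1.08, 1.39, 1.28],
]

table_weights = [softmax(row) for row in masked_scores]
for token, row in zip(["The", "cat", "sat"], table_weights):
    print(token, [round(w, 3) for w in row])
```

Rounded to three decimals, the rows agree with the attention weight table.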
After softmax, each row of the attention weight matrix tells one token how much to "listen to" each earlier token. But what is it listening for? The answer is the value vectors.
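For each token position i, the output is:

$$\text{output}_i = \sum_{j \le i} w_{ij} \, V_j$$

where $w_{ij}$ is the attention weight in row i, column j of the softmax output and $V_j$ is the value vector at position j.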
This is a weighted average of value vectors. If token 3 has attention weights [0.12, 0.48, 0.40] over positions 1, 2, 3, it gets 12% of position 1's value, 48% of position 2's value, and 40% of its own value. The output is a blend — a new vector that mixes information from the attended positions in the proportions the scores determined.
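To make the blend concrete, here is a small sketch using the weights from the example above. The value vectors in `V` are hypothetical numbers chosen only so the arithmetic is easy to follow.

```python
# Attention weights of token 3 over positions 1, 2, 3 (from the example above)
attn = [0.12, 0.48, 0.40]

# Hypothetical 2-dimensional value vectors, one per position
V = [
    [1.0, 0.0],  # position 1
    [0.0, 1.0],  # position 2
    [2.0, 2.0],  # position 3 (the token itself)
]

# Weighted average: each output component mixes the same component of every V[j]
blended = [sum(w * v[c] for w, v in zip(attn, V)) for c in range(len(V[0]))]
print([round(x, 2) for x in blended])
```

The first component is 0.12·1.0 + 0.48·0.0 + 0.40·2.0 = 0.92 and the second is 0.12·0.0 + 0.48·1.0 + 0.40·2.0 = 1.28.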
This is the fundamental mechanism by which tokens exchange information. The query and key together decide where to look (via scores), and the values carry the information that flows. Changing W_Q and W_K changes which positions are attended to. Changing W_V changes what information is extracted from those positions.
The dot product Q[i] · K[j] sums d_head terms. If each term has variance ~1, the sum has variance ~d_head. With d_head = 128, the raw scores can easily reach magnitudes of 10-20, which pushes softmax into saturation — one position gets nearly all the weight and the rest get nearly zero. The model loses its ability to attend to multiple positions.
Dividing by √d_head normalizes the variance back to ~1, keeping scores in a range where softmax produces a useful (not-too-sharp) distribution. This is not a learned parameter — it is a fixed constant derived from the dimensionality. You saw a related effect in L07: doubling Q magnitudes doubles the dot product. Scaling by √d_head counteracts this dimensional growth.
The √d_head scaling prevents dot products from growing too large as the dimension increases, which would push softmax into saturation.
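The variance argument can be checked empirically. The sketch below assumes the components of Q and K are independent standard normal values, which is an idealization rather than a property of any trained model; the function name `dot_product_variances` and the sample count are hypothetical.

```python
import math
import random
import statistics

def dot_product_variances(d_head, n_samples=2000, seed=0):
    """Estimate the variance of raw and scaled dot products of random vectors."""
    rng = random.Random(seed)
    raw = []
    for _ in range(n_samples):
        q = [rng.gauss(0.0, 1.0) for _ in range(d_head)]
        k = [rng.gauss(0.0, 1.0) for _ in range(d_head)]
        raw.append(sum(a * b for a, b in zip(q, k)))
    # Dividing by sqrt(d_head) divides the variance by d_head
    scaled = [s / math.sqrt(d_head) for s in raw]
    return statistics.variance(raw), statistics.variance(scaled)

raw_var, scaled_var = dot_product_variances(d_head=128)
print(f"raw variance    ~ {raw_var:.1f}")     # expected to be near d_head
print(f"scaled variance ~ {scaled_var:.2f}")  # expected to be near 1
```

A variance near 128 corresponds to a standard deviation of about 11, which is consistent with the raw score magnitudes of 10-20 mentioned above.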
In llama.cpp, each model builder (e.g. src/models/gemma.cpp) projects Q, K, V and applies RoPE. Then build_attn() in src/llama-graph.cpp takes the already-rotated Q and K and handles the rest: score computation, causal masking, softmax, weighted V sum, and output projection.
The Q ⋅ Kᵀ computation produces an [n_tokens, n_tokens] matrix — cost grows quadratically with sequence length. This quadratic scaling is why long-context inference is expensive. Later modules will cover techniques that reduce or avoid materializing this full matrix.
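A back-of-the-envelope calculation shows how quickly the score matrix grows. The sketch assumes one attention head, one layer, and 4-byte (fp32) scores, and that the full matrix is actually stored; `score_matrix_bytes` is a hypothetical helper, and real implementations may never materialize the matrix in this form.

```python
def score_matrix_bytes(n_tokens, bytes_per_score=4):
    """Memory for one full [n_tokens, n_tokens] score matrix (assumes fp32)."""
    return n_tokens * n_tokens * bytes_per_score

for n in (1024, 2048, 32768):
    mib = score_matrix_bytes(n) / 2**20
    print(f"{n:>6} tokens -> {mib:,.0f} MiB")
```

Doubling the sequence length quadruples the size: 1,024 tokens need 4 MiB under these assumptions, while 32,768 tokens need 4,096 MiB per head.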
In a 5-token sequence (positions 0-4), the token at position 3 has pre-mask scores [2.1, 0.5, 1.8, 3.0, 0.9] over all 5 positions. After the causal mask, which positions can have non-zero attention weight?
Given Q = [[1,0],[0,1],[1,1]], K = [[1,0],[0,1],[1,1]] (3 tokens, d_head=2), what is the raw score for token 3 attending to token 1 (before scaling and masking)?