L17

Causal Attention Lets Tokens Read the Past

18 min

Question

How does a token read information from earlier tokens?

Intuition

Each token's query asks: "which earlier tokens are relevant to me?" The answer comes from comparing the query against every key. The dot product Q ⋅ Kᵀ produces a score for every pair of positions — a matrix of "how much should I attend here?"

But in language generation, a token must not look at future tokens: at inference time they do not exist yet, and during training they would leak the very tokens the model is asked to predict. A causal mask sets all future-position scores to negative infinity before softmax, so they receive exactly zero attention weight.

After masking, softmax converts the remaining scores into weights that sum to 1. Each token then computes a weighted sum of the value vectors, pulling information from the positions it attended to. This is the core mechanism for cross-token communication.

The full sequence: score → mask → softmax → weighted sum of V.

Interactive — Attention Simulator

Watch how scores are computed, masked, and converted to attention weights. See which values get mixed for each token.

Scores (Q · K^T)

	The	cat	sat
The	1.10	-∞	-∞
cat	0.75	1.47	-∞
sat	1.08	1.39	1.28

softmax →

Attention Weights

	The	cat	sat
The	1.000	0.000	0.000
cat	0.327	0.673	0.000
sat	0.279	0.380	0.341

Output for "sat" (weighted V sum)

0.279 × [1, 0] + 0.380 × [0, 1] + 0.341 × [0.5, 0.5] = [0.449, 0.551]

Toy Example

3 tokens — scaled scores, masking, and softmax:

Q ⋅ Kᵀ / √d_head scores (after scaling):

2.0  1.5  0.8
1.0  3.0  0.5
0.7  2.1  1.9

after causal mask (future = -∞):

2.0   -∞   -∞
1.0  3.0   -∞
0.7  2.1  1.9

after softmax (rows sum to 1):

1.00  0.00  0.00
0.12  0.88  0.00
0.12  0.48  0.40

Token 1 can only attend to itself. Token 2 attends mostly to itself (0.88). Token 3 spreads attention across all three positions, weighting position 2 (0.48) and itself (0.40) well above position 1 (0.12).
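The masking and softmax steps of the toy example can be reproduced in a few lines. This is a minimal sketch in plain Python, not how any particular library implements it; the helper names `causal_mask` and `softmax` are made up for this illustration.

```python
import math

def causal_mask(scores):
    """Copy a square score matrix, setting future positions (j > i) to -inf."""
    n = len(scores)
    return [[scores[i][j] if j <= i else -math.inf for j in range(n)]
            for i in range(n)]

def softmax(row):
    """Numerically stable softmax over one row; exp(-inf) evaluates to 0.0."""
    m = max(row)  # the diagonal is never masked, so m is always finite
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

# Scaled scores from the toy example (rows = attending token).
scaled_scores = [
    [2.0, 1.5, 0.8],
    [1.0, 3.0, 0.5],
    [0.7, 2.1, 1.9],
]

masked = causal_mask(scaled_scores)
weights = [softmax(row) for row in masked]

for row in weights:
    print([round(w, 2) for w in row])
```

The printed rows should agree with the softmax table above up to rounding. Masked entries come out as exactly zero, not merely small, because the exponential of negative infinity is zero.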

What the Weighted Sum Actually Computes

After softmax, each row of the attention weight matrix tells one token how much to "listen to" each earlier token. But what is it listening for? The answer is the value vectors.

For each token position i, the output is:

output[i] = w[i,0] × V[0] + w[i,1] × V[1] + ... + w[i,i] × V[i]

This is a weighted average of value vectors. If token 3 (index 2 in the zero-based formula above) has attention weights [0.12, 0.48, 0.40] over positions 1, 2, 3, it gets 12% of position 1's value, 48% of position 2's value, and 40% of its own value. The output is a blend: a new vector that mixes information from the attended positions in the proportions the softmax weights determined.

This is the fundamental mechanism by which tokens exchange information. Queries and keys together decide where to look (via scores), and the values carry the information that flows. Changing W_Q and W_K changes which positions are attended. Changing W_V changes what information is extracted from those positions.
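To make the blend concrete, the sketch below applies the toy attention weights to a small set of value vectors. The value vectors are borrowed from the simulator snapshot; pairing them with the toy weights is an assumption made purely for illustration, and `weighted_value_sum` is a hypothetical helper name.

```python
def weighted_value_sum(weights_row, values):
    """Blend value vectors: sum over j of weights_row[j] * values[j]."""
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights_row, values))
            for d in range(dim)]

# Attention weights from the toy example (rows = attending token).
weights = [
    [1.00, 0.00, 0.00],
    [0.12, 0.88, 0.00],
    [0.12, 0.48, 0.40],
]

# Assumed value vectors, one per token (taken from the simulator snapshot).
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

outputs = [weighted_value_sum(row, V) for row in weights]
for i, out in enumerate(outputs, start=1):
    print(f"token {i}:", [round(x, 2) for x in out])
```

Token 1 has nothing earlier to read, so its output is simply its own value vector. Later tokens land somewhere between the value vectors they attend to, because the weights are non-negative and sum to 1.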

Why Scale by √d_head?

The dot product Q[i] · K[j] sums d_head terms. If each term has variance ~1, the sum has variance ~d_head and therefore standard deviation ~√d_head. With d_head = 128 that is √128 ≈ 11.3, so raw scores can easily reach magnitudes of 10-20, which pushes softmax into saturation: one position gets nearly all the weight and the rest get nearly zero. The model loses its ability to attend to multiple positions.

Dividing by √d_head normalizes the variance back to ~1, keeping scores in a range where softmax produces a useful (not-too-sharp) distribution. This is not a learned parameter; it is a fixed constant derived from the dimensionality. You saw a related effect in L07: doubling Q magnitudes doubles the dot product. Dividing by √d_head counteracts the analogous growth that comes from adding dimensions.
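The variance argument can be checked empirically. The sketch below assumes query and key components are independent draws from a standard normal distribution, which is a simplification of real model activations; it measures the spread of raw and scaled dot products and compares how peaked softmax becomes in each case.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def std(xs):
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)  # fixed seed so the run is repeatable
d_head = 128
n_samples = 2000

# Assumption: every component of q and k is an independent N(0, 1) draw.
raw = []
for _ in range(n_samples):
    q = [random.gauss(0.0, 1.0) for _ in range(d_head)]
    k = [random.gauss(0.0, 1.0) for _ in range(d_head)]
    raw.append(dot(q, k))

scaled = [s / math.sqrt(d_head) for s in raw]

print("std of raw scores:   ", round(std(raw), 2))     # expected near sqrt(128)
print("std of scaled scores:", round(std(scaled), 2))  # expected near 1

# Softmax over the same 8 scores, before and after scaling.
print("max weight, raw:   ", round(max(softmax(raw[:8])), 3))
print("max weight, scaled:", round(max(softmax(scaled[:8])), 3))
```

Under these assumptions the raw scores should spread roughly √128 wide while the scaled scores stay near unit spread. The largest softmax weight over the raw scores is never smaller than over the scaled ones, since dividing all scores by a constant greater than 1 can only flatten the distribution.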

Shapes

Q: [n_tokens, d_head]

K: [n_tokens, d_head]

Scores Q ⋅ Kᵀ: [n_tokens, n_tokens]

After mask + softmax: [n_tokens, n_tokens] (lower-triangular, rows sum to 1)

V: [n_tokens, d_head]

Output = weights × V: [n_tokens, d_head]

Math

scores = Q ⋅ Kᵀ / √d_head

scores[i,j] = -∞ where j > i (causal mask)

weights = softmax(scores, dim=-1)

output = weights ⋅ V

The √d_head scaling prevents dot products from growing too large as the dimension increases, which would push softmax into saturation.
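The four lines of math translate almost directly into code. What follows is a minimal single-head sketch in plain Python over lists of row vectors, written for readability rather than speed; the function name `causal_attention` and the input values are hypothetical, and it is not meant to mirror how llama.cpp or any other library structures the computation.

```python
import math

def causal_attention(Q, K, V):
    """Single-head causal attention on lists of row vectors.

    Q, K, V: [n_tokens, d_head].
    Returns (weights [n_tokens, n_tokens], output [n_tokens, d_head]).
    """
    n_tokens, d_head = len(Q), len(Q[0])
    scale = 1.0 / math.sqrt(d_head)
    weights, output = [], []
    for i in range(n_tokens):
        # scores[i][j] = Q[i] . K[j] / sqrt(d_head), with -inf where j > i
        row = [
            scale * sum(q * k for q, k in zip(Q[i], K[j])) if j <= i else -math.inf
            for j in range(n_tokens)
        ]
        # softmax over the row (dim=-1); masked entries become exactly 0.0
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        w = [e / total for e in exps]
        weights.append(w)
        # output[i] = sum over j of w[j] * V[j]
        output.append([sum(wj * v[d] for wj, v in zip(w, V))
                       for d in range(len(V[0]))])
    return weights, output

# Hypothetical inputs: 3 tokens, d_head = 2 (values chosen only for illustration).
Q = [[0.5, 1.0], [1.5, -0.5], [1.0, 1.0]]
K = [[1.0, 0.0], [0.5, 2.0], [-1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

weights, output = causal_attention(Q, K, V)
print(len(weights), len(weights[0]))  # n_tokens, n_tokens
print(len(output), len(output[0]))    # n_tokens, d_head
```

The returned shapes match the Shapes section: the weight matrix is square and lower-triangular with rows summing to 1, and the output has one d_head-sized vector per token.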

Implementation Hook

In llama.cpp, each model builder (e.g. src/models/gemma.cpp) projects Q, K, V and applies RoPE. Then build_attn() in src/llama-graph.cpp takes the already-rotated Q and K and handles the rest: score computation, causal masking, softmax, weighted V sum, and output projection.

src/llama-graph.cpp — build_attn() (L1990)

Performance Hook

The Q ⋅ Kᵀ computation produces an [n_tokens, n_tokens] matrix — cost grows quadratically with sequence length. This quadratic scaling is why long-context inference is expensive. Later modules will cover techniques that reduce or avoid materializing this full matrix.
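A quick back-of-the-envelope calculation shows how fast the score matrix grows. The sketch assumes each score is stored as a 32-bit float and counts a single head only; real implementations may use other data types or avoid storing the full matrix.

```python
def score_matrix_entries(n_tokens):
    """Number of entries in the [n_tokens, n_tokens] score matrix for one head."""
    return n_tokens * n_tokens

BYTES_PER_SCORE = 4  # assumption: 32-bit floats

for n in (1024, 2048, 4096):
    entries = score_matrix_entries(n)
    mib = entries * BYTES_PER_SCORE / (1024 * 1024)
    print(f"n_tokens={n:5d}  entries={entries:>10,d}  ~{mib:.0f} MiB per head")
```

Each doubling of the sequence length quadruples the number of entries, which is the quadratic scaling described above.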

Check Yourself

Q1 (reasoning)

In a 5-token sequence (positions 0-4), the token at position 3 has pre-mask scores [2.1, 0.5, 1.8, 3.0, 0.9] over all 5 positions. After the causal mask, which positions can have non-zero attention weight?

All 5: the mask only reduces scores, it does not zero them

Positions 0, 1, 2, and 3 only: position 4 is masked to -∞ and gets zero weight after softmax

Only position 3: a token can only attend to itself

Q2 (math)

Given Q = [[1,0],[0,1],[1,1]], K = [[1,0],[0,1],[1,1]] (3 tokens, d_head=2), what is the raw score for token 3 attending to token 1 (before scaling and masking)?

0

1

2