Attention Needs Positional Information

How does attention know token order?

Attention computes dot products between queries and keys. A dot product depends only on what the two vectors contain, not on where in the sequence they came from. If you reordered the tokens, the outputs would simply be reordered the same way: without extra information, the model cannot distinguish one ordering from another.
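To make this concrete, here is a minimal sketch of plain attention with no mask and no positional signal. The toy vectors and helper names are made up for illustration. Shuffling the input tokens only shuffles the output rows; each token still receives the same result.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention(Q, K, V):
    """Plain single-head attention: no mask, no positional information."""
    out = []
    for q in Q:
        weights = softmax([dot(q, k) for k in K])
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Hypothetical toy vectors; Q, K, and V reuse the same per-token vectors.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]                       # a reordering of the three tokens
shuffled = [tokens[i] for i in perm]

out = attention(tokens, tokens, tokens)
out_shuffled = attention(shuffled, shuffled, shuffled)

# Row i of the shuffled output belongs to original token perm[i].
same = all(
    math.isclose(a, b, abs_tol=1e-12)
    for i, p in enumerate(perm)
    for a, b in zip(out_shuffled[i], out[p])
)
print(same)
```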

But token order matters enormously for language. "The dog bit the man" and "The man bit the dog" have the same tokens but very different meanings. Attention needs to know position.

RoPE (Rotary Position Embeddings) solves this by rotating the Q and K vectors based on their position in the sequence. Each position gets a unique rotation, so the dot product between Q and K naturally encodes how far apart they are. Tokens that are close together have a different interaction pattern than tokens that are far apart.

Critically, RoPE is applied only to Q and K — not to V. Position affects where tokens attend (the score computation), not what information is passed along (the values).

Two tokens at positions 0 and 1, with 2D Q/K vectors:

before RoPE:
Q₀ = [1.0, 0.0]   Q₁ = [1.0, 0.0] (identical)
after RoPE (rotation by position × θ, here θ = 1 radian):
Q₀ = [1.0, 0.0]   (position 0: no rotation)
Q₁ = [0.54, 0.84]   (position 1: rotated by θ)

Same original vector, but now distinguishable by position. The dot product between Q and K encodes relative distance.
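The numbers above can be reproduced with a few lines of Python. This is a sketch that assumes θ = 1 radian, which is the value that yields [0.54, 0.84]; the function name rotate_2d is made up for illustration.

```python
import math

def rotate_2d(vec, position, theta=1.0):
    """Rotate a 2D vector counter-clockwise by position * theta radians."""
    angle = position * theta
    x, y = vec
    return [x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle)]

q = [1.0, 0.0]                      # the same vector at both positions
q0 = rotate_2d(q, position=0)
q1 = rotate_2d(q, position=1)

print([round(v, 2) for v in q0])    # [1.0, 0.0]
print([round(v, 2) for v in q1])    # [0.54, 0.84]
```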

Q before RoPE: [n_tokens, d_head]
Q after RoPE: [n_tokens, d_head]   (same shape, rotated values)
K before RoPE: [n_tokens, d_head]
K after RoPE: [n_tokens, d_head]   (same shape, rotated values)
V is unchanged — RoPE only touches Q and K.
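A small sketch of where RoPE sits relative to Q, K, and V, using made-up 2D tensors (d_head = 2) and θ = 1 as assumptions. The shapes are unchanged, each vector keeps its length, and V never goes through the rotation.

```python
import math

def rope(x, theta=1.0):
    """Rotate each row of x ([n_tokens, 2]) by its row index times theta."""
    out = []
    for m, (a, b) in enumerate(x):
        c, s = math.cos(m * theta), math.sin(m * theta)
        out.append([a * c - b * s, a * s + b * c])
    return out

def shape(x):
    return (len(x), len(x[0]))

# Hypothetical projected tensors for 3 tokens with d_head = 2.
Q = [[1.0, 0.0], [1.0, 0.0], [0.5, 0.5]]
K = [[0.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
V = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

Q_rot, K_rot = rope(Q), rope(K)     # V is deliberately not rotated

print(shape(Q), shape(Q_rot))       # same shape before and after
print(Q[0] == Q[1], Q_rot[0] == Q_rot[1])  # identical before, distinct after
```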

RoPE's cleverness lies in one mathematical property: when you rotate Q at position m and K at position n by position-dependent angles, their dot product depends on position only through the relative offset (m - n), not on the absolute positions. For the same Q and K contents, the attention score between "word A at position 5" and "word B at position 3" is the same as between "word A at position 105" and "word B at position 103": an offset of 2 produces the same rotational effect regardless of where in the sequence you are.
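The positions 5 and 3 versus 105 and 103 can be checked numerically. This is a sketch with made-up 2D vectors and an arbitrary θ = 0.1; only the offset between the two positions affects the score.

```python
import math

def rotate_2d(vec, position, theta=0.1):
    angle = position * theta
    x, y = vec
    return [x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle)]

def score(q, k, m, n):
    """Dot product of q rotated for position m and k rotated for position n."""
    q_rot = rotate_2d(q, m)
    k_rot = rotate_2d(k, n)
    return q_rot[0] * k_rot[0] + q_rot[1] * k_rot[1]

# Hypothetical query and key contents.
q = [0.3, -1.2]
k = [0.8, 0.5]

near = score(q, k, 5, 3)       # offset 2, early in the sequence
far = score(q, k, 105, 103)    # offset 2, much later
other = score(q, k, 5, 4)      # offset 1

print(math.isclose(near, far, rel_tol=1e-9))     # same offset, same score
print(math.isclose(near, other, rel_tol=1e-3))   # different offset, different score
```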

This relative-position property is extremely valuable for language. Whether "the cat" appears at the beginning or middle of a sentence, the model should attend to "cat" from "the" in the same way. Absolute position embeddings (an older technique) cannot guarantee this; RoPE provides it by construction.

RoPE processes the d_head dimensions in pairs. Dimensions (0, 1) form one pair, (2, 3) form another, and so on. Each pair is rotated by an angle that depends on both the token position and a base frequency for that pair:

  • Low-index pairs (dimensions 0-1, 2-3) use high-frequency rotations: they encode fine-grained, nearby positional differences.
  • High-index pairs (dimensions 126-127 when d_head = 128) use very low-frequency rotations: they encode long-range positional structure.

This multi-frequency structure is analogous to how different digits of a number encode different scales: the ones digit changes fast, the tens digit changes slower, and the hundreds digit changes slowest. The model can read position at multiple granularities simultaneously.
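To see the range of speeds, the sketch below computes one frequency per pair. It assumes the common schedule θᵢ = base^(-2i/d_head) with base = 10000 and d_head = 128; the actual values depend on the model. The "wavelength" is how many positions it takes a pair to complete one full turn.

```python
import math

def rope_frequencies(d_head=128, base=10000.0):
    """One rotation frequency per dimension pair (assumed schedule)."""
    return [base ** (-2.0 * i / d_head) for i in range(d_head // 2)]

freqs = rope_frequencies()

for i in (0, 1, 63):
    wavelength = 2 * math.pi / freqs[i]   # positions per full rotation
    print(f"pair {i} (dims {2 * i}-{2 * i + 1}): "
          f"theta = {freqs[i]:.6f}, wavelength ~ {wavelength:.1f} positions")
```

The first pair turns quickly, like the ones digit; the last pair needs tens of thousands of positions for a single turn.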

For each dimension pair (2i, 2i+1) at position m:

q'₂ᵢ = q₂ᵢ ⋅ cos(mθᵢ) - q₂ᵢ₊₁ ⋅ sin(mθᵢ)
q'₂ᵢ₊₁ = q₂ᵢ ⋅ sin(mθᵢ) + q₂ᵢ₊₁ ⋅ cos(mθᵢ)
m = token position, θᵢ = base frequency for dimension pair i (commonly θᵢ = base^(-2i/d_head), with base = 10000 in the original RoPE formulation)
The same rotation is applied to corresponding K dimensions.
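The two formulas translate directly into code. This is a minimal sketch, not llama.cpp's implementation: it assumes adjacent pairs (2i, 2i+1) as described above and the frequency schedule θᵢ = base^(-2i/d_head) with base = 10000. The name apply_rope is made up for illustration.

```python
import math

def apply_rope(vec, position, base=10000.0):
    """Rotate each adjacent pair (2i, 2i+1) of vec by position * theta_i."""
    d_head = len(vec)
    assert d_head % 2 == 0, "d_head must be even so dimensions can be paired"
    out = [0.0] * d_head
    for i in range(d_head // 2):
        theta_i = base ** (-2.0 * i / d_head)   # assumed frequency schedule
        c = math.cos(position * theta_i)
        s = math.sin(position * theta_i)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out

q = [1.0, 0.0, 1.0, 0.0]            # hypothetical query with d_head = 4
print(apply_rope(q, position=0))    # [1.0, 0.0, 1.0, 0.0]
print(apply_rope(q, position=1))    # first pair turns much more than the second
```

The same function would be called on each K vector with that token's position.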

In llama.cpp, RoPE is applied via the ggml_rope_ext operation. It runs on Q and K right after the Q/K projections and before the attention score computation. The base frequency and scaling parameters are read from the metadata stored in the model's GGUF file.

RoPE is an elementwise operation: each pair of values is combined with one sine and one cosine, so it is computationally cheap compared with the matrix multiplies for the Q/K/V projections. The rotation angles depend only on the position and the pair index, so they can be precomputed once and reused. RoPE also enables extending context length at inference time via frequency scaling, which is why many long-context models use it.
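A sketch of both ideas: a cos/sin table computed once, and the simplest form of frequency scaling (linear scaling, where every angle is multiplied by a constant). The function name, the freq_scale parameter as used here, and the base of 10000 are assumptions for illustration; real models may use other scaling schemes.

```python
import math

def build_rope_table(n_positions, d_head, base=10000.0, freq_scale=1.0):
    """Precompute (cos, sin) for every position and dimension pair.

    freq_scale < 1 shrinks every angle, so more positions fit into the
    range of angles the model saw during training (linear scaling).
    """
    table = []
    for m in range(n_positions):
        row = []
        for i in range(d_head // 2):
            angle = m * freq_scale * base ** (-2.0 * i / d_head)
            row.append((math.cos(angle), math.sin(angle)))
        table.append(row)
    return table

plain = build_rope_table(n_positions=8, d_head=4)
scaled = build_rope_table(n_positions=16, d_head=4, freq_scale=0.5)

# With freq_scale = 0.5, position 2m gets the angles position m had before.
matches = all(
    math.isclose(scaled[2 * m][i][j], plain[m][i][j], abs_tol=1e-12)
    for m in range(8) for i in range(2) for j in range(2)
)
print(matches)
```

In this sketch, 16 scaled positions cover the same angles as the original 8, which is the intuition behind stretching the context window.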

Check Yourself
Q1 (reasoning)

Token A is at position 10, token B at position 12. Later, the same tokens appear at positions 50 and 52. After RoPE, how do the Q_A·K_B scores compare between the two cases?

Q2 (reasoning)

A model uses RoPE. You swap two adjacent tokens in the input: "big red" becomes "red big". Which part of the attention computation changes, and which stays the same?