L16

Q, K, and V Are Just Learned Projections

15 min

Question

What are Q, K, and V?

Intuition

Before attention can run, the model needs three versions of the hidden states: queries (Q), keys (K), and values (V). These are not three different inputs — they all come from the same hidden states, projected through three different learned weight matrices.

Think of it this way: Q asks "what am I looking for?", K says "what do I contain?", and V says "what information should I pass along if selected?" The Q-K interaction determines where to attend. V determines what gets passed along.

The projections are simple linear transforms: matrix multiplies with learned weights (many modern models omit the additive bias term). The three weight matrices are the core learned parameters of attention.

Why Not Use the Hidden States Directly?

You might wonder: why not skip the projections and compute attention directly on the hidden states? Two reasons:

Separation of concerns. A token's hidden state encodes everything the model knows about that position. But for attention, you need different information for different roles: the "what am I looking for?" question (Q) is different from the "what do I contain?" answer (K), which is different from "what should I pass along?" (V). Three separate projections let the model learn to extract the right information for each role.
Dimension control. The hidden state has dimension d_model (e.g., 4096), but each attention head operates in a smaller d_head space (e.g., 128). The projections reduce the dimension, which makes the per-head dot products cheaper and lets different heads learn different subspaces.

In short: the projections are where the model learns what attention should pay attention to. Without them, every head in a layer would compute the same scores from the same hidden states, and there would be no learned specialization.
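One way to see what the projections buy is to compare attention scores computed with and without them. The sketch below is only an illustration: it uses NumPy, with random matrices standing in for learned weights and small made-up sizes. Without projections the score matrix is h × hᵀ, which is always symmetric, so token i scores token j exactly as j scores i. With separate W_Q and W_K that constraint disappears.

```python
import numpy as np

rng = np.random.default_rng(0)

# 4 tokens with d_model = 8 (sizes chosen only for illustration).
h = rng.normal(size=(4, 8))

# No projections: every token is both its own query and its own key.
raw_scores = h @ h.T

# With projections: random stand-ins for learned W_Q and W_K.
W_Q = rng.normal(size=(8, 4))
W_K = rng.normal(size=(8, 4))
proj_scores = (h @ W_Q) @ (h @ W_K).T

raw_is_symmetric = np.allclose(raw_scores, raw_scores.T)
proj_is_symmetric = np.allclose(proj_scores, proj_scores.T)

print(raw_is_symmetric)   # True: h @ h.T is symmetric by construction
print(proj_is_symmetric)  # expected False for generic random weights
```

Because the raw scores depend only on h, a second head without projections would reproduce exactly the same matrix.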

Toy Example

Hidden states for 3 tokens with d_model=4, projecting to d_head=3:

h = [3 tokens, 4 dims] (hidden states)

W_Q = [4, 3] (query weights)

W_K = [4, 3] (key weights)

W_V = [4, 3] (value weights)

Q = h × W_Q → [3, 3]

K = h × W_K → [3, 3]

V = h × W_V → [3, 3]

Same input, three different projections. Same output shape, but different learned transformations.
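The toy example can be run as a short NumPy sketch. The weights here are random placeholders rather than trained values, so only the shapes are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d_model, d_head = 3, 4, 3

# Hidden states: one row per token.
h = rng.normal(size=(n_tokens, d_model))

# Random stand-ins for the three learned weight matrices.
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))

# Same input, three different projections.
Q = h @ W_Q
K = h @ W_K
V = h @ W_V

print(Q.shape, K.shape, V.shape)  # (3, 3) (3, 3) (3, 3)
```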

Shapes

Hidden states: [n_tokens, d_model]

W_Q: [d_model, d_q], so Q = h × W_Q → [n_tokens, d_q]

W_K: [d_model, d_k], so K = h × W_K → [n_tokens, d_k]

W_V: [d_model, d_v], so V = h × W_V → [n_tokens, d_v]

d_q must equal d_k, because each attention score is a dot product between a query vector and a key vector. d_v can differ but usually matches.
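The shape rules can be checked mechanically. The helper below, `project`, is a hypothetical name introduced for this sketch and is not taken from any library. It rejects mismatched query and key widths and lets d_v differ.

```python
import numpy as np

def project(h, W_Q, W_K, W_V):
    """Project hidden states to Q, K, V, checking that d_q == d_k."""
    if W_Q.shape[1] != W_K.shape[1]:
        raise ValueError("d_q must equal d_k for dot-product attention")
    return h @ W_Q, h @ W_K, h @ W_V

# Illustrative sizes: d_v is deliberately different from d_k.
n_tokens, d_model, d_k, d_v = 5, 16, 4, 6
h = np.ones((n_tokens, d_model))

Q, K, V = project(
    h,
    np.ones((d_model, d_k)),
    np.ones((d_model, d_k)),
    np.ones((d_model, d_v)),
)

# One score per (query token, key token) pair.
scores = Q @ K.T

print(scores.shape, V.shape)  # (5, 5) (5, 6)
```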

Math

Each projection is a matrix multiply:

Q = h ⋅ W_Q

K = h ⋅ W_K

V = h ⋅ W_V

Three independent linear transforms of the same input.

Implementation Hook

In llama.cpp, Q/K/V projections are matrix multiplies in the attention block. Some models fuse Q, K, and V into a single large weight matrix and split the result, while others keep them as separate weight tensors. You will see both patterns in model implementations.
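The fused and separate patterns give the same numbers. The NumPy sketch below is not llama.cpp code. It assumes the simplest fused layout, with W_Q, W_K, and W_V concatenated side by side along the output dimension; real checkpoints may order or interleave the fused tensor differently.

```python
import numpy as np

rng = np.random.default_rng(0)

n_tokens, d_model, d_head = 3, 8, 4
h = rng.normal(size=(n_tokens, d_model))

# Separate weights (random stand-ins for learned values).
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
W_V = rng.normal(size=(d_model, d_head))

# Assumed fused layout: [W_Q | W_K | W_V] along the output axis.
W_QKV = np.concatenate([W_Q, W_K, W_V], axis=1)  # [d_model, 3 * d_head]

# One large multiply, then split the result into three equal parts.
QKV = h @ W_QKV
Q_fused, K_fused, V_fused = np.split(QKV, 3, axis=1)

same = (
    np.allclose(Q_fused, h @ W_Q)
    and np.allclose(K_fused, h @ W_K)
    and np.allclose(V_fused, h @ W_V)
)
print(same)  # True
```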

In the wild: What this course calls W_Q, W_K, W_V appears in model checkpoints as tensor names like model.layers.0.self_attn.q_proj.weight, ...k_proj.weight, ...v_proj.weight. Recognizing these names in weight files and config files is a practical skill you will use when inspecting models.
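Spotting these names can be automated with a small pattern match. The sketch below uses only the Python standard library. The list of names is illustrative and written in the style quoted above (the non-attention entry is included purely as a negative example), and `classify` is a hypothetical helper, not part of any model-loading API.

```python
import re

# Illustrative tensor names; a real checkpoint supplies its own list.
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.k_proj.weight",
    "model.layers.0.self_attn.v_proj.weight",
    "model.layers.0.mlp.up_proj.weight",  # not a Q/K/V projection
]

# Capture the layer index and which of q/k/v the tensor is.
pattern = re.compile(r"layers\.(\d+)\.self_attn\.([qkv])_proj\.weight$")

def classify(name):
    """Return (layer_index, course_symbol), or None if not a Q/K/V weight."""
    match = pattern.search(name)
    if match is None:
        return None
    return int(match.group(1)), "W_" + match.group(2).upper()

for name in names:
    print(name, "->", classify(name))
```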

src/models/gemma.cpp — Q/K/V projections (L33)

Performance Hook

Q/K/V projections are three matrix multiplies per layer. For a model with d_model = 4096, each projection is a [n_tokens, 4096] × [4096, d_head × n_heads] multiply. Fusing Q, K, and V into one multiply replaces three smaller multiplies with a single larger one, which can improve GPU utilization.
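A rough cost estimate makes the size of these multiplies concrete. The sketch below rests on several assumptions: 2 floating-point operations per multiply-accumulate, 32 heads with d_head = 128 (the configuration used in the quiz below), and Q, K, and V all using the same number of heads. `projection_flops` is a hypothetical helper name.

```python
def projection_flops(n_tokens, d_model, d_out):
    """Approximate FLOPs for a [n_tokens, d_model] x [d_model, d_out] multiply."""
    # One multiply and one add per (token, input dim, output dim) triple.
    return 2 * n_tokens * d_model * d_out

d_model, n_heads, d_head = 4096, 32, 128
n_tokens = 1  # a single token, as in one decoding step

d_out = n_heads * d_head
per_projection = projection_flops(n_tokens, d_model, d_out)
weights_per_projection = d_model * d_out

print(d_out)                   # 4096
print(weights_per_projection)  # 16777216
print(per_projection)          # 33554432
print(3 * per_projection)      # 100663296 for Q, K, and V together
```

Fusing does not change the total count; it changes how the work is packaged, from three multiplies of this size into one that is three times as wide.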

Check Yourself

shapeQ1

If hidden states have shape [n, 4096] and W_Q projects to d_head=128 with 32 heads, what is the shape of the full Q tensor before splitting into heads?

[n, 4096]: Q has the same shape as the input

[n, 4096]: because 32 × 128 = 4096

[n, 128]: one d_head column per token

shapeQ2

If hidden states have shape [5, 512] and W_Q has shape [512, 64], what is the shape of Q?

[5, 512]

[512, 64]

[5, 64]