Q, K, and V Are Just Learned Projections

What are Q, K, and V?

Before attention can run, the model needs three versions of the hidden states: queries (Q), keys (K), and values (V). These are not three different inputs — they all come from the same hidden states, projected through three different learned weight matrices.

Think of it this way: Q asks "what am I looking for?", K says "what do I contain?", and V says "what information should I pass along if selected?" The Q-K interaction determines where to attend. V determines what gets passed along.
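To make those roles concrete, here is a minimal single-head sketch in NumPy. The random Q, K, and V are stand-ins for already-projected hidden states. The 1/sqrt(d_head) scaling and the softmax are the usual scaled dot-product convention, not something this lesson has introduced yet, and masking is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_head = 3, 4

# Stand-ins for already-projected queries, keys, and values (illustrative only).
Q = rng.normal(size=(n_tokens, d_head))
K = rng.normal(size=(n_tokens, d_head))
V = rng.normal(size=(n_tokens, d_head))

# Q-K interaction: one score per (query token, key token) pair.
scores = Q @ K.T / np.sqrt(d_head)

# Softmax over keys turns each row of scores into attention weights.
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

# V supplies the content: each output row is a weighted mix of value rows.
out = weights @ V

print(weights.shape, out.shape)  # (3, 3) (3, 4)
```

Notice that V never influences the weights, and Q and K never appear in the output except through the weights.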

The projections are simple linear transforms: matrix multiplies with learned weights. Many modern models, including the LLaMA family, omit the additive bias term, though some architectures keep it. The three weight matrices are the core learned parameters of attention.

You might wonder: why not skip the projections and compute attention directly on the hidden states? Two reasons:

  1. Separation of concerns. A token's hidden state encodes everything the model knows about that position. But for attention, you need different information for different roles: the "what am I looking for?" question (Q) is different from the "what do I contain?" answer (K), which is different from "what should I pass along?" (V). Three separate projections let the model learn to extract the right information for each role.
  2. Dimension control. The hidden state has dimension d_model (e.g., 4096), but each attention head operates in a smaller d_head space (e.g., 128). The projections reduce the dimension, which makes the per-head dot products cheaper and lets different heads learn different subspaces.

In short: the projections are where the model learns what attention should pay attention to. Without them, every head at every layer would ask the same question of the same data — there would be no learned specialization.
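One way to see what is lost without projections: dot products taken directly between hidden states always give a symmetric score matrix, so before the softmax token i scores token j exactly as token j scores token i. The sketch below checks this with random stand-in values. The weights are random rather than learned, so it only illustrates the structural difference.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 3, 4, 3
h = rng.normal(size=(n_tokens, d_model))

# No projections: scores are h @ h.T, which is always symmetric.
raw_scores = h @ h.T
print(np.allclose(raw_scores, raw_scores.T))  # True

# With separate W_Q and W_K, "asking" and "answering" use different views
# of the same hidden state, so the score matrix is generally not symmetric.
W_Q = rng.normal(size=(d_model, d_head))
W_K = rng.normal(size=(d_model, d_head))
proj_scores = (h @ W_Q) @ (h @ W_K).T
print(np.allclose(proj_scores, proj_scores.T))
```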

Hidden states for 3 tokens with d_model=4, projecting to d_head=3:

h = [3 tokens, 4 dims]   (hidden states)
W_Q = [4, 3]   (query weights)
W_K = [4, 3]   (key weights)
W_V = [4, 3]   (value weights)
Q = h × W_Q → [3, 3]
K = h × W_K → [3, 3]
V = h × W_V → [3, 3]

Same input, three different projections. Same output shape, but different learned transformations.
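The toy example above can be reproduced in a few lines of NumPy. The values are random placeholders. Only the shapes matter here.

```python
import numpy as np

rng = np.random.default_rng(0)

h = rng.normal(size=(3, 4))    # 3 tokens, d_model = 4
W_Q = rng.normal(size=(4, 3))  # d_model -> d_head = 3
W_K = rng.normal(size=(4, 3))
W_V = rng.normal(size=(4, 3))

# Same input h, three different weight matrices.
Q = h @ W_Q
K = h @ W_K
V = h @ W_V

print(Q.shape, K.shape, V.shape)  # (3, 3) (3, 3) (3, 3)
```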

Hidden states: [n_tokens, d_model]
W_Q: [d_model, d_q]    Q = h × W_Q → [n_tokens, d_q]
W_K: [d_model, d_k]    K = h × W_K → [n_tokens, d_k]
W_V: [d_model, d_v]    V = h × W_V → [n_tokens, d_v]
d_q must equal d_k, because dot-product attention takes the dot product of each query with each key. d_v can differ but usually matches.
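A short shape check shows why the query and key widths have to agree while the value width is free. The sizes below are arbitrary illustrative choices, and the softmax is omitted because it does not change any shapes.

```python
import numpy as np

n_tokens, d_model = 5, 8
d_q = d_k = 4
d_v = 6  # deliberately different from d_k

rng = np.random.default_rng(0)
h = rng.normal(size=(n_tokens, d_model))
Q = h @ rng.normal(size=(d_model, d_q))
K = h @ rng.normal(size=(d_model, d_k))
V = h @ rng.normal(size=(d_model, d_v))

scores = Q @ K.T  # [n_tokens, n_tokens]; only works because d_q == d_k
out = scores @ V  # [n_tokens, d_v]
print(scores.shape, out.shape)  # (5, 5) (5, 6)

# A key projection with a different width makes the score product fail.
K_bad = h @ rng.normal(size=(d_model, d_k + 1))
try:
    Q @ K_bad.T
    mismatch_failed = False
except ValueError:
    mismatch_failed = True
print(mismatch_failed)  # True
```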

Each projection is a matrix multiply:

Q = h ⋅ W_Q
K = h ⋅ W_K
V = h ⋅ W_V
Three independent linear transforms of the same input.

In llama.cpp, Q/K/V projections are matrix multiplies in the attention block. Some models fuse Q, K, and V into a single large weight matrix and split the result, while others keep them as separate weight tensors. You will see both patterns in model implementations.
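The two patterns compute the same thing. The sketch below is plain NumPy, not llama.cpp code. It concatenates the three weight matrices column-wise and splits the result back into three parts. Real checkpoints may order or interleave the fused columns differently, so treat the layout here as an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, d_out = 3, 8, 8
h = rng.normal(size=(n_tokens, d_model))
W_Q = rng.normal(size=(d_model, d_out))
W_K = rng.normal(size=(d_model, d_out))
W_V = rng.normal(size=(d_model, d_out))

# Separate pattern: three matrix multiplies.
Q, K, V = h @ W_Q, h @ W_K, h @ W_V

# Fused pattern: one multiply against [d_model, 3 * d_out], then split.
W_QKV = np.concatenate([W_Q, W_K, W_V], axis=1)
Q_f, K_f, V_f = np.split(h @ W_QKV, 3, axis=1)

print(np.allclose(Q, Q_f), np.allclose(K, K_f), np.allclose(V, V_f))  # True True True
```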

In the wild: What this course calls W_Q, W_K, W_V appears in model checkpoints as tensor names like model.layers.0.self_attn.q_proj.weight, ...k_proj.weight, ...v_proj.weight. Recognizing these names in weight files and config files is a practical skill you will use when inspecting models.
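As a small illustration of spotting these names, the sketch below filters a list of tensor names with a regular expression. The list is hypothetical: it assumes the k_proj and v_proj names share the q_proj prefix quoted above, and the mlp entry is only there as a non-matching example. In practice the names would come from the checkpoint itself.

```python
import re

# Hypothetical tensor names following the pattern quoted above.
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.k_proj.weight",
    "model.layers.0.self_attn.v_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
]

# Capture the layer index and which of q/k/v the tensor belongs to.
pattern = re.compile(r"layers\.(\d+)\.self_attn\.([qkv])_proj\.weight$")

qkv = {}
for name in names:
    m = pattern.search(name)
    if m:
        layer, role = int(m.group(1)), m.group(2).upper()
        qkv[(layer, role)] = name

print(sorted(qkv))  # [(0, 'K'), (0, 'Q'), (0, 'V')]
```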

Q/K/V projections are three matrix multiplies per layer. For a model with d_model = 4096, each projection is a [n_tokens, 4096] × [4096, d_head × n_heads] multiply. Fusing Q, K, and V into one multiply can improve GPU utilization, because a single larger matrix multiply typically keeps the hardware busier than three smaller ones.
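To get a feel for the sizes involved, here is the arithmetic for one projection. It assumes d_head = 128 and n_heads = 32 as illustrative values, a single token, no bias, and that Q, K, and V all use the same number of heads. It counts one multiply and one add per weight.

```python
d_model, d_head, n_heads = 4096, 128, 32  # illustrative sizes
n_tokens = 1                              # hypothetical: a single token

d_out = d_head * n_heads                         # output width of one projection
params_per_proj = d_model * d_out                # weights in one matrix, no bias
flops_per_proj = 2 * n_tokens * d_model * d_out  # multiply + add per weight

print(d_out, params_per_proj, flops_per_proj)
# 4096 16777216 33554432
```

Multiply both totals by three for Q, K, and V together, and by the number of layers for the whole model.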

Check Yourself
Q1 (shape)

If hidden states have shape [n, 4096] and W_Q projects to d_head=128 with 32 heads, what is the shape of the full Q tensor before splitting into heads?

Q2 (shape)

If hidden states have shape [5, 512] and W_Q has shape [512, 64], what is the shape of Q?