Q, K, and V Are Just Learned Projections
Before attention can run, the model needs three versions of the hidden states: queries (Q), keys (K), and values (V). These are not three different inputs — they all come from the same hidden states, projected through three different learned weight matrices.
Think of it this way: Q asks "what am I looking for?", K says "what do I contain?", and V says "what information should I pass along if selected?" The Q-K interaction determines where to attend. V determines what gets passed along.
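For reference, the standard scaled dot-product attention formula shows this split of roles. The softmax over the Q-K scores produces the weights (where to attend), and V supplies the content that those weights mix. Masking is left out to keep the expression short, and d_head is the per-head dimension introduced later in this section.

```latex
\[
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_{\text{head}}}}\right) V
\]
```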
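The projections are simple linear transforms: matrix multiplies with learned weights. Many modern models, such as the Llama family, omit the additive bias term, though some, such as Qwen2, keep it for Q, K, and V. Alongside the output projection applied after attention, the three weight matrices are the core learned parameters of attention.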
You might wonder: why not skip the projections and compute attention directly on the hidden states? Two reasons:
- Separation of concerns. A token's hidden state encodes everything the model knows about that position. But for attention, you need different information for different roles: the "what am I looking for?" question (Q) is different from the "what do I contain?" answer (K), which is different from "what should I pass along?" (V). Three separate projections let the model learn to extract the right information for each role.
- Dimension control. The hidden state has dimension d_model (e.g., 4096), but each attention head operates in a smaller d_head space (e.g., 128). Each head's projection maps d_model down to d_head, which makes the per-head dot products cheaper and lets different heads learn different subspaces. Across all heads the total projected width is d_head × n_heads, which often equals d_model (e.g., 128 × 32 = 4096).
In short: the projections are where the model learns what attention should pay attention to. Without them, every head in a layer would compare the same raw hidden states in exactly the same way, so there would be no learned specialization.
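One way to see the problem with skipping the projections is to compute the scores both ways. The sketch below uses plain Python lists and made-up numbers. H, W_Q, W_K and the helper functions are hypothetical, chosen only to keep the arithmetic easy to follow. Scores computed directly from the hidden states are always symmetric (token i scores token j exactly as j scores i), and there is nothing to learn. With separate Q and K projections the scores can be asymmetric.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

# Hypothetical hidden states: 3 tokens, d_model = 2.
H = [[1, 0],
     [0, 2],
     [1, 1]]

# No projections: scores are H @ H^T, which is symmetric for any H.
raw_scores = matmul(H, transpose(H))

# Hypothetical stand-ins for learned weights (d_model = 2 -> d_head = 2).
W_Q = [[1, 0],
       [1, 1]]
W_K = [[1, 1],
       [0, 1]]

Q = matmul(H, W_Q)
K = matmul(H, W_K)
proj_scores = matmul(Q, transpose(K))

print("raw scores:      ", raw_scores)
print("projected scores:", proj_scores)
print("raw symmetric?      ", raw_scores == transpose(raw_scores))
print("projected symmetric?", proj_scores == transpose(proj_scores))
```

Different heads would each get their own W_Q and W_K, which is what lets them score the same tokens differently.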
Hidden states for 3 tokens with d_model=4, projecting to d_head=3:
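A minimal sketch of that setup in plain Python follows. The numbers in H, W_Q, W_K, and W_V are made up for illustration (real weights are learned during training), and the matmul and shape helpers are hypothetical conveniences so the example runs without any libraries.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def shape(m):
    """Return (rows, columns) of a list-of-rows matrix."""
    return (len(m), len(m[0]))

# Hypothetical hidden states: 3 tokens, d_model = 4.
H = [[1, 0, 2, 1],
     [0, 1, 1, 0],
     [2, 1, 0, 1]]

# Hypothetical weights, each of shape [d_model, d_head] = [4, 3].
W_Q = [[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 1, 1]]
W_K = [[0, 1, 0],
       [1, 0, 0],
       [0, 0, 1],
       [0, 1, 1]]
W_V = [[1, 1, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 1, 0]]

# The same H goes through each of the three weight matrices.
Q = matmul(H, W_Q)
K = matmul(H, W_K)
V = matmul(H, W_V)

for name, m in (("Q", Q), ("K", K), ("V", V)):
    print(name, shape(m), m)
```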
Same input, three different projections. Same output shape, but different learned transformations.
Each projection is a matrix multiply:
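Written out, with H standing for the hidden states of shape [n_tokens, d_model] (the symbol H and the output width d_out are this sketch's notation, not necessarily the course's), and with no bias term:

```latex
\[
Q = H\,W_Q, \qquad K = H\,W_K, \qquad V = H\,W_V,
\qquad
H \in \mathbb{R}^{n_{\text{tokens}} \times d_{\text{model}}},\quad
W_Q, W_K, W_V \in \mathbb{R}^{d_{\text{model}} \times d_{\text{out}}}
\]
\[
Q_{ij} = \sum_{k=1}^{d_{\text{model}}} H_{ik}\,(W_Q)_{kj}
\]
```

Each output entry is the dot product of one token's hidden state with one column of the weight matrix, so the result has one row per token and one column per output dimension.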
In llama.cpp, Q/K/V projections are matrix multiplies in the attention block. Some models fuse Q, K, and V into a single large weight matrix and split the result, while others keep them as separate weight tensors. You will see both patterns in model implementations.
In the wild: What this course calls W_Q, W_K, W_V appears in model checkpoints as tensor names like model.layers.0.self_attn.q_proj.weight, ...k_proj.weight, ...v_proj.weight. Recognizing these names in weight files and config files is a practical skill you will use when inspecting models.
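As a rough sketch of how you might pick these tensors out of a list of names, the snippet below matches the naming pattern quoted above with a regular expression. The list itself is hypothetical. It assumes a Hugging Face Llama-style layout in which the k_proj and v_proj names share the q_proj prefix, and it includes two other names (o_proj and embed_tokens) only to show that they are not matched. The find_qkv helper is likewise made up for this example.

```python
import re

# Hypothetical listing of checkpoint tensor names (Llama-style layout assumed).
tensor_names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.k_proj.weight",
    "model.layers.0.self_attn.v_proj.weight",
    "model.layers.0.self_attn.o_proj.weight",
    "model.layers.1.self_attn.q_proj.weight",
    "model.embed_tokens.weight",
]

# Capture the layer index and which of q/k/v the tensor belongs to.
PATTERN = re.compile(r"layers\.(\d+)\.self_attn\.([qkv])_proj\.weight$")

def find_qkv(names):
    """Map (layer index, 'q' | 'k' | 'v') to the matching tensor name."""
    found = {}
    for name in names:
        match = PATTERN.search(name)
        if match:
            found[(int(match.group(1)), match.group(2))] = name
    return found

qkv = find_qkv(tensor_names)
for (layer, role), name in sorted(qkv.items()):
    print(f"layer {layer}  W_{role.upper()}  <-  {name}")
```

One thing to watch when checking shapes: a weight saved from a PyTorch nn.Linear layer is stored as [out_features, in_features], so it may appear transposed relative to the [d_model, d_out] convention used in this section.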
Q/K/V projections are three matrix multiplies per layer. For a model with d_model = 4096, each projection is a [n_tokens, 4096] × [4096, d_head × n_heads] multiply. Fusing Q, K, and V into one multiply against a [4096, 3 × d_head × n_heads] matrix can improve GPU utilization: the hidden states are read once, and the GPU runs one larger operation instead of three smaller ones.
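The fused form gives exactly the same numbers as three separate multiplies, which a small sketch can confirm. Everything here is hypothetical: the tiny sizes, the values, and the simple side-by-side column layout of the fused matrix. Real implementations differ in how they order or interleave the fused weights.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

# Hypothetical sizes: 2 tokens, d_model = 3, projection width = 2.
H = [[1, 2, 0],
     [0, 1, 3]]
W_Q = [[1, 0], [0, 1], [1, 1]]
W_K = [[2, 0], [1, 1], [0, 1]]
W_V = [[0, 1], [1, 0], [2, 2]]

# Fuse: place the three matrices side by side -> shape [d_model, 3 * width].
W_QKV = [q_row + k_row + v_row for q_row, k_row, v_row in zip(W_Q, W_K, W_V)]

# One multiply, then slice every output row back into its Q, K, V parts.
fused = matmul(H, W_QKV)
width = len(W_Q[0])
Q = [row[:width] for row in fused]
K = [row[width:2 * width] for row in fused]
V = [row[2 * width:] for row in fused]

# The slices match what three separate multiplies would produce.
assert Q == matmul(H, W_Q)
assert K == matmul(H, W_K)
assert V == matmul(H, W_V)
print("fused output:", fused)
```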
If hidden states have shape [n, 4096] and W_Q projects to d_head=128 with 32 heads, what is the shape of the full Q tensor before splitting into heads?
If hidden states have shape [5, 512] and W_Q has shape [512, 64], what is the shape of Q?
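Questions like these come down to the shape rule for matrix multiplication: the inner dimensions must match, and the result takes the outer ones. The helper below is a hypothetical sketch for checking your reasoning, shown with numbers that differ from the questions above.

```python
def matmul_shape(a_shape, b_shape):
    """Return the shape of A @ B for 2-D shapes, or raise if they do not line up."""
    (rows, inner_a), (inner_b, cols) = a_shape, b_shape
    if inner_a != inner_b:
        raise ValueError(f"inner dimensions differ: {inner_a} vs {inner_b}")
    return (rows, cols)

# Hypothetical example: 7 tokens, d_model = 256, projecting to width 32.
print(matmul_shape((7, 256), (256, 32)))
```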