
Linear Projections Change Representation Size

What is a linear projection?

A linear projection is a matrix multiply with a learned weight matrix. It can change the width of your vectors, or keep it the same. If your hidden state has dimension 2048 and you need a query vector of dimension 256, you multiply by a learned weight matrix of shape [2048, 256], which maps a [2048] vector to a [256] vector.

The weight matrix is learned during training. It decides which combinations of input features matter for the output. You do not need to understand how the learning works yet; what matters here is that the projection is a matrix multiply with a specific shape contract.

This is the bridge from linear algebra to transformer semantics. The same hidden state can be projected into different spaces with different learned matrices. That is why one input vector can become a query, a key, a value, or an FFN-expanded representation.

Another way to say this: a projection is not merely resizing a vector. It is re-expressing that vector in a new learned coordinate system. The output coordinates ask different questions of the input than the original coordinates did.

Each column of the weight matrix defines one output feature. When you compute output[j] = hidden · W[:, j], you are asking: "how much does this hidden state align with column j?"

If column 0 of W_q has large values in the positions corresponding to "syntax-related" features and small values elsewhere, then output[0] measures "how syntactic is this token?" Column 1 might measure something entirely different. The model learns these columns during training, and each one becomes a learned "question" that extracts a specific kind of information from the input.

This is why the same hidden state can be projected into Q, K, and V using different weight matrices: each matrix asks different questions of the same input. W_q asks "what are you looking for?", W_k asks "what do you contain?", and W_v asks "what should you pass along?", but they all start from the same hidden vector.

Projections can shrink, widen, or preserve the dimension of a vector. All three cases appear throughout the model:

  • Contract (d_model → d_head): Q, K, V projections reduce dimension, for example from 4096 to 128 per head. This is a lossy compression: the model learns which 128 features matter for this head's role.
  • Expand (d_model → d_ff): The FFN up-projection widens, for example from 4096 to 16384. This gives the activation function a wider space to work in, with more features to selectively activate or suppress.
  • Same dimension (d_model → d_model): The output projection W_o maps attention output back to model width. Same size in and out, but the content is transformed.

The shape contract is always: [n, d_in] × [d_in, d_out] = [n, d_out]. The inner dimension d_in must match. The output width d_out is determined by the weight matrix shape — it is a design choice.
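The shape contract can be checked mechanically. The sketch below is a minimal illustration in plain Python: the helper name `projected_shape` is hypothetical, the sequence length of 10 is arbitrary, and the widths are the example values from the list above.

```python
def projected_shape(x_shape, w_shape):
    """Return the shape of X @ W, or raise if the inner dimensions differ."""
    n, d_in = x_shape
    w_in, d_out = w_shape
    if d_in != w_in:
        raise ValueError(f"inner dimensions differ: {d_in} vs {w_in}")
    return (n, d_out)


n_tokens = 10  # arbitrary sequence length, chosen only for illustration

contract = projected_shape((n_tokens, 4096), (4096, 128))    # d_model -> d_head
expand = projected_shape((n_tokens, 4096), (4096, 16384))    # d_model -> d_ff
same = projected_shape((n_tokens, 4096), (4096, 4096))       # d_model -> d_model

print(contract, expand, same)
```

In all three cases the row count n is untouched; only the last dimension follows the weight matrix.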

Projecting from d_model=4 to d_q=2:

hidden = [2, -1, 3, 0]
W_q =
[[1, 0],
 [0, 1],
 [1, -1],
 [0, 2]]

Step by step:
output[0] = 2×1 + (-1)×0 + 3×1 + 0×0 = 2 + 0 + 3 + 0 = 5
output[1] = 2×0 + (-1)×1 + 3×(-1) + 0×2 = 0 - 1 - 3 + 0 = -4

hidden × W_q = [5, -4]

Each output value is a dot product of the hidden vector with one column of W_q. output[0] uses column 0 = [1, 0, 1, 0], so it picks out hidden[0] + hidden[2]. output[1] uses column 1 = [0, 1, -1, 2], a different combination: hidden[1] - hidden[2] + 2×hidden[3].

Each output coordinate is a weighted sum of the input coordinates. The projection does not just shrink width; it mixes features into a new basis.
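The column-by-column reading can be reproduced in a few lines. This is a minimal sketch in plain Python using the toy values above; the helper names `column` and `dot` are illustrative, not taken from any library.

```python
hidden = [2, -1, 3, 0]
W_q = [
    [1, 0],
    [0, 1],
    [1, -1],
    [0, 2],
]


def column(matrix, j):
    """Return column j of a row-major nested list."""
    return [row[j] for row in matrix]


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


# One dot product per column of W_q gives one output coordinate.
output = [dot(hidden, column(W_q, j)) for j in range(len(W_q[0]))]
print(output)  # [5, -4]
```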

W_k =
[[0, 1],
 [1, 0],
 [-1, 2],
 [1, 0]]
hidden × W_k = [-4, 8]

Same input, different learned matrix, different output space. That is the key idea behind Q/K/V: the role changes because the projection changes.
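To see the role change concretely, the same hidden vector can be pushed through both toy matrices. The sketch below assumes plain Python lists; `project` is a hypothetical helper name.

```python
def project(vector, matrix):
    """Multiply a row vector of length d_in by a [d_in, d_out] matrix."""
    d_out = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vector, matrix)) for j in range(d_out)]


hidden = [2, -1, 3, 0]
W_q = [[1, 0], [0, 1], [1, -1], [0, 2]]
W_k = [[0, 1], [1, 0], [-1, 2], [1, 0]]

q = project(hidden, W_q)  # query-space coordinates
k = project(hidden, W_k)  # key-space coordinates
print(q, k)  # [5, -4] [-4, 8]
```

The input never changes between the two calls; only the matrix does.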

Focus on a single column of the weight matrix. That column defines one output coordinate. In the toy example, the first output coordinate of hidden × W_q is the dot product between hidden and the first column of W_q.

So each output coordinate is a learned weighted sum of the input coordinates. If the input vector has features the model cares about for that output role, the corresponding weights amplify or suppress those features. The full projection simply computes many such learned coordinates at once.

This is why columns of the projection matrix are such a useful mental object. Each column is one learned feature extractor for the new representation space.

Input row: [h₀ h₁ h₂ h₃] (one token state in the old space)
Weight columns: column 0 = [w₀₀ w₁₀ w₂₀ w₃₀], column 1 = [w₀₁ w₁₁ w₂₁ w₃₁] (each column defines one learned output feature)
Output row: [q₀ q₁] (same token, new role-specific coordinates)

Projection: X [n_tokens, d_in] × W [d_in, d_out] = Y [n_tokens, d_out]
Common uses in LLMs:
Q projection: [n, d_model] × [d_model, d_q] → [n, d_q]
Output head: [n, d_model] × [d_model, |V|] → [n, |V|]
FFN up-projection: [n, d_model] × [d_model, d_ff] → [n, d_ff]

Suppose the same hidden state goes through W_q, W_k, and W_v. The input row is identical in all three cases. What changes is the learned matrix.

That means each projection asks a different set of questions of the same input. One learned space may be good for "what information am I looking for?" Another may be good for "what information do I contain that others might want?" Another may be good for the actual content to be transferred forward.

The exact semantics are learned, not hand-programmed. But the structural idea is simple: change the projection matrix, and you change the operational role of the resulting vector.

Y = X W
where W is a learned weight matrix of shape [d_in, d_out]

Read the columns of W as output features. Each column tells you how to combine the input coordinates to produce one output coordinate.

For a whole token sequence, the same projection matrix is applied to every token row. That shared learned map is exactly what gives transformer layers their strong regular structure.

Some models also add a bias vector after the multiply: Y = XW + b, where b has shape [d_out] and is added to every row. Other models (including LLaMA) skip the bias entirely. When reading real code, check whether a bias is present; the core shape logic stays the same either way.
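A minimal sketch of the optional bias, assuming plain Python lists. The function name `linear`, the second input row, and the bias values are made up for illustration; W_q is the toy matrix from earlier.

```python
def linear(X, W, b=None):
    """Compute X @ W, then add b to every row if a bias is given."""
    d_out = len(W[0])
    Y = [[sum(x * w_row[j] for x, w_row in zip(row, W)) for j in range(d_out)]
         for row in X]
    if b is not None:
        # b has shape [d_out]; the same vector is added to each output row.
        Y = [[y + b_j for y, b_j in zip(row, b)] for row in Y]
    return Y


X = [[2, -1, 3, 0], [1, 1, 1, 1]]        # two token rows, d_in = 4
W_q = [[1, 0], [0, 1], [1, -1], [0, 2]]  # toy weights, shape [4, 2]
b = [10, -10]                            # hypothetical bias, shape [2]

no_bias = linear(X, W_q)
with_bias = linear(X, W_q, b)
print(no_bias)    # [[5, -4], [2, 2]]
print(with_bias)  # [[15, -14], [12, -8]]
```

The output shape is [2, 2] in both cases, which is the point: the bias shifts values but does not change the shape contract.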

The toy example used one hidden vector because that is the smallest readable case. But real models often apply projections to a whole matrix of hidden states at once, one row per token, not just a single row.

If X has shape [n_tokens, d_model], then the projection produces Y = XW with shape [n_tokens, d_out]. Each token row is remapped into the new space, and all rows share the same learned matrix. This "one matrix, many token rows" pattern is one of the most important repeated structures in the entire model.

Now combine the running example from the embeddings page with the projection idea from this page. Start from the hidden matrix:

X = [
  [0.12, -0.34, 0.56, 0.78],
  [0.91, 0.23, -0.67, 0.45],
  [-0.11, 0.89, 0.33, -0.55]
]

Using the toy matrix W_q from above, the whole sequence projects to:

Q = XW_q = [
  [0.68, 0.66],
  [0.24, 1.80],
  [0.22, -0.54]
]

The exact numbers are not the main point. The shape and interpretation are. Three token rows came in, and three token rows came out. What changed is the feature space: each row is now a query vector instead of a generic hidden-state vector.
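The numbers above can be reproduced with a small nested-list matrix multiply. This is a sketch in plain Python; `matmul` is a hypothetical helper, and the results are rounded to two decimals because floating-point sums are not exact.

```python
def matmul(A, B):
    """Multiply A [n, d_in] by B [d_in, d_out] using nested lists."""
    if len(A[0]) != len(B):
        raise ValueError("inner dimensions must match")
    d_out = len(B[0])
    return [[sum(a * b_row[j] for a, b_row in zip(row, B)) for j in range(d_out)]
            for row in A]


X = [
    [0.12, -0.34, 0.56, 0.78],
    [0.91, 0.23, -0.67, 0.45],
    [-0.11, 0.89, 0.33, -0.55],
]
W_q = [[1, 0], [0, 1], [1, -1], [0, 2]]

# Every token row goes through the same W_q.
Q = [[round(v, 2) for v in row] for row in matmul(X, W_q)]
for row in Q:
    print(row)
```

Each row of Q depends only on the matching row of X, so the three tokens are projected independently even though they share one matrix.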

When you see a projection in code, ask these questions in order:

1. What is the input tensor shape?
2. What is the weight matrix shape?
3. What output shape follows from the multiply?
4. What semantic role does the new space have: Q, K, V, FFN-up, output logits?

Every Q, K, V, output, and FFN layer in a transformer uses projections. In llama.cpp, each projection is a ggml_mul_mat call with a stored weight tensor.

This is why transformer code can look repetitive: same operation family, different tensors, different semantic role.

Once you internalize that pattern, code navigation gets much easier. A projection site is no longer mysterious. You can ask: what is the input shape, what is the weight shape, what output shape comes out, and what semantic role does this output play next?

Projection cost is n_tokens × d_in × d_out multiply-adds. For large models, d_ff can be 4× d_model or more, which usually makes the FFN projections the most expensive matrix multiplies in a layer.
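The cost formula is easy to evaluate for the example widths used earlier on this page. The sketch below is only a counting exercise: `projection_macs` is a hypothetical helper, and it compares the d_model → d_model output projection with the d_model → d_ff up-projection for a single token.

```python
def projection_macs(n_tokens, d_in, d_out):
    """Multiply-adds for X [n_tokens, d_in] times W [d_in, d_out]."""
    return n_tokens * d_in * d_out


d_model, d_ff = 4096, 16384  # example widths from this page
n_tokens = 1                 # cost per token

w_o_macs = projection_macs(n_tokens, d_model, d_model)  # 16_777_216
ffn_up_macs = projection_macs(n_tokens, d_model, d_ff)  # 67_108_864

print(w_o_macs, ffn_up_macs, ffn_up_macs // w_o_macs)   # ratio is 4
```

With these sizes the up-projection costs four times as much as a model-width projection, which matches the d_ff = 4× d_model ratio.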

This also explains why prefill and decode behave differently: in prefill, many token rows are multiplied at once; in decode, a single new token row is multiplied by the same large matrices again at every generation step.

That distinction becomes central later when you reason about GEMM versus GEMV behavior. The math operation family is the same, but the runtime regime changes a lot depending on whether you are projecting many rows at once or one row at a time.
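One rough way to see the regime change is to count how many multiply-adds each weight element takes part in per projection call. This is a simplified counting argument, not a performance model; the helper name `macs_per_weight` and the prompt length of 512 are assumptions made for illustration.

```python
def macs_per_weight(n_tokens, d_in, d_out):
    """Multiply-adds performed per weight element in one projection call."""
    macs = n_tokens * d_in * d_out
    weights = d_in * d_out
    return macs / weights  # simplifies to n_tokens


d_model, d_ff = 4096, 16384  # example widths from this page

prefill_reuse = macs_per_weight(512, d_model, d_ff)  # many rows: matrix-matrix
decode_reuse = macs_per_weight(1, d_model, d_ff)     # one row: matrix-vector

print(prefill_reuse, decode_reuse)  # 512.0 1.0
```

In the many-row case each weight is used once per token row, while in the one-row case each weight is used once and then not needed again until the next step.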

Check Yourself
Q1 (shape)

If hidden states have shape [10, 512] and W_q has shape [512, 64], what is the Q shape?

Q2 (conceptual)

Why can the same hidden state produce both Q and K vectors?

Q3 (math)

If X has shape [3, 4] and W has shape [4, 2], how many output features does each token row have after projection?