Linear Algebra Synthesis: One Tiny Pipeline, End to End
The previous lessons introduced the building blocks one at a time: vectors, embeddings, dot products, matrix multiplication, projections, and shape reasoning. That is the right way to learn the parts, but a self-learner also needs one place where the whole chain is put back together.
This page is that landing point. The goal is not to teach any new linear algebra. The goal is to make the existing pieces feel like one coherent language that you can actually use while reading model code.
We reuse the same toy prompt throughout:
Before diving into the worked numbers, keep this one structural picture in mind. It is the entire linear algebra module compressed into one visual pipeline.
At the tokenizer boundary, the model has only discrete IDs. The first numerical step is an embedding lookup:
Stacked together, those rows form a hidden-state matrix:
Plain-English narration: three token positions, each represented by a 4-dimensional vector.
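The lookup and stacking steps can be sketched in a few lines of plain Python. The embedding table and token IDs below are made-up placeholders chosen only to show the mechanics; they are not this page's toy values.

```python
# Hypothetical embedding table: 5 vocabulary entries, 4 features each.
# All values are invented for illustration.
embedding_table = [
    [0.1, 0.0, -0.2, 0.3],   # id 0
    [0.5, -0.1, 0.2, 0.0],   # id 1
    [-0.3, 0.4, 0.1, 0.2],   # id 2
    [0.0, 0.2, -0.5, 0.1],   # id 3
    [0.2, 0.2, 0.2, -0.4],   # id 4
]

token_ids = [3, 1, 4]  # hypothetical IDs for a three-token prompt

# Embedding lookup is row selection: one table row per token ID.
X = [embedding_table[i] for i in token_ids]

shape = (len(X), len(X[0]))
print(shape)  # (3, 4)
```

The point of the sketch is that no arithmetic happens yet: the lookup only selects rows, and stacking them gives the [3, 4] hidden-state matrix.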
Take the first two embedded token rows:
Their dot product is:
That number is close to zero because some coordinate products are positive while others are negative, so they largely cancel. This is the core interpretation of the dot product: a compatibility score built from many coordinate-level agreements and disagreements.
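To see the cancellation coordinate by coordinate, here is a small sketch. The two vectors are hypothetical stand-ins, not the page's embedded rows.

```python
# Two hypothetical 4-dimensional token vectors (invented values).
a = [1.0, -2.0, 0.5, 1.0]
b = [2.0, 1.0, 3.0, -1.0]

# One product per coordinate: positive means agreement, negative means disagreement.
contributions = [x * y for x, y in zip(a, b)]
dot = sum(contributions)

print(contributions)  # [2.0, -2.0, 1.5, -1.0]
print(dot)            # 0.5
```

Two coordinates push the score up and two push it down, so the total stays small even though no single contribution is small.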
Now reuse the whole matrix X and apply a query projection W_q of shape [4, 2]:
The result is:
Nothing magical happened. Three token rows came in; three token rows came out. The only thing that changed is the feature space. Each row is now a query vector rather than a generic hidden-state vector.
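A minimal sketch of the projection step, using plain Python lists. The entries of `X` and `W_q` are invented integers so the arithmetic is easy to check by hand; only the shapes ([3, 4] and [4, 2]) come from the text, and `matmul` is a helper written for this example.

```python
def matmul(A, B):
    """Multiply A [n, k] by B [k, m] using plain Python lists."""
    assert len(A[0]) == len(B), "inner dimensions must match"
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical hidden states: 3 token rows, 4 features each.
X = [[1, 0, 2, 1],
     [0, 1, 1, 0],
     [2, 1, 0, 1]]

# Hypothetical query projection: 4 input features -> 2 output features.
W_q = [[1, 0],
       [0, 1],
       [1, 1],
       [0, 2]]

Q = matmul(X, W_q)
print(Q)                    # [[3, 4], [1, 2], [2, 3]]
print((len(Q), len(Q[0])))  # (3, 2)
```

Three rows go in and three rows come out; only the width of each row changes, from 4 to 2.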
Apply a different learned matrix to the same input X:
This produces:
The input rows are still the same rows from X. The role changed because the projection changed.
This is the structural reason one hidden-state matrix can produce Q, K, and V tensors with different meanings.
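The same idea in code: one input matrix, two different weight matrices, two differently valued outputs of the same shape. All numbers are hypothetical, and `project` is a helper defined only for this sketch.

```python
def project(X, W):
    """Return X @ W for X of shape [n, k] and W of shape [k, m]."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)] for row in X]

# Hypothetical hidden states, shape [3, 4].
X = [[1, 0, 2, 1],
     [0, 1, 1, 0],
     [2, 1, 0, 1]]

# Two hypothetical learned projections, both of shape [4, 2].
W_q = [[1, 0], [0, 1], [1, 1], [0, 2]]
W_k = [[0, 1], [1, 0], [1, -1], [1, 0]]

Q = project(X, W_q)
K = project(X, W_k)

print(Q)  # [[3, 4], [1, 2], [2, 3]]
print(K)  # [[3, -1], [2, -1], [2, 2]]
```

`X` is never modified. The difference between `Q` and `K` comes entirely from the weights.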
Since Q and K both have shape [3, 2], the score matrix has shape [3, 3]:
Why [3, 3]? Because every token position is compared against every token position.
If you want the actual toy numbers, they are:
But the deeper lesson is that you did not need the exact values to predict the structure. The shape alone already told you what kind of object this had to be: token-to-token comparison scores.
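As a sketch of the score computation, each entry of the grid is one dot product between a query row and a key row. The `Q` and `K` values here are hypothetical, not the page's toy numbers.

```python
# Hypothetical query and key matrices, both of shape [3, 2].
Q = [[3, 4], [1, 2], [2, 3]]
K = [[3, -1], [2, -1], [2, 2]]

# scores[i][j] = dot(Q[i], K[j]); this is Q multiplied by K transposed.
scores = [[sum(q * k for q, k in zip(q_row, k_row)) for k_row in K] for q_row in Q]

print(scores)                         # [[5, 2, 14], [1, 0, 6], [3, 1, 10]]
print((len(scores), len(scores[0])))  # (3, 3)
```

Row i of `scores` holds the comparisons made by token position i against all three positions, which is why the grid is 3 by 3 regardless of the feature width.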
The score matrix tells the model how strongly each token position should attend to every other position. But the scores alone do not carry content — they are just weights. The actual content comes from a third projection: the value projection.
Now the attention mechanism normalizes the scores into weights (softmax, covered in the next module) and uses them to take a weighted combination of value rows:
Read that shape trace carefully. Each output row is a weighted mix of all three value rows, where the weights come from that row's attention scores. The result has the same token count (3) and the same feature width as V (2).
This closes the attention loop: project to Q and K, compute pairwise scores, normalize, then use those weights to mix the V rows. Apart from the softmax normalization, which the next module covers, every step is a matrix multiply, so nothing here goes beyond what this module has taught.
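A sketch of the normalize-then-mix step. The softmax here is a standard textbook version included only so the example runs; the next module explains it properly. The score and value numbers are hypothetical.

```python
import math

def softmax(row):
    """Turn one row of scores into positive weights that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical score matrix [3, 3] and value matrix [3, 2].
scores = [[0.0, 0.0, 0.0],
          [2.0, 0.0, 0.0],
          [0.0, 1.0, 3.0]]
V = [[1.0, 0.0],
     [0.0, 1.0],
     [2.0, 2.0]]

weights = [softmax(row) for row in scores]

# output[i] = sum over j of weights[i][j] * V[j]; this is weights @ V.
output = [[sum(w * v_row[j] for w, v_row in zip(w_row, V)) for j in range(len(V[0]))]
          for w_row in weights]

print((len(output), len(output[0])))  # (3, 2)
```

Row 0 has equal scores, so its weights are all one third and its output is simply the average of the three value rows. The other rows lean toward whichever value row received the largest score.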
| Stage | What stayed the same | What changed |
|---|---|---|
| embedding | token count = 3 | IDs become 4-dimensional vectors |
| projection | same 3 token rows | feature width changes from 4 to 2 |
| score matrix | still about the same 3 token positions | representation becomes token-by-token comparisons [3,3] |
| weighted combination | still 3 token positions | each row is now a mix of value content, weighted by attention [3,2] |
Do not try to memorize every numeric toy value on this page. That is not the skill. The skill is reading each shape as a statement about tokens and features:
[3, 4] = three token rows, four features each.
[3, 2] = the same rows in a new projected space.
[3, 3] = every token compared against every token.
[3, 2] after weighted mix = each token's attended summary in the value space.
Worked Solution Exercise 1
If X has shape [3, 4] and W has shape [4, 2], what is the output shape?
The answer is [3, 2].
Reason it through in words, not just symbols: there are still 3 token rows, and each row is now expressed in a 2-feature output space. Matrix multiply changed the feature width, not the number of token rows.
Worked Solution Exercise 2
If Q and K both have shape [3, 2], why is the score matrix [3, 3]?
The answer is that the score matrix compares every token position against every token position.
There are 3 query positions and 3 key positions, so the result must be a 3-by-3 grid of pairwise comparisons. The feature width 2 is the shared inner dimension used to compute each score, not the size of the output grid.
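The shape rule used in Exercises 1 and 2 can be written down without any numeric data at all. `matmul_shape` is a hypothetical helper for this sketch, not a library function.

```python
def matmul_shape(a_shape, b_shape):
    """Shape of A @ B: the outer dimensions survive, the inner ones must match."""
    n, k = a_shape
    k2, m = b_shape
    if k != k2:
        raise ValueError(f"inner dimensions differ: {k} vs {k2}")
    return (n, m)

def transpose_shape(shape):
    """Transposing swaps rows and columns."""
    rows, cols = shape
    return (cols, rows)

# Exercise 1: X [3, 4] times W [4, 2].
print(matmul_shape((3, 4), (4, 2)))                   # (3, 2)

# Exercise 2: Q [3, 2] times K transposed [2, 3].
print(matmul_shape((3, 2), transpose_shape((3, 2))))  # (3, 3)
```

In both cases the shared inner dimension (4, then 2) disappears from the result, which is why the feature width never shows up in the score grid.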
Worked Solution Exercise 3
If a score matrix has shape [32, 32], how many pairwise token scores does it contain?
The answer is 32 × 32 = 1024.
This is a good shape-to-performance habit. Before you even benchmark anything, the shape already tells you there are 1,024 pairwise comparisons in that one matrix.
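The same counting habit as a tiny sketch. The sequence lengths are arbitrary examples, and `num_scores` is a helper named only for this illustration.

```python
def num_scores(n_tokens):
    """Number of entries in an [n_tokens, n_tokens] score matrix."""
    return n_tokens * n_tokens

for n in (3, 32, 64):
    print(n, num_scores(n))
# 3 9
# 32 1024
# 64 4096
```

Doubling the token count from 32 to 64 quadruples the number of pairwise scores, which is visible from the shape alone.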
Worked Solution Exercise 4
If the same hidden-state matrix X produces Q and K through different projections, what actually changed?
The input rows stayed the same, but the learned projection matrix changed.
That means the token positions are being re-expressed in a different learned feature space. The operational role changes because the projection changes, not because the original token rows suddenly became different objects.
Worked Solution Exercise 5
If weights (from softmax) have shape [3, 3] and V has shape [3, 2], what is the output shape and what does each row mean?
The output shape is [3, 2]: three token positions, each with 2 features.
Each output row is a weighted combination of all three value rows. The weights come from one row of the attention weight matrix — so output row 0 is the mix of V rows that token position 0 decided to attend to.
Linear algebra got us to a crucial point: we can now build token representations, compare them, and transform them through learned maps. But none of that yet answers how the model chooses the next token.
The next module picks up from there. It explains logits, softmax, and sampling: how raw scores over the vocabulary become actual token choices. So the story continues naturally:
You now know enough to read the structure of a real model builder. Here is a simplified view of how a Gemma transformer layer is constructed in llama.cpp. You do not need to understand every argument — just match the function calls to the concepts you have learned:
// For each layer in the model:
cur = build_norm(inpL, ...); // normalize before attention
Q = build_lora_mm(wq, cur); // projection: hidden → query space
K = build_lora_mm(wk, cur); // projection: hidden → key space
V = build_lora_mm(wv, cur); // projection: hidden → value space
cur = build_attn(..., Q, K, V, ...); // Q·Kᵀ → scores → softmax → weighted V
sa_out = ggml_add(cur, inpL); // residual connection
cur = build_norm(sa_out, ...); // normalize before FFN
cur = build_ffn(cur, ...); // expand → activate → contract
output = ggml_add(cur, sa_out); // residual connection
Based on ggml-org/llama.cpp @ 94ca829b — src/models/gemma.cpp. Simplified for this preview.
Every line maps to something you have already learned: build_lora_mm is a matrix multiply (a learned projection). ggml_add is vector addition (a residual connection). build_attn chains together the Q·Kᵀ scores, softmax, and V-weighted sum you just traced by hand.
You will read the full, unabridged version of this code in the M8 case studies. For now, the point is: model code is not alien. It is the operations you already know, called in the order you already understand.
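For readers more comfortable in Python, the same call order can be sketched as a small function. This is a rough illustration under stated assumptions, not llama.cpp's API: `rms_norm`, `add`, and `layer` are hypothetical names, the norm has no learned gain, and the attention and FFN sub-layers are passed in as placeholder functions.

```python
import math

def rms_norm(rows, eps=1e-6):
    """Scale each row so its root-mean-square is about 1 (no learned gain here)."""
    out = []
    for row in rows:
        rms = math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v / rms for v in row])
    return out

def add(A, B):
    """Elementwise addition of two equally shaped matrices (the residual step)."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def layer(inp, attention, ffn):
    """Layer skeleton: norm -> attention -> residual -> norm -> FFN -> residual."""
    cur = rms_norm(inp)
    cur = attention(cur)
    sa_out = add(cur, inp)   # residual connection around attention
    cur = rms_norm(sa_out)
    cur = ffn(cur)
    return add(cur, sa_out)  # residual connection around the FFN

# Placeholder sub-layers that return zeros, so only the residual path contributes.
zeros = lambda rows: [[0.0 for _ in row] for row in rows]

X = [[1.0, 2.0, 3.0, 4.0],
     [0.5, 0.5, 0.5, 0.5],
     [2.0, 0.0, -2.0, 0.0]]
out = layer(X, attention=zeros, ffn=zeros)
print(out == X)  # True
```

With both sub-layers silenced, the input passes through unchanged. That is the practical meaning of a residual connection: each sub-layer adds an update to the running hidden state rather than replacing it.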
Match each line from the code preview above to the concept it implements. Try before opening the answers.
build_norm(inpL, ...)
build_lora_mm(wq, cur)
build_attn(..., Q, K, V, ...)
ggml_add(cur, inpL)
build_ffn(cur, ...)
For each line, identify: RMSNorm, Q projection, attention (score + softmax + V-sum), residual connection, or FFN (expand + activate + contract).
Answers:
build_norm → RMSNorm: normalize before the attention sub-layer
build_lora_mm(wq, cur) → Q projection: multiply the hidden state by W_q
build_attn → Attention: Q·Kᵀ scores, softmax, weighted V sum, output projection
ggml_add → Residual connection: add the attention output back to the input
build_ffn → FFN: expand to d_ff, apply activation, contract back to d_model
If X has shape [3, 4] and W_q has shape [4, 2], what is shape(Q)?
What is the best plain-English reading of a [3, 3] attention score matrix?
How many scalar values are in a [3, 2] matrix?
Why can the same hidden-state matrix produce both Q and K?
If attention weights have shape [3, 3] and V has shape [3, 2], what is the output shape?