L10A

Linear Algebra Synthesis: One Tiny Pipeline, End to End

18 min

Question

Can you follow the whole tiny pipeline end to end?

Why This Recap Exists

The previous lessons introduced the building blocks one at a time: vectors, embeddings, dot products, matrix multiplication, projections, and shape reasoning. That is the right way to learn the parts, but a self-learner also needs one place where the whole chain is put back together.

This page is that landing point. The goal is not to teach any new linear algebra. The goal is to make the existing pieces feel like one coherent language that you can actually use while reading model code.

The Tiny Running Example

We reuse the same toy prompt throughout:

token pieces = ["The", " cat", " sat"]

token IDs = [791, 2368, 3290]

toy model width = d_model = 4

toy query/key width = d_q = d_k = 2

Diagram: The Whole Flow

Before diving into the worked numbers, keep this one structural picture in mind. It is the entire linear algebra module compressed into one visual pipeline.

Discrete Input: Token IDs, shape [3] (three token positions)

↓ embedding lookup

Numeric State: Hidden Matrix, shape [3, 4] (three rows, four features each)

↓ projection

Role Space: Q and K, shapes Q [3, 2] and K [3, 2] (two projected tensors from the same three rows)

↓ pairwise compare

Token Relations: Scores, shape [3, 3] (every token against every token)

Step 1: IDs Become Vectors

At the tokenizer boundary, the model has only discrete IDs. The first numerical step is an embedding lookup:

791 → [0.12, -0.34, 0.56, 0.78]

2368 → [0.91, 0.23, -0.67, 0.45]

3290 → [-0.11, 0.89, 0.33, -0.55]

Stacked together, those rows form a hidden-state matrix:

X = [

[0.12, -0.34, 0.56, 0.78],

[0.91, 0.23, -0.67, 0.45],

[-0.11, 0.89, 0.33, -0.55]

]

shape(X) = [3, 4]

Plain-English narration: three token positions, each represented by a 4-dimensional vector.
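The lookup above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular library stores its weights: `embedding_table` is a hypothetical dictionary holding only the three rows used on this page, whereas a real table is a matrix with one row per vocabulary entry.

```python
# Hypothetical toy embedding table: only the three rows used on this page.
# A real table has one row per vocabulary entry and is indexed by token ID.
embedding_table = {
    791: [0.12, -0.34, 0.56, 0.78],
    2368: [0.91, 0.23, -0.67, 0.45],
    3290: [-0.11, 0.89, 0.33, -0.55],
}

token_ids = [791, 2368, 3290]

# Embedding lookup: fetch one row per token ID and stack them in prompt order.
X = [embedding_table[token_id] for token_id in token_ids]

shape_X = (len(X), len(X[0]))
print(shape_X)  # (3, 4)
```

The lookup does no arithmetic. It only selects rows, which is why the token count (3) is preserved and the width (4) comes from the table.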

Step 2: One Dot Product Compares Two Rows

Take the first two embedded token rows:

x₁ = [0.12, -0.34, 0.56, 0.78]

x₂ = [0.91, 0.23, -0.67, 0.45]

Their dot product is:

x₁ · x₂ = 0.12×0.91 + (-0.34)×0.23 + 0.56×(-0.67) + 0.78×0.45

= 0.1092 - 0.0782 - 0.3752 + 0.3510

≈ 0.0068

That number is close to zero because the positive contributions (0.1092 + 0.3510 = 0.4602) are almost exactly cancelled by the negative ones (-0.0782 - 0.3752 = -0.4534). This is the core interpretation of the dot product: a compatibility score built from many coordinate-level agreements and disagreements.
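To check the arithmetic yourself, here is a small Python sketch that keeps the per-coordinate terms visible before summing them. The variable names `terms` and `dot` are chosen for this illustration only.

```python
x1 = [0.12, -0.34, 0.56, 0.78]
x2 = [0.91, 0.23, -0.67, 0.45]

# One product per coordinate: positive terms agree, negative terms disagree.
terms = [a * b for a, b in zip(x1, x2)]
dot = sum(terms)

print([round(t, 4) for t in terms])  # [0.1092, -0.0782, -0.3752, 0.351]
print(round(dot, 4))                 # 0.0068
```

Two of the four terms are positive and two are negative, so the total ends up much smaller than any single term.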

Step 3: A Projection Applies One Learned Map to Every Row

Now reuse the whole matrix X and apply a query projection W_q of shape [4, 2]:

W_q = [

[1, 0],

[0, 1],

[1, -1],

[0, 2]

]

The result is:

Q = XW_q = [

[0.68, 0.66],

[0.24, 1.80],

[0.22, -0.54]

]

shape(Q) = [3, 2]

Nothing magical happened. Three token rows came in; three token rows came out. The only thing that changed is the feature space. Each row is now a query vector rather than a generic hidden-state vector.
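The projection can be reproduced with a plain-Python matrix multiply. The `matmul` helper below is written for this page as a sketch. Real model code calls an optimized library routine instead.

```python
def matmul(a, b):
    """Multiply matrices stored as lists of rows: [n, m] x [m, p] -> [n, p]."""
    assert all(len(row) == len(b) for row in a), "inner dimensions must match"
    return [
        [sum(row[k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for row in a
    ]

X = [
    [0.12, -0.34, 0.56, 0.78],
    [0.91, 0.23, -0.67, 0.45],
    [-0.11, 0.89, 0.33, -0.55],
]
W_q = [[1, 0], [0, 1], [1, -1], [0, 2]]

Q = matmul(X, W_q)

# Round only for display; the raw floats carry tiny rounding noise.
Q_rounded = [[round(v, 2) for v in row] for row in Q]
print(Q_rounded)  # [[0.68, 0.66], [0.24, 1.8], [0.22, -0.54]]
```

Each output row depends only on the matching input row and on W_q, which is what "one learned map applied to every row" means in code.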

Step 4: A Second Projection Creates a Different Role

Apply a different learned matrix to the same input X:

W_k = [

[ 0, 1],

[ 1, 0],

[-1, 2],

[ 1, 0]

]

This produces:

K = XW_k = [

[-0.12, 1.24],

[ 1.35, -0.43],

[ 0.01, 0.55]

]

shape(K) = [3, 2]

The input rows are still the same rows from X. The role changed because the projection changed. This is the structural reason one hidden-state matrix can produce Q, K, and V tensors with different meanings.
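A short Python sketch makes the "same input, different role" point concrete. The `project` helper is a hypothetical name for this illustration. It rounds to two decimals so the printed values match the page.

```python
def project(rows, weight):
    """Apply one weight matrix to every row: [n, m] x [m, p] -> [n, p]."""
    return [
        [
            round(sum(x * weight[k][j] for k, x in enumerate(row)), 2)
            for j in range(len(weight[0]))
        ]
        for row in rows
    ]

X = [
    [0.12, -0.34, 0.56, 0.78],
    [0.91, 0.23, -0.67, 0.45],
    [-0.11, 0.89, 0.33, -0.55],
]
W_q = [[1, 0], [0, 1], [1, -1], [0, 2]]
W_k = [[0, 1], [1, 0], [-1, 2], [1, 0]]

# The same X goes in both times; only the weight matrix differs.
Q = project(X, W_q)
K = project(X, W_k)

print(Q)  # [[0.68, 0.66], [0.24, 1.8], [0.22, -0.54]]
print(K)  # [[-0.12, 1.24], [1.35, -0.43], [0.01, 0.55]]
```

X is never modified. Q and K are two separate views of the same three token rows.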

Step 5: Shape Reasoning Predicts the Attention Score Matrix

Since Q has shape [3, 2] and K^T has shape [2, 3], the shared inner dimension 2 is summed away and the score matrix has shape [3, 3]:

S = QK^T

shape(S) = [3, 3]

Why [3, 3]? Because every token position is compared against every token position. If you want the actual toy numbers, they are:

S ≈ [

[ 0.7368, 0.6342, 0.3698],

[ 2.2032, -0.4500, 0.9924],

[-0.6960, 0.5292, -0.2948]

]

But the deeper lesson is that you did not need the exact values to predict the structure. The shape alone already told you what kind of object this had to be: token-to-token comparison scores.
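If you do want to confirm the toy numbers, the sketch below computes every entry of S as a dot product between one query row and one key row. It starts from the rounded Q and K values printed earlier on this page.

```python
Q = [[0.68, 0.66], [0.24, 1.80], [0.22, -0.54]]
K = [[-0.12, 1.24], [1.35, -0.43], [0.01, 0.55]]

# S[i][j] is the dot product of query row i with key row j,
# which is exactly what the matrix product Q K^T computes.
S = [[sum(q * k for q, k in zip(q_row, k_row)) for k_row in K] for q_row in Q]

for row in S:
    print([round(v, 4) for v in row])
# [0.7368, 0.6342, 0.3698]
# [2.2032, -0.45, 0.9924]
# [-0.696, 0.5292, -0.2948]
```

The nested loop structure mirrors the shape: three query rows on the outside, three key rows on the inside, so nine scores in total.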

Step 6: Value Projection and Weighted Combination

The score matrix tells the model how strongly each token position should attend to every position, itself included. But the scores alone do not carry content; once normalized, they only act as weights. The actual content comes from a third projection: the value projection.

W_v = [

[ 1, -1],

[ 0, 1],

[ 1, 0],

[-1, 1]

]

V = XW_v = [

[-0.10, 0.32],

[-0.21, -0.23],

[ 0.77, 0.45]

]

shape(V) = [3, 2]

Now the attention mechanism normalizes the scores into weights (softmax, covered in the next module) and uses them to take a weighted combination of value rows:

weights = softmax(S), shape [3, 3]

output = weights × V, shape [3, 3] × [3, 2] → [3, 2]

Read that shape trace carefully. Each output row is a weighted mix of all three value rows, where the weights come from that row's attention scores. The result has the same token count (3) and the same feature width as V (2).

This closes the attention loop with pure linear algebra: project to Q and K, compute pairwise scores, normalize, then use those weights to mix the V rows. Every step is a matrix multiply except the softmax, which is a row-wise normalization. Apart from the softmax details that the next module covers, nothing here goes beyond what this module has taught.
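The last two steps can be sketched in plain Python as follows. This is a minimal illustration under two assumptions: softmax is applied independently to each row of S, and the scores are used unscaled, as on this page. Production attention typically also divides the scores by the square root of the key width before the softmax, which this sketch omits.

```python
import math

S = [
    [0.7368, 0.6342, 0.3698],
    [2.2032, -0.4500, 0.9924],
    [-0.6960, 0.5292, -0.2948],
]
V = [[-0.10, 0.32], [-0.21, -0.23], [0.77, 0.45]]

def softmax(row):
    """Turn one row of scores into positive weights that sum to 1."""
    row_max = max(row)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - row_max) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

weights = [softmax(row) for row in S]  # shape [3, 3]

# Each output row mixes all three value rows using that row's weights.
output = [
    [sum(w * v_row[j] for w, v_row in zip(w_row, V)) for j in range(len(V[0]))]
    for w_row in weights
]

print(len(output), len(output[0]))  # 3 2
```

Because each row of weights is positive and sums to 1, every output coordinate lies between the smallest and largest value in the matching column of V.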

What Stayed the Same, What Changed

Stage	What stayed the same	What changed
embedding	token count = 3	IDs become 4-dimensional vectors
projection	same 3 token rows	feature width changes from 4 to 2
score matrix	still about the same 3 token positions	representation becomes token-by-token comparisons, shape [3, 3]
weighted combination	still 3 token positions	each row is now a mix of value content, weighted by attention, shape [3, 2]

What to Memorize vs What to Narrate

Do not try to memorize every numeric toy value on this page. That is not the skill.

Memorize

vectors are ordered; embeddings are row lookups; matrix multiply applies one learned map to many rows; projections change feature space; attention scores compare token positions against token positions; attention weights mix value rows to produce each token's output.

Narrate

[3, 4] = three token rows, four features each. [3, 2] = the same rows in a new projected space. [3, 3] = every token compared against every token. [3, 2] after weighted mix = each token's attended summary in the value space.

Worked Practice with Solutions

Worked Solution Exercise 1

If X has shape [3, 4] and W has shape [4, 2], what is the output shape?

The answer is [3, 2].

Reason it through in words, not just symbols: there are still 3 token rows, and each row is now expressed in a 2-feature output space. Matrix multiply changed the feature width, not the number of token rows.

Worked Solution Exercise 2

If Q and K both have shape [3, 2], why is the score matrix [3, 3]?

The answer is that the score matrix compares every token position against every token position.

There are 3 query positions and 3 key positions, so the result must be a 3-by-3 grid of pairwise comparisons. The feature width 2 is the shared inner dimension used to compute each score, not the size of the output grid.
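The reasoning in Exercises 1 and 2 can be captured in a tiny shape checker. `matmul_shape` is a hypothetical helper written for this page. It looks only at shapes, never at values.

```python
def matmul_shape(a_shape, b_shape):
    """Return the shape of A @ B, or raise if the inner dimensions differ."""
    rows, inner_a = a_shape
    inner_b, cols = b_shape
    if inner_a != inner_b:
        raise ValueError(f"inner dimensions differ: {inner_a} vs {inner_b}")
    # The shared inner dimension is summed away; the outer dimensions survive.
    return (rows, cols)

print(matmul_shape((3, 4), (4, 2)))  # (3, 2)  Exercise 1: X times W
print(matmul_shape((3, 2), (2, 3)))  # (3, 3)  Exercise 2: Q times K^T
```

Passing mismatched shapes such as (3, 4) and (3, 2) raises a ValueError, which is the same complaint a tensor library would make about an invalid multiply.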

Worked Solution Exercise 3

If a score matrix has shape [32, 32], how many pairwise token scores does it contain?

The answer is 32 × 32 = 1,024.

This is a good shape-to-performance habit. Before you even benchmark anything, the shape already tells you there are 1,024 pairwise comparisons in that one matrix.
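As a quick sketch of that habit, the function below counts the entries of a square score matrix for a few sequence lengths. The name `pairwise_score_count` and the chosen lengths are illustrative only.

```python
def pairwise_score_count(num_tokens):
    """Number of entries in a [num_tokens, num_tokens] score matrix."""
    return num_tokens * num_tokens

for n in (3, 32, 64):
    print(n, pairwise_score_count(n))
# 3 9
# 32 1024
# 64 4096
```

Doubling the token count from 32 to 64 quadruples the number of scores, which is why the shape of this matrix is worth reading before any benchmark.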

Worked Solution Exercise 4

If the same hidden-state matrix X produces Q and K through different projections, what actually changed?

The input rows stayed the same, but the learned projection matrix changed.

That means the token positions are being re-expressed in a different learned feature space. The operational role changes because the projection changes, not because the original token rows suddenly became different objects.

Worked Solution Exercise 5

If weights (from softmax) have shape [3, 3] and V has shape [3, 2], what is the output shape and what does each row mean?

The output shape is [3, 2]: three token positions, each with 2 features.

Each output row is a weighted combination of all three value rows. The weights come from one row of the attention weight matrix — so output row 0 is the mix of V rows that token position 0 decided to attend to.

Bridge to the Next Module

Linear algebra got us to a crucial point: we can now build token representations, compare them, and transform them through learned maps. But none of that yet answers how the model chooses the next token.

The next module picks up from there. It explains logits, softmax, and sampling: how raw scores over the vocabulary become actual token choices. So the story continues naturally:

discrete IDs → embeddings → hidden states → projections / scores → logits → probabilities → next token

Code Preview: What This Looks Like in Practice

You now know enough to read the structure of a real model builder. Here is a simplified view of how a Gemma transformer layer is constructed in llama.cpp. You do not need to understand every argument — just match the function calls to the concepts you have learned:

// For each layer in the model:

cur = build_norm(inpL, ...);           // normalize before attention

Q = build_lora_mm(wq, cur);          // projection: hidden → query space
K = build_lora_mm(wk, cur);          // projection: hidden → key space
V = build_lora_mm(wv, cur);          // projection: hidden → value space

cur = build_attn(..., Q, K, V, ...);  // Q·Kᵀ → scores → softmax → weighted V

sa_out = ggml_add(cur, inpL);        // residual connection

cur = build_norm(sa_out, ...);         // normalize before FFN
cur = build_ffn(cur, ...);              // expand → activate → contract

output = ggml_add(cur, sa_out);     // residual connection

Based on ggml-org/llama.cpp @ 94ca829b — src/models/gemma.cpp. Simplified for this preview.

Every line maps to something you have already learned. build_lora_mm is a matrix multiply (a learned projection). ggml_add is elementwise addition (a residual connection). build_attn chains together the Q·K^T scores, softmax, and V-weighted sum you just traced by hand.

You will read the full, unabridged version of this code in the M8 case studies. For now, the point is: model code is not alien. It is the operations you already know, called in the order you already understand.

Micro Code-Reading Lab

Match each line from the code preview above to the concept it implements. Try before reading the answers.

1. build_norm(inpL, ...)

2. build_lora_mm(wq, cur)

3. build_attn(..., Q, K, V, ...)

4. ggml_add(cur, inpL)

5. build_ffn(cur, ...)

For each line, identify: RMSNorm, Q projection, attention (score + softmax + V-sum), residual connection, or FFN (expand + activate + contract).

Answers

1. build_norm → RMSNorm — normalize before the attention sub-layer

2. build_lora_mm(wq, cur) → Q projection — multiply hidden state by W_Q

3. build_attn → Attention — Q·K^T scores, softmax, weighted V sum, output projection

4. ggml_add → Residual connection — add attention output back to input

5. build_ffn → FFN — expand to d_ff, apply activation, contract back to d_model