Matrix Multiplication Without Fear
Matrix multiplication is just many dot products organized into a grid.
If you have a matrix A with m rows and a matrix B with n columns, the result C has m rows and n columns. Each entry C[i][j] is the dot product of row i from A and column j from B.
The one rule: the inner dimensions must match. If A is [m, k] and B is [k, n], the result is [m, n]. If the inner k's don't match, the multiply is invalid.
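The shape rule can be checked before any arithmetic happens. Here is a minimal sketch in plain Python; the helper name matmul_shape is made up for this lesson and is not part of any library.

```python
def matmul_shape(a_shape, b_shape):
    """Return the output shape of A x B, or raise if the inner dimensions differ.

    matmul_shape is a hypothetical helper written for illustration only.
    """
    m, k_a = a_shape  # A is [m, k_a]
    k_b, n = b_shape  # B is [k_b, n]
    if k_a != k_b:
        raise ValueError(f"inner dimensions do not match: {k_a} != {k_b}")
    return (m, n)


print(matmul_shape((2, 3), (3, 2)))  # (2, 2)
```

Calling it with shapes whose inner dimensions differ, such as (2, 3) and (4, 2), raises a ValueError instead of returning a shape.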
A useful way to think about it: each row of A is one input item, and each column of B is one output feature. Matrix multiplication asks, for every input row and every output feature, "how strongly do these line up?"
That last sentence is the real transformer bridge. If the rows of A are token states and the columns of B are learned
output features, then matrix multiply is the operation that applies one learned transformation to an entire token batch at once.
A [2, 3] × B [3, 2] = C [2, 2]:
Row 0 of A is [1, 2, 3] and column 0 of B is [7, 9, 11], so:
C[0,0] = 1×7 + 2×9 + 3×11 = 7 + 18 + 33 = 58
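The same calculation can be sketched in plain Python. Row 0 of A and column 0 of B come from the arithmetic above; the remaining entries are assumed values chosen only to fill out the [2, 3] and [3, 2] shapes, and the matmul function is a teaching sketch rather than an optimized routine.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows (illustrative, not optimized)."""
    m, k, n = len(a), len(b), len(b[0])
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    # Each output cell is the dot product of one row of a and one column of b.
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


A = [[1, 2, 3],
     [4, 5, 6]]    # row 1 is an assumed value
B = [[7, 8],
     [9, 10],
     [11, 12]]     # column 1 is an assumed value

C = matmul(A, B)
print(C[0][0])  # 58
```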
The transpose of a matrix flips it along its diagonal: rows become columns and columns become rows. If A has shape [m, n], then A^T has shape [n, m].
Row 0 of A became column 0 of A^T. Row 1 of A became column 1 of A^T. That is all a transpose does.
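A short sketch makes the row-to-column swap concrete. The matrix values here are assumed for illustration, and transpose is a small helper built on Python's built-in zip.

```python
def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    # zip(*a) groups the first entries of every row, then the second entries, and so on.
    return [list(column) for column in zip(*a)]


A = [[1, 2, 3],
     [4, 5, 6]]          # shape [2, 3], assumed example values
A_T = transpose(A)       # shape [3, 2]

print(A_T)  # [[1, 4], [2, 5], [3, 6]]
```

Reading down column 0 of the result gives 1, 2, 3, which is row 0 of the original matrix.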
Why this matters: later you will see Q K^T in attention. If Q has shape [n, d] and K has shape [n, d], then K^T has shape [d, n]. Now the multiply [n, d] × [d, n] → [n, n] is valid, and the result is a token-by-token comparison matrix. Without the transpose, the inner dimensions would not match.
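To see the shapes work out, here is a small sketch with n = 3 tokens and d = 2 features. The values of Q and K are made up, and the helpers are plain-Python stand-ins for whatever a real library would provide. It only shows the shape bookkeeping, not a full attention computation.

```python
def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(column) for column in zip(*a)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows (illustrative, not optimized)."""
    k = len(b)
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions do not match")
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(len(b[0]))]
            for row in a]


# Assumed toy values: 3 tokens, 2 features each.
Q = [[1, 0], [0, 1], [1, 1]]   # shape [3, 2]
K = [[1, 0], [0, 1], [1, 1]]   # shape [3, 2]

scores = matmul(Q, transpose(K))   # [3, 2] x [2, 3] -> [3, 3]
print(scores)  # [[1, 0, 1], [0, 1, 1], [1, 1, 2]]
```

Entry scores[i][j] is the dot product of token i's row in Q with token j's row in K. Calling matmul(Q, K) directly raises a ValueError, because [3, 2] × [3, 2] has mismatched inner dimensions.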
The full toy example above multiplied two multi-row matrices. But a common special case is multiplying
a single vector (one row) by a weight matrix. The math is identical — just with m = 1.
One hidden-state vector times a weight matrix:
Each output value is one dot product between the input row and one column of W.
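The single-row case can be sketched as follows. The vector x and the weight matrix W hold assumed values, and vec_times_matrix is a hypothetical helper name used only in this lesson.

```python
def vec_times_matrix(x, w):
    """Multiply a single row vector x of length k by a weight matrix w of shape [k, n]."""
    if len(x) != len(w):
        raise ValueError("vector length must equal the number of rows in w")
    # One dot product per column of w.
    return [sum(x[p] * w[p][j] for p in range(len(x))) for j in range(len(w[0]))]


x = [1, 2, 3]        # one hidden-state vector, shape [1, 3] (assumed values)
W = [[1, 0],
     [0, 1],
     [1, 1]]         # weight matrix, shape [3, 2] (assumed values)

y = vec_times_matrix(x, W)
print(y)  # [4, 5]
```

The first output is 1×1 + 2×0 + 3×1 = 4 and the second is 1×0 + 2×1 + 3×1 = 5.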
This matters because during autoregressive decode, the model processes one new token at a time. That means the "matrix" of hidden states is a single row. The math is the same as the full matrix case, but the runtime profile is very different: one row times a wide weight matrix is memory-bound rather than compute-bound, because you read many weight values but reuse each only once.
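A back-of-the-envelope sketch of the reuse argument: count the multiply-adds and divide by the number of weight values read. This is a deliberately simplified model that ignores caches, activation traffic, and data types, and the 4096 dimensions are assumed for illustration rather than taken from any particular model.

```python
def weight_reuse(m, k, n):
    """Multiply-adds performed per weight value read, for [m, k] x [k, n].

    Simplified model: every weight is read exactly once per multiply.
    """
    multiply_adds = m * k * n   # one per (row, inner index, column) triple
    weights_read = k * n        # every entry of the weight matrix
    return multiply_adds // weights_read   # always equals m


# Assumed sizes: a 4096 x 4096 weight matrix.
print(weight_reuse(1, 4096, 4096))    # 1   (decode: one token row)
print(weight_reuse(512, 4096, 4096))  # 512 (a 512-token batch)
```

Under this model each weight does one multiply-add per input row, so a single-row multiply gets the least possible work out of every value it loads.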
The previous lesson taught you how one vector compares to another. Matrix multiplication simply scales that up. Instead of asking for one score, you ask for every output feature for every input row.
In LLM terms, imagine A as a matrix of hidden states shaped [n_tokens, d_model].
Each row is one token vector. Now imagine B as a learned projection matrix shaped [d_model, d_out].
Each column is one output feature the model wants to compute.
Then A × B means: for every token row, compute all of the output features defined by the projection matrix.
This is why matrix multiply shows up everywhere in transformer code. It is the natural way to transform a whole sequence of token vectors in one operation.
Carry forward the tiny hidden-state matrix from the embeddings lesson:
This matrix has shape [3, 4]: three token rows, four features per row.
If we multiply it by a projection matrix W of shape [4, 2], the result has shape [3, 2].
That means every token row gets remapped into a 2-dimensional output space.
The important thing is not just the formula. It is the batch interpretation: one learned matrix is applied to every token row in the sequence.
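The batch interpretation can be checked directly: project the whole [3, 4] matrix at once, then project each row on its own, and compare. The numbers in H and W below are placeholder values, not the actual matrix from the embeddings lesson.

```python
def project_row(row, w):
    """Project one token row of length k through w of shape [k, n]."""
    return [sum(row[p] * w[p][j] for p in range(len(row))) for j in range(len(w[0]))]


def project_all(h, w):
    """Apply the same projection matrix to every token row."""
    return [project_row(row, w) for row in h]


# Placeholder values: 3 token rows, 4 features per row.
H = [[1, 0, 2, 1],
     [0, 1, 1, 0],
     [2, 1, 0, 1]]   # shape [3, 4]
W = [[1, 0],
     [0, 1],
     [1, 1],
     [0, 2]]         # shape [4, 2]

out = project_all(H, W)   # shape [3, 2]
print(out)  # [[3, 4], [1, 2], [2, 3]]
```

Row i of the output depends only on row i of H and on W, which is why the same code handles a sequence of any length.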
So matrix multiplication is not mysterious new math. It is a scheduling pattern for many dot products. The outer grid tells you how many results you need. The inner dimension tells you how much work each result costs.
Beginners often confuse matrix multiplication with elementwise multiplication. They are completely different operations. Elementwise multiplication keeps the same shape and multiplies entries in the same positions. Matrix multiplication forms new outputs by summing across the shared inner dimension.
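Putting the two operations side by side on the same pair of 2 × 2 matrices shows how different they are. The values are assumed, and both functions are plain-Python sketches.

```python
def elementwise(a, b):
    """Multiply entries in matching positions; both inputs must have the same shape."""
    return [[x * y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def matmul(a, b):
    """Sum products across the shared inner dimension (illustrative, not optimized)."""
    k = len(b)
    return [[sum(row[p] * b[p][j] for p in range(k)) for j in range(len(b[0]))]
            for row in a]


A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

print(elementwise(A, B))  # [[5, 12], [21, 32]]
print(matmul(A, B))       # [[19, 22], [43, 50]]
```

The elementwise entry at [0][0] is just 1×5, while the matrix-multiply entry at [0][0] is 1×5 + 2×7 = 19.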
Another common mistake is to memorize the shape rule mechanically without understanding the semantics. Do not stop at
[m, k] × [k, n] → [m, n]. Translate it. Say:
"I have m input rows. Each row has k features. I want n output features for each row."
That verbal narration is what makes the operation useful rather than ceremonial.
Practice the narration pattern:
[3, 4] × [4, 2] → [3, 2] means: 3 input rows, 4 features each, projected into 2 output features each.
[32, 2048] × [2048, 256] → [32, 256] means: 32 token states projected into a 256-dimensional space.
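If it helps to drill the pattern, the narration can be generated mechanically. The narrate function below is a hypothetical study aid, not something found in any library.

```python
def narrate(a_shape, b_shape):
    """Turn a shape pair into the spoken narration (hypothetical study helper)."""
    m, k = a_shape
    k_b, n = b_shape
    if k != k_b:
        raise ValueError(f"inner dimensions do not match: {k} != {k_b}")
    return f"{m} input rows, {k} features each, projected into {n} output features each"


print(narrate((3, 4), (4, 2)))
# 3 input rows, 4 features each, projected into 2 output features each
```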
Matrix multiplication accounts for most of the arithmetic in LLM inference.
In llama.cpp, the ggml_mul_mat function from the ggml library performs this operation.
The model's core operations — projections and transformations you will learn about in later modules — are all matrix multiplies.
When you later read code like hidden states multiplied by W_q or W_up, mentally translate it into: "many row-column dot products happening in parallel."
This translation is worth practicing because it collapses a lot of apparent complexity. A huge amount of transformer code is ultimately "take this batch of token vectors and apply this learned linear map."
Multiplying [m, k] by [k, n] requires m × k × n multiply-add operations. This is why matrix multiply dominates inference compute — the shapes are large and the operation runs for every layer.
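The m × k × n count can be confirmed by walking the three nested loops of a naive multiply and counting one multiply-add per innermost step. This is a sketch for intuition only; real kernels perform the same amount of arithmetic but organize it very differently.

```python
def count_multiply_adds(m, k, n):
    """Count multiply-adds by walking the three loops of a naive matrix multiply."""
    count = 0
    for i in range(m):          # every output row
        for j in range(n):      # every output column
            for p in range(k):  # every term in that cell's dot product
                count += 1
    return count


print(count_multiply_adds(3, 4, 2))  # 24
print(32 * 2048 * 256)               # 16777216
```

The first line matches 3 × 4 × 2 = 24. The second applies the formula directly to the [32, 2048] × [2048, 256] shape, which already costs about 16.8 million multiply-adds for a single projection.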
It is also one of the operations hardware handles best, because the same rows and columns get reused many times. This is why GEMM kernels are so heavily optimized.
That reuse point matters. Matrix multiply is expensive, but it is structured work: rows and columns are reused enough that optimized kernels can amortize memory traffic and keep the arithmetic units busy. This is one reason matrix-heavy phases like prefill behave differently from more memory-sensitive decode paths.
What is the output shape of [4, 3] × [3, 5]?
Can you multiply [2, 4] × [3, 5]?
What does one output cell C[i,j] represent?
How many output cells are in the result of [3, 4] × [4, 2]?
If K has shape [3, 2], what is the shape of K transpose (K^T)?
During autoregressive decode, the model processes one new token. Which best describes the projection step?