L08

Matrix Multiplication Without Fear

18 min

Question

How does matrix multiply work?

Intuition

Matrix multiplication is just many dot products organized into a grid.

If you have a matrix A with m rows and a matrix B with n columns, the result C has m rows and n columns. Each entry C[i][j] is the dot product of row i from A and column j from B.

The one rule: the inner dimensions must match. If A is [m, k] and B is [k, n], the result is [m, n]. If the inner k's don't match, the multiply is invalid.
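The shape rule can be checked before any arithmetic happens. Here is a minimal Python sketch; `matmul_shape` is a hypothetical helper name introduced only for illustration.

```python
def matmul_shape(a_shape, b_shape):
    """Return the output shape of [m, k] x [k, n], or raise if the multiply is invalid."""
    m, k_a = a_shape
    k_b, n = b_shape
    if k_a != k_b:
        raise ValueError(f"inner dimensions differ: {k_a} vs {k_b}")
    return (m, n)


print(matmul_shape((2, 3), (3, 2)))  # (2, 2)
```

Calling `matmul_shape((2, 3), (4, 2))` raises a `ValueError`, because the inner dimensions 3 and 4 do not match.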

A useful way to think about it: each row of A is one input item, and each column of B is one output feature. Matrix multiplication asks, for every input row and every output feature, "how strongly do these line up?"

That last sentence is the real transformer bridge. If the rows of A are token states and the columns of B are learned output features, then matrix multiply is the operation that applies one learned transformation to an entire token batch at once.

Toy Example

A [2, 3] × B [3, 2] = C [2, 2]:

A = [[1, 2, 3],

[4, 5, 6]] shape [2, 3]

B = [[7, 8],

[9, 10],

[11, 12]] shape [3, 2]

C = [[58, 64],

[139, 154]] shape [2, 2]

C[0,0] = 1×7 + 2×9 + 3×11 = 7 + 18 + 33 = 58

C[0,1] = 1×8 + 2×10 + 3×12 = 8 + 20 + 36 = 64

C[1,0] = 4×7 + 5×9 + 6×11 = 28 + 45 + 66 = 139

C[1,1] = 4×8 + 5×10 + 6×12 = 32 + 50 + 72 = 154
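The same four dot products can be reproduced in a few lines of plain Python. This is a teaching sketch that stores matrices as lists of rows, not an optimized kernel; the function name `matmul` is chosen here for illustration.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    k = len(b)  # shared inner dimension
    if any(len(row) != k for row in a):
        raise ValueError("inner dimensions must match")
    n = len(b[0])
    # C[i][j] is the dot product of row i of a and column j of b.
    return [
        [sum(row[t] * b[t][j] for t in range(k)) for j in range(n)]
        for row in a
    ]


A = [[1, 2, 3],
     [4, 5, 6]]
B = [[7, 8],
     [9, 10],
     [11, 12]]

C = matmul(A, B)
print(C)  # [[58, 64], [139, 154]]
```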

Every output cell is one row-column dot product. For C[0,0], row 0 of A interacts with column 0 of B: 1×7 + 2×9 + 3×11 = 58.

Shapes

[m, k] × [k, n] → [m, n]

Inner dimensions (k) must match. They "cancel out."

Output keeps the outer dimensions (m, n).

Transpose: Flipping Rows and Columns

The transpose of a matrix flips it along its diagonal: rows become columns and columns become rows. If A has shape [m, n], then A^T has shape [n, m].

A = [[1, 2, 3],

[4, 5, 6]] shape [2, 3]

A^T = [[1, 4],

[2, 5],

[3, 6]] shape [3, 2]

Row 0 of A became column 0 of A^T. Row 1 of A became column 1 of A^T. That is all a transpose does.
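In plain Python, one way to sketch a transpose is with `zip`, which groups the entries of each column together. The helper name `transpose` is just an illustrative choice.

```python
def transpose(matrix):
    """Turn rows into columns: the entry at [i][j] moves to [j][i]."""
    return [list(column) for column in zip(*matrix)]


A = [[1, 2, 3],
     [4, 5, 6]]

A_T = transpose(A)
print(A_T)  # [[1, 4], [2, 5], [3, 6]]
```

Transposing twice returns the original matrix, which is a quick sanity check for any transpose routine.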

Why this matters: later you will see Q K^T in attention. If Q has shape [n, d] and K has shape [n, d], then K^T has shape [d, n]. Now the multiply [n, d] × [d, n] → [n, n] is valid, and the result is a token-by-token comparison matrix. Without the transpose, the inner dimensions would not match.
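To make the shapes concrete, here is a small sketch with made-up values for Q and K (n = 3 tokens, d = 2 features). The numbers are not taken from any model; only the shapes matter. The helper functions are illustrative and skip shape checking for brevity.

```python
def transpose(matrix):
    """Rows become columns."""
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    """Row-by-column dot products; assumes the inner dimensions already match."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


# Made-up values: 3 tokens, 2 features per token.
Q = [[1, 0],
     [0, 1],
     [1, 1]]
K = [[1, 2],
     [3, 4],
     [5, 6]]

scores = matmul(Q, transpose(K))  # [3, 2] x [2, 3] -> [3, 3]
print(scores)  # [[1, 3, 5], [2, 4, 6], [3, 7, 11]]
```

Entry `scores[i][j]` is the dot product of row i of Q with row j of K, which is why the result has one row and one column per token.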

Matrix-Vector Multiply: The Single-Row Case

The full toy example above multiplied two multi-row matrices. But a common special case is multiplying a single vector (one row) by a weight matrix. The math is identical — just with m = 1.

One hidden-state vector times a weight matrix:

x = [2, -1, 3] shape [1, 3] (one row)

W = [[1, 0],

[0, 1],

[1, -1]] shape [3, 2]

x × W = [5, -4] shape [1, 2]

Each output value is one dot product between the input row and one column of W.
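The single-row case is short enough to write out directly. This sketch uses the x and W from above and computes one dot product per column of W.

```python
x = [2, -1, 3]   # one hidden-state row, shape [1, 3]
W = [[1, 0],
     [0, 1],
     [1, -1]]    # weight matrix, shape [3, 2]

# For each column j of W, sum x[k] * W[k][j] over the shared dimension k.
y = [
    sum(x_k * w_row[j] for x_k, w_row in zip(x, W))
    for j in range(len(W[0]))
]
print(y)  # [5, -4]
```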

This matters because during autoregressive decode, the model processes one new token at a time. That means the "matrix" of hidden states is a single row. The math is the same as the full matrix case, but the runtime profile is very different: one row times a wide weight matrix is typically memory-bound rather than compute-bound, because every weight value has to be read from memory but is used in only one multiply-add.

From One Dot Product to a Whole Token Batch

The previous lesson taught you how one vector compares to another. Matrix multiplication simply scales that up. Instead of asking for one score, you ask for every output feature for every input row.

In LLM terms, imagine A as a matrix of hidden states shaped [n_tokens, d_model]. Each row is one token vector. Now imagine B as a learned projection matrix shaped [d_model, d_out]. Each column is one output feature the model wants to compute.

Then A × B means: for every token row, compute all of the output features defined by the projection matrix. This is why matrix multiply shows up everywhere in transformer code. It is the natural way to transform a whole sequence of token vectors in one operation.

Running Example as a Matrix Multiply

Carry forward the tiny hidden-state matrix from the embeddings lesson:

X = [

[0.12, -0.34, 0.56, 0.78],

[0.91, 0.23, -0.67, 0.45],

[-0.11, 0.89, 0.33, -0.55]

]

This matrix has shape [3, 4]: three token rows, four features per row. If we multiply it by a projection matrix W of shape [4, 2], the result has shape [3, 2]. That means every token row gets remapped into a 2-dimensional output space.
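The lesson does not specify W, so the sketch below uses a hypothetical W with made-up values purely to show the shapes. Only X comes from the running example.

```python
def matmul(a, b):
    """Row-by-column dot products; assumes the inner dimensions already match."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


X = [
    [0.12, -0.34, 0.56, 0.78],
    [0.91, 0.23, -0.67, 0.45],
    [-0.11, 0.89, 0.33, -0.55],
]  # shape [3, 4]: three token rows, four features each

# Hypothetical projection matrix, shape [4, 2]. Values are invented for illustration.
W = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.0],
    [0.0, -0.5],
]

Y = matmul(X, W)
print(len(Y), len(Y[0]))  # 3 2
```

Each row of Y depends only on the matching row of X and on W, so multiplying the rows one at a time would give the same result as the single batched multiply.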

The important thing is not just the formula. It is the batch interpretation: one learned matrix is applied to every token row in the sequence.

Math

C[i,j] = ∑ₖ A[i,k] × B[k,j]

Each output entry is one dot product.

So matrix multiplication is not mysterious new math. It is a scheduling pattern for many dot products. The outer grid tells you how many results you need. The inner dimension tells you how much work each result costs.

Common Misreadings

Beginners often confuse matrix multiplication with elementwise multiplication. They are completely different operations. Elementwise multiplication keeps the same shape and multiplies entries in the same positions. Matrix multiplication forms new outputs by summing across the shared inner dimension.
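A side-by-side sketch on two small square matrices (values chosen only for illustration) shows how different the two operations are, even when the shapes happen to allow both.

```python
A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

# Elementwise: multiply entries in the same position; the shape stays [2, 2].
elementwise = [
    [a * b for a, b in zip(row_a, row_b)]
    for row_a, row_b in zip(A, B)
]

# Matrix multiply: each output sums over the shared inner dimension.
product = [
    [sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
    for row in A
]

print(elementwise)  # [[5, 12], [21, 32]]
print(product)      # [[19, 22], [43, 50]]
```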

Another common mistake is to memorize the shape rule mechanically without understanding the semantics. Do not stop at [m, k] × [k, n] → [m, n]. Translate it. Say: "I have m input rows. Each row has k features. I want n output features for each row." That verbal narration is what makes the operation useful rather than ceremonial.

Do It Yourself

Practice the narration pattern:

[3, 4] × [4, 2] → [3, 2] means: 3 input rows, 4 features each, projected into 2 output features each.

[32, 2048] × [2048, 256] → [32, 256] means: 32 token states projected into a 256-dimensional space.
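If it helps, the narration pattern can be turned into a tiny script to check your own readings against. `narrate` is a hypothetical helper written for this exercise, not part of any library.

```python
def narrate(a_shape, b_shape):
    """Describe [m, k] x [k, n] in words, or raise if the shapes are incompatible."""
    m, k = a_shape
    k_b, n = b_shape
    if k != k_b:
        raise ValueError(f"inner dimensions differ: {k} vs {k_b}")
    return (
        f"[{m}, {k}] x [{k}, {n}] -> [{m}, {n}]: {m} input rows, "
        f"{k} features each, projected into {n} output features each."
    )


print(narrate((3, 4), (4, 2)))
print(narrate((32, 2048), (2048, 256)))
```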

Implementation Hook

Matrix multiplication accounts for most of the arithmetic in LLM inference. In llama.cpp, the ggml_mul_mat function performs this operation. The model's core operations, the projections and transformations you will learn about in later modules, are built on matrix multiplies.

When you later read code like hidden states multiplied by W_q or W_up, mentally translate it into: "many row-column dot products happening in parallel."

This translation is worth practicing because it collapses a lot of apparent complexity. A huge amount of transformer code is ultimately "take this batch of token vectors and apply this learned linear map."

ggml/include/ggml.h — ggml_mul_mat() (L1407)

Performance Hook

Multiplying [m, k] by [k, n] requires m × k × n multiply-add operations. This is why matrix multiply dominates inference compute — the shapes are large and the operation runs for every layer.
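The m × k × n count falls straight out of the three nested loops of a naive implementation. The sketch below counts one operation per inner-loop step; it is deliberately unoptimized and is not how production kernels are written.

```python
def matmul_counting(a, b):
    """Naive triple-loop multiply that also counts multiply-add steps."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    ops = 0
    for i in range(m):          # one pass per output row
        for j in range(n):      # one pass per output column
            for t in range(k):  # walk the shared inner dimension
                c[i][j] += a[i][t] * b[t][j]
                ops += 1
    return c, ops


A = [[1, 2, 3],
     [4, 5, 6]]
B = [[7, 8],
     [9, 10],
     [11, 12]]

C, ops = matmul_counting(A, B)
print(C)    # [[58, 64], [139, 154]]
print(ops)  # 12, which is 2 * 3 * 2
```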

It is also one of the operations hardware handles best, because the same rows and columns get reused many times. This is why GEMM kernels are so heavily optimized.

That reuse point matters. Matrix multiply is expensive, but it is structured work: rows and columns are reused enough that optimized kernels can amortize memory traffic and keep the arithmetic units busy. This is one reason matrix-heavy phases like prefill behave differently from more memory-sensitive decode paths.
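A rough back-of-the-envelope sketch makes the reuse argument concrete. It counts how many multiply-adds each weight value takes part in, under the simplifying assumption that only the [k, n] weight matrix has to be read from memory. Real kernels also deal with caches, activation traffic, and quantized formats, so treat this as an intuition aid rather than a performance model. The function name is hypothetical.

```python
def multiply_adds_per_weight(m, k, n):
    """Multiply-adds performed per weight value for [m, k] x [k, n]."""
    multiply_adds = m * k * n
    weight_values = k * n
    return multiply_adds / weight_values  # simplifies to m


# One token at a time (decode-like): each weight is used once.
print(multiply_adds_per_weight(1, 2048, 256))   # 1.0

# A batch of 32 token rows (prefill-like): each weight is used 32 times.
print(multiply_adds_per_weight(32, 2048, 256))  # 32.0
```

The ratio is simply m, the number of input rows: more rows per multiply means more arithmetic for every weight value that is read.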