Tensor Shapes Are the Language of Model Code
When you read model code, you will not see the actual numbers flowing through the model. You will see shapes: how many dimensions a tensor has, how large each dimension is, and which axis means what.
If you can track shapes, you can follow the code. If you cannot, the code is opaque. Shape reasoning is the single most useful skill for reading model implementations.
The practical trick is to narrate every axis in plain language. Do not just memorize [32, 2048]. Say: "32 tokens, each with 2048 features." That verbal translation is what makes code readable.
This is the lesson where the earlier pages are supposed to lock together. Scalars, vectors, dot products, matrices, and projections all become manageable once you can look at a tensor shape and say what each axis means. Without that skill, model code feels like a blur of operators. With it, the code starts to read like a dataflow graph.
Tensor shapes introduced so far:
A tiny shape trace
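One way to practice this is to compute only the shapes and never touch a value. The sketch below does that in plain Python. The helpers `matmul_shape` and `transpose_shape` and the sizes (32 tokens, 2048 features, a 64-wide query/key space) are illustrative assumptions, not taken from any particular model.

```python
def matmul_shape(a, b):
    """Return the shape of a 2D matrix product [m, k] x [k, n] -> [m, n]."""
    (m, k), (k2, n) = a, b
    if k != k2:
        raise ValueError(f"inner dimensions differ: {a} x {b}")
    return (m, n)


def transpose_shape(a):
    """Swap the two axes of a 2D shape."""
    rows, cols = a
    return (cols, rows)


hidden = (32, 2048)   # 32 tokens, each with 2048 features
w_q = (2048, 64)      # projects 2048 features into a 64-wide query space
w_k = (2048, 64)      # same, for keys

q = matmul_shape(hidden, w_q)                  # 32 tokens, 64 query features
k = matmul_shape(hidden, w_k)                  # 32 tokens, 64 key features
scores = matmul_shape(q, transpose_shape(k))   # 32 tokens compared with 32 tokens

print("hidden:", hidden)
print("q:     ", q)
print("k:     ", k)
print("scores:", scores)
```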
That is the real payoff of shape reasoning: you can follow a pipeline without knowing the actual numeric values.
When you hit a tensor in model code, use the same four-step workflow every time:
- State the raw shape.
- Name each axis in plain English.
- Ask what operation produced it.
- Ask what operation will consume it next.
That simple routine prevents most early confusion. You stop seeing tensors as anonymous arrays and start seeing them as typed pieces of the computation: hidden states, projections, scores, probabilities, caches.
No new formulas. The skill here is reading shapes:
- 2D tensor [a, b]: a rows, b columns
- 3D tensor [a, b, c]: a groups of [b, c] matrices
- The token axis tells you "how many tokens"
- The feature axis tells you "how rich is each token's representation"
When in doubt, ask two questions:
1. Which axis counts tokens?
2. Which axis counts features or channels?
If you can answer both, you can usually keep reading.
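The narration habit can be made mechanical. The following is a minimal Python sketch. `narrate` is a hypothetical helper written for this lesson, not part of any library, and the axis names are whatever you decide they mean.

```python
def narrate(shape, axis_names):
    """Pair each dimension size with a plain-language axis name."""
    if len(shape) != len(axis_names):
        raise ValueError("need exactly one name per axis")
    return ", ".join(f"{size} {name}" for size, name in zip(shape, axis_names))


print(narrate([32, 2048], ["tokens", "features per token"]))
# 32 tokens, 2048 features per token
print(narrate([3, 3], ["query positions", "key positions"]))
# 3 query positions, 3 key positions
```

Being forced to supply one name per axis is the point: if you cannot name an axis, you have found the spot where your understanding of the code stops.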
So far every tensor has been 2D: rows and columns. Real model code adds a third dimension almost immediately, because attention uses multiple heads.
A 3D tensor [a, b, c] is a stack of matrices: a of them, each with shape [b, c]. In multi-head attention,
the common shape is [n_heads, n_tokens, d_head]: one [n_tokens, d_head] matrix per head.
Worked example: 2 heads, 3 tokens, head dimension 2
Each head sees the same token positions but in a different learned subspace. Head 0 and Head 1 run the same dot-product attention logic independently, then their results are concatenated and projected back.
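The step from 2D to 3D is where many learners first get confused. Concretely, the model computes Q with shape
[n_tokens, n_heads × d_head], splits the feature axis to get [n_tokens, n_heads, d_head], and then moves the head axis to the front to get [n_heads, n_tokens, d_head].
No numbers change; only the grouping and the axis order change. A bare reshape straight to [n_heads, n_tokens, d_head], without the axis swap, would mix values from different tokens into the same head. When you see a reshape or permute like this in code, stop and
restate each axis before continuing.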
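A minimal pure-Python sketch of that step, using the worked example's sizes (2 heads, 3 tokens, head dimension 2). The values in `q` are made up so that each number encodes its token and feature column. `split_heads` is a hypothetical helper, not a framework function.

```python
n_tokens, n_heads, d_head = 3, 2, 2

# q[t][f] = 10 * token index + feature column, so every value is traceable.
# Shape: [n_tokens, n_heads * d_head] = [3, 4]
q = [[10 * t + f for f in range(n_heads * d_head)] for t in range(n_tokens)]


def split_heads(x, n_heads, d_head):
    """[n_tokens, n_heads * d_head] -> [n_heads, n_tokens, d_head].

    Head h takes feature columns h*d_head .. (h+1)*d_head - 1 from every token.
    """
    return [
        [row[h * d_head:(h + 1) * d_head] for row in x]
        for h in range(n_heads)
    ]


heads = split_heads(q, n_heads, d_head)

for h, head in enumerate(heads):
    print(f"head {h}: {head}")
# head 0: [[0, 1], [10, 11], [20, 21]]
# head 1: [[2, 3], [12, 13], [22, 23]]
```

Every value from `q` appears exactly once in `heads`. Each head holds all three tokens, restricted to its own two feature columns.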
In real model code you will sometimes see operations between tensors that do not have identical shapes — for example,
adding a bias vector of shape [d_out] to a matrix of shape [n_tokens, d_out]. The framework
automatically "broadcasts" the smaller tensor, repeating it along the missing axis so the shapes align.
This course does not rely on broadcasting for any core explanation, but recognizing it in code prevents a common source of shape confusion. The rule of thumb: when an axis is missing from one operand, the framework repeats that operand along that axis. If shapes still do not align after that, the operation is invalid.
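The rule of thumb can be written down as a shape-only check. This sketch assumes NumPy-style broadcasting (align shapes from the right; a size-1 or missing axis is repeated), which is a common convention but not the only possible one. `broadcast_shape` is a hypothetical helper.

```python
def broadcast_shape(a, b):
    """Result shape of an elementwise op under NumPy-style broadcasting."""
    a, b = list(a), list(b)
    # Pad the shorter shape with leading 1s so both have the same rank.
    rank = max(len(a), len(b))
    a = [1] * (rank - len(a)) + a
    b = [1] * (rank - len(b)) + b

    result = []
    for da, db in zip(a, b):
        if da == db or db == 1:
            result.append(da)
        elif da == 1:
            result.append(db)
        else:
            raise ValueError(f"cannot broadcast {da} against {db}")
    return result


n_tokens, d_out = 3, 4
print(broadcast_shape([n_tokens, d_out], [d_out]))   # [3, 4]: bias repeated per token

try:
    broadcast_shape([n_tokens, d_out], [n_tokens])   # 4 vs 3 on the last axis
except ValueError as err:
    print("invalid:", err)
```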
Trace 1: embedding
[3] token IDs → embedding lookup → [3, 4]
Read it as: three token positions, each now represented by a 4-dimensional vector.
Trace 2: projection
[3, 4] × [4, 2] → [3, 2]
Read it as: the same three token positions, now each expressed in a 2-dimensional query space.
Trace 3: attention scores
[3, 2] × [2, 3] → [3, 3]
Read it as: every token position compared against every token position.
Put the whole mini sequence together one last time:
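A runnable version of the sequence, as a pure-Python sketch. The embedding table and weight values are arbitrary placeholders chosen only to make the shapes concrete (a vocabulary of 5, 4 features, a 2-wide query/key space). `shape`, `matmul`, and `transpose` are small helpers defined here, not library calls.

```python
def shape(x):
    """Shape of a rectangular nested list."""
    dims = []
    while isinstance(x, list):
        dims.append(len(x))
        x = x[0]
    return dims


def matmul(a, b):
    """Plain 2D matrix product: [m, k] x [k, n] -> [m, n]."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Swap rows and columns of a 2D nested list."""
    return [list(col) for col in zip(*a)]


# Placeholder parameters (values are arbitrary; only the shapes matter here).
embedding_table = [[0.1 * (r + c) for c in range(4)] for r in range(5)]  # [5, 4]
w_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]                   # [4, 2]
w_k = [[0.5, 0.5], [0.5, -0.5], [0.5, 0.5], [0.5, -0.5]]                 # [4, 2]

token_ids = [2, 0, 4]                          # 3 token positions
x = [embedding_table[t] for t in token_ids]    # lookup: one 4-feature row per token
q = matmul(x, w_q)                             # project into the query space
k = matmul(x, w_k)                             # project into the key space
scores = matmul(q, transpose(k))               # every position against every position

print("token_ids:", shape(token_ids))   # [3]
print("x:        ", shape(x))           # [3, 4]
print("q:        ", shape(q))           # [3, 2]
print("k:        ", shape(k))           # [3, 2]
print("scores:   ", shape(scores))      # [3, 3]
```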
If you can read those five lines comfortably, you have crossed the main threshold this module is trying to teach. You no longer need to see the actual floating-point values to understand what the code is doing structurally.
The most common beginner error is to track numbers but not meanings. A learner memorizes that
[32, 64] × [64, 32] → [32, 32] and still does not understand what happened.
The cure is always the same: narrate the axes.
Another mistake is to let the same symbol drift semantically. For example, 64 might mean a feature width on one line and a
token count on the next. Shapes only help if you keep the meaning of each axis attached to the number.
The final common mistake is to ignore reshapes and transposes. In real implementations, a tensor may carry the same numeric contents while changing layout or axis meaning. When a reshape happens, stop and restate the axes before continuing.
Practice translating these into English:
- [3, 4]: three token positions, each with four features.
- [3, 2]: the same three token positions, now in a two-feature projected space.
- [3, 3]: every token position compared against every token position.
In llama.cpp code, tensors carry explicit shape metadata.
When you see a tensor being created or reshaped, reading the dimensions tells you exactly what data it holds.
Real code may use transposed or permuted layouts — but the pedagogical shapes in this course always use [n_tokens, features].
The useful habit is: when a tensor is created, renamed, or reshaped, stop and restate what each axis now means. That is the fastest way to stop code from becoming a blur of operators.
This is especially valuable in systems work. Performance discussions often sound abstract until you connect them back to shapes: how many token rows, how many feature columns, how much reuse, how many pairwise comparisons, what changes between prefill and decode.
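Shapes determine compute cost. A [1024, 4096] × [4096, 4096] multiply is 4× the work of a [256, 4096] × [4096, 4096] multiply, because the token axis is 4× longer (1024 rows instead of 256) and every other dimension is unchanged. When reasoning about performance, start with shapes.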
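The 4× figure follows from a simple count. Assuming the usual convention of 2·m·k·n floating-point operations for an [m, k] × [k, n] product (one multiply and one add per term), a quick check in Python looks like this. `matmul_flops` is a hypothetical helper, and the count ignores memory traffic and hardware effects.

```python
def matmul_flops(a, b):
    """Approximate FLOPs for [m, k] x [k, n]: one multiply and one add per term."""
    (m, k), (k2, n) = a, b
    if k != k2:
        raise ValueError("inner dimensions must match")
    return 2 * m * k * n


big = matmul_flops((1024, 4096), (4096, 4096))
small = matmul_flops((256, 4096), (4096, 4096))

print(big / small)  # 4.0
```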
Shapes also explain why attention is special: projections scale linearly with token count, but score matrices scale with token-count squared because they compare token positions against one another.
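To see the difference in growth, count output entries as the token count doubles. This is a sketch assuming a projected width of 64 (an illustrative number); `entry_counts` is a hypothetical helper.

```python
def entry_counts(n_tokens, d):
    """Entries in a [n_tokens, d] projection and a [n_tokens, n_tokens] score matrix."""
    return n_tokens * d, n_tokens * n_tokens


d = 64  # illustrative width of the projected query/key space (an assumption)

for n_tokens in (128, 256, 512):
    projection, scores = entry_counts(n_tokens, d)
    print(f"{n_tokens} tokens: projection {projection}, scores {scores}")
# 128 tokens: projection 8192, scores 16384
# 256 tokens: projection 16384, scores 65536
# 512 tokens: projection 32768, scores 262144
```

Doubling the token count doubles the projection size but quadruples the score matrix.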
In other words, shape reasoning is already performance reasoning in disguise. Before you open a profiler, the shapes often tell you which operations are likely to dominate.
Trace the shapes through a small pipeline. Fill in every dimension. Think before you type — narrate each step like you practiced above.
For a sequence of hidden states shaped [32, 2048], which axis represents the token positions?
If Q has shape [32, 64] and K has shape [32, 64], what is the attention score shape?
If the score matrix shape is [32, 32], how many pairwise token scores does it contain?
A tensor with shape [8, 32, 64] represents multi-head Q vectors. What does each axis mean?