L10

Tensor Shapes Are the Language of Model Code

12 min

Question

Why are shapes the language of model code?

Intuition

When you read model code, you will not see the actual numbers flowing through the model. You will see shapes: how many dimensions a tensor has, how large each dimension is, and which axis means what.

If you can track shapes, you can follow the code. If you cannot, the code is opaque. Shape reasoning is the single most useful skill for reading model implementations.

The practical trick is to narrate every axis in plain language. Do not just memorize [32, 2048]. Say: "32 tokens, each with 2048 features." That verbal translation is what makes code readable.

This is the lesson where the earlier pages lock together. Scalars, vectors, dot products, matrices, and projections all become manageable once you can look at a tensor shape and say what each axis means. Without that skill, model code feels like a blur of operators. With it, the code starts to read like a dataflow graph.

Toy Example

Tensor shapes introduced so far:

[n_tokens, d_model]: hidden states, one vector per token

[|V|, d_model]: embedding table, one vector per vocabulary entry

[d_model, d_out]: a weight matrix, used in projections

A tiny shape trace

3 token IDs → embedding lookup → hidden states [3, 4]

hidden [3, 4] × W_q [4, 2] → Q [3, 2]

Q [3, 2] × Kᵀ [2, 3] → scores [3, 3]

That is the real payoff of shape reasoning: you can follow a pipeline without knowing the actual numeric values.
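To make that concrete, here is a minimal sketch that follows the tiny trace using shapes alone, with no tensor values anywhere. The helper names (lookup_shape, matmul_shape, transpose_shape) and the vocabulary size of 10 are hypothetical, chosen only for illustration.

```python
def lookup_shape(ids_shape, table_shape):
    """Shape of an embedding lookup: one table row per token ID."""
    (n_tokens,) = ids_shape
    _vocab_size, d_model = table_shape
    return (n_tokens, d_model)


def matmul_shape(a_shape, b_shape):
    """Shape of a 2D matrix product; the inner dimensions must agree."""
    rows, inner_a = a_shape
    inner_b, cols = b_shape
    if inner_a != inner_b:
        raise ValueError(f"inner dimensions differ: {inner_a} vs {inner_b}")
    return (rows, cols)


def transpose_shape(shape):
    """Shape after swapping the two axes of a 2D tensor."""
    rows, cols = shape
    return (cols, rows)


vocab_size = 10  # hypothetical; the trace does not depend on it

hidden = lookup_shape((3,), (vocab_size, 4))   # (3, 4)
q = matmul_shape(hidden, (4, 2))               # (3, 2)
k = matmul_shape(hidden, (4, 2))               # (3, 2)
scores = matmul_shape(q, transpose_shape(k))   # (3, 3)

print(hidden, q, k, scores)
```

Passing mismatched shapes, such as matmul_shape((3, 4), (3, 2)), raises an error, which is the same check you do in your head when reading model code.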

Shapes

Axis 0 (first dimension): usually n_tokens — the sequence axis

Axis 1 (second dimension): usually a feature dimension — d_model, d_head, d_ff, |V|

Convention: this course always writes 2D shapes as [n_tokens, features], with tokens as rows and features as columns.

A Shape-Reading Workflow

When you hit a tensor in model code, use the same four-step workflow every time:

1. State the raw shape.
2. Name each axis in plain English.
3. Ask what operation produced it.
4. Ask what operation will consume it next.

That simple routine prevents most early confusion. You stop seeing tensors as anonymous arrays and start seeing them as typed pieces of the computation: hidden states, projections, scores, probabilities, caches.
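One way to practice the routine is to write the four answers down as a small record. The sketch below is only a note-taking aid; the TensorNote class and its field names are hypothetical and do not come from any library.

```python
from dataclasses import dataclass


@dataclass
class TensorNote:
    """A reading note for one tensor, following the four steps."""
    name: str
    shape: tuple          # step 1: the raw shape
    axes: tuple           # step 2: one plain-English name per axis
    produced_by: str      # step 3: the operation that created it
    consumed_by: str      # step 4: the operation that uses it next

    def narrate(self) -> str:
        if len(self.shape) != len(self.axes):
            raise ValueError("every axis needs exactly one name")
        parts = [f"{size} {label}" for size, label in zip(self.shape, self.axes)]
        return (f"{self.name} {list(self.shape)}: " + " x ".join(parts)
                + f"; produced by {self.produced_by}, consumed by {self.consumed_by}")


note = TensorNote(
    name="Q",
    shape=(3, 2),
    axes=("token positions", "query features"),
    produced_by="hidden x W_q",
    consumed_by="Q x K^T",
)
print(note.narrate())
```

The length check in narrate enforces the habit: a tensor whose axes you cannot all name is one you have not finished reading.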

Math

No new formulas. The skill here is reading shapes:

2D tensor [a, b]: a rows, b columns
3D tensor [a, b, c]: a matrices stacked together, each of shape [b, c]
The token axis tells you "how many tokens"
The feature axis tells you "how rich is each token's representation"

When in doubt, ask two questions: 1. which axis counts tokens? 2. which axis counts features or channels? If you can answer both, you can usually keep reading.

3D Tensors and Multi-Head Shapes

So far almost every tensor has been 2D: rows and columns. Real model code adds a third dimension almost immediately, because attention uses multiple heads.

A 3D tensor [a, b, c] is a stack of a matrices, each of shape [b, c]. In multi-head attention, the common shape is:

[n_heads, n_tokens, d_head]

Each head has its own [n_tokens, d_head] matrix of projected token vectors.

Worked example: 2 heads, 3 tokens, head dimension 2

Q has shape [2, 3, 2]

Head 0: [[0.68, 0.66], [0.24, 1.80], [0.22, -0.54]]

Head 1: [[0.41, -0.12], [0.55, 0.90], [-0.33, 0.71]]

Each head sees the same token positions but in a different learned subspace. Head 0 and Head 1 run the same dot-product attention logic independently, then their results are concatenated and projected back.
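The per-head independence can be checked directly. The sketch below uses NumPy and the Q values from the worked example. K is not given in the text, so its values here are made up. Only the raw dot-product scores are computed, without scaling, softmax, or the final concatenation.

```python
import numpy as np

# Q from the worked example: [n_heads=2, n_tokens=3, d_head=2]
Q = np.array([
    [[0.68, 0.66], [0.24, 1.80], [0.22, -0.54]],    # head 0
    [[0.41, -0.12], [0.55, 0.90], [-0.33, 0.71]],   # head 1
])

# Hypothetical K with the same shape; values are made up for illustration.
K = np.random.default_rng(0).normal(size=Q.shape)

# Indexing the first axis picks out one head's [n_tokens, d_head] matrix.
assert Q[0].shape == (3, 2)

# Swap the last two axes of K, then multiply. NumPy treats the leading axis
# as a batch axis, so each head gets its own [3, 3] score matrix.
scores = Q @ K.transpose(0, 2, 1)    # [2, 3, 2] @ [2, 2, 3] -> [2, 3, 3]

# The batched result matches running each head separately.
for h in range(Q.shape[0]):
    assert np.allclose(scores[h], Q[h] @ K[h].T)

print(scores.shape)  # (2, 3, 3)
```

Read the result as: 2 heads, each holding a 3 by 3 table of token-versus-token scores.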

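The reshape from 2D to 3D is where many learners first get confused. Concretely, the model computes Q with shape [n_tokens, n_heads × d_head], splits the feature axis to get [n_tokens, n_heads, d_head], and then swaps the first two axes to reach [n_heads, n_tokens, d_head]. No numbers change; only how they are grouped and ordered changes. A single direct reshape to [n_heads, n_tokens, d_head] would give the right shape but file numbers under the wrong head and token. When you see a reshape or transpose like this in code, stop and restate each axis before continuing.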
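A common way to carry out this split in NumPy is a reshape followed by an axis swap. The sketch below fills Q with the integers 0 to 11 so each number can be followed by eye. It assumes each token row stores head 0's features first and head 1's features next, which is a typical layout but not the only possible one.

```python
import numpy as np

n_tokens, n_heads, d_head = 3, 2, 2

# Q_flat has shape [3, 4]: one row per token, head 0's features then head 1's.
Q_flat = np.arange(n_tokens * n_heads * d_head).reshape(n_tokens, n_heads * d_head)

# Step 1: split the feature axis into (head, feature-within-head).
Q_split = Q_flat.reshape(n_tokens, n_heads, d_head)   # [3, 2, 2]

# Step 2: move the head axis to the front.
Q_heads = Q_split.transpose(1, 0, 2)                  # [2, 3, 2]

print(Q_heads[0].tolist())   # [[0, 1], [4, 5], [8, 9]]   head 0, all tokens
print(Q_heads[1].tolist())   # [[2, 3], [6, 7], [10, 11]] head 1, all tokens

# A direct reshape to [n_heads, n_tokens, d_head] has the same shape
# but assigns numbers to the wrong (head, token) slots.
Q_wrong = Q_flat.reshape(n_heads, n_tokens, d_head)
print(Q_wrong[0].tolist())   # [[0, 1], [2, 3], [4, 5]]
```

Both results have shape [2, 3, 2], so the shape alone does not catch the mistake. Only restating what each axis means does.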

Broadcasting: A Brief Heads-Up

In real model code you will sometimes see operations between tensors that do not have identical shapes — for example, adding a bias vector of shape [d_out] to a matrix of shape [n_tokens, d_out]. The framework automatically "broadcasts" the smaller tensor, repeating it along the missing axis so the shapes align.

This course does not rely on broadcasting for any core explanation, but recognizing it in code prevents a common source of shape confusion. The rule of thumb, as NumPy and PyTorch apply it: line the two shapes up from the last axis backward; wherever one operand is missing an axis or has size 1 there, the framework repeats that operand along that axis. If the shapes still do not match after that, the operation is invalid.
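The bias example can be tried in NumPy, whose broadcasting behavior is representative of the major array frameworks. The values below are made up. The point is which shapes are accepted and which are rejected.

```python
import numpy as np

n_tokens, d_out = 3, 2
projected = np.zeros((n_tokens, d_out))     # [3, 2]
bias = np.array([10.0, 20.0])               # [2], one value per output feature

# bias is treated as [1, 2] and repeated along the token axis.
out = projected + bias                      # [3, 2]
print(out.shape)   # (3, 2)

# NumPy can report the broadcast result shape without doing the addition.
print(np.broadcast_shapes((n_tokens, d_out), (d_out,)))   # (3, 2)

# Shapes that cannot be aligned are rejected.
try:
    projected + np.ones(n_tokens)           # [3, 2] + [3]: last axes are 2 vs 3
except ValueError as err:
    print("invalid broadcast:", err)
```

Every token row receives the same bias, which matches the plain-language reading: one bias value per output feature, shared across tokens.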

Three Progressive Shape Traces

Trace 1: embedding

token_ids: [3]

embedding_table: [|V|, 4]

hidden: [3, 4]

Read it as: three token positions, each now represented by a 4-dimensional vector.
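In array terms, an embedding lookup is row indexing. A minimal NumPy sketch, assuming a vocabulary of 10 entries and made-up token IDs:

```python
import numpy as np

vocab_size, d_model = 10, 4   # hypothetical toy sizes
embedding_table = np.arange(vocab_size * d_model, dtype=float).reshape(vocab_size, d_model)

token_ids = np.array([7, 1, 4])        # [3], made-up IDs
hidden = embedding_table[token_ids]    # [3, 4]: rows 7, 1 and 4 of the table

print(hidden.shape)   # (3, 4)

# Position 0 in the sequence holds the table row for token ID 7.
assert np.array_equal(hidden[0], embedding_table[7])
```

The vocabulary size disappears from the result: the output has one row per token position, not one row per vocabulary entry.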

Trace 2: projection

hidden: [3, 4]

W_q: [4, 2]

Q: [3, 2]

Read it as: the same three token positions, now each expressed in a 2-dimensional query space.

Trace 3: attention scores

Q: [3, 2]

K: [3, 2]

scores: [3, 3]

Read it as: every token position compared against every token position.
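The "every position against every position" reading can be verified entry by entry. This sketch uses made-up Q and K values and computes raw dot products only.

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 2))   # made-up values
K = rng.normal(size=(3, 2))

scores = Q @ K.T              # [3, 2] @ [2, 3] -> [3, 3]

# Entry [i, j] is the dot product of token i's query with token j's key.
for i in range(3):
    for j in range(3):
        assert np.isclose(scores[i, j], np.dot(Q[i], K[j]))

print(scores.shape)   # (3, 3)
```

Both axes of the score matrix count tokens: rows index the token asking, columns index the token being compared against.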

Running Example: One Full Mini Trace

Put the whole mini sequence together one last time:

token_ids: [3]

embedding lookup → hidden: [3, 4]

hidden × W_q [4, 2] → Q: [3, 2]

hidden × W_k [4, 2] → K: [3, 2]

Q K^T → scores: [3, 3]

If you can read those five lines comfortably, you have crossed the main threshold this module is trying to teach. You no longer need to see the actual floating-point values to understand what the code is doing structurally.
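The same five lines can be run end to end while recording only the shapes. This is a sketch with random weights and a hypothetical vocabulary size of 10. The function name mini_trace is made up for this example.

```python
import numpy as np


def mini_trace(token_ids, embedding_table, W_q, W_k):
    """Run the five-line trace and record the shape after each step."""
    shapes = {"token_ids": token_ids.shape}
    hidden = embedding_table[token_ids]
    shapes["hidden"] = hidden.shape
    Q = hidden @ W_q
    shapes["Q"] = Q.shape
    K = hidden @ W_k
    shapes["K"] = K.shape
    scores = Q @ K.T
    shapes["scores"] = scores.shape
    return shapes


rng = np.random.default_rng(0)
shapes = mini_trace(
    token_ids=np.array([7, 1, 4]),
    embedding_table=rng.normal(size=(10, 4)),  # hypothetical |V| = 10
    W_q=rng.normal(size=(4, 2)),
    W_k=rng.normal(size=(4, 2)),
)
for name, shape in shapes.items():
    print(f"{name}: {list(shape)}")
```

Changing the random seed changes every number inside the tensors and none of the printed shapes.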

Common Shape Mistakes

The most common beginner error is to track numbers but not meanings. A learner memorizes that [32, 64] × [64, 32] → [32, 32] and still does not understand what happened. The cure is always the same: narrate the axes.

Another mistake is to let the same number drift semantically. For example, 64 might mean a feature width on one line and a token count on the next. Shapes only help if you keep the meaning of each axis attached to the number.

The final common mistake is to ignore reshapes and transposes. In real implementations, a tensor may carry the same numeric contents while changing layout or axis meaning. When a reshape happens, stop and restate the axes before continuing.
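A small NumPy example shows why reshapes and transposes deserve that pause: both can produce the same shape from the same numbers while arranging them differently.

```python
import numpy as np

x = np.array([[1, 2, 3],
              [4, 5, 6]])        # [2, 3]

transposed = x.T                 # [3, 2]: rows and columns swap roles
reshaped = x.reshape(3, 2)       # [3, 2]: same reading order, regrouped in pairs

print(transposed.tolist())   # [[1, 4], [2, 5], [3, 6]]
print(reshaped.tolist())     # [[1, 2], [3, 4], [5, 6]]
```

The two results agree on shape and on the set of values, yet their rows mean different things. A shape check alone cannot tell them apart.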

Do It Yourself

Practice translating these into English:

[3, 4]: three token positions, each with four features.

[3, 2]: the same three token positions, now in a two-feature projected space.

[3, 3]: every token position compared against every token position.

Implementation Hook

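In llama.cpp code, tensors carry explicit shape metadata. When you see a tensor being created or reshaped, reading the dimensions tells you how much data it holds and how it is organized; what each axis means you still work out from the surrounding operations. Real code may use transposed or permuted layouts, and may list dimensions in a different order than this course does, but the pedagogical shapes in this course always use [n_tokens, features].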

The useful habit is: when a tensor is created, renamed, or reshaped, stop and restate what each axis now means. That is the fastest way to stop code from becoming a blur of operators.

This is especially valuable in systems work. Performance discussions often sound abstract until you connect them back to shapes: how many token rows, how many feature columns, how much reuse, how many pairwise comparisons, what changes between prefill and decode.

Performance Hook

Shapes determine compute cost. A [1024, 4096] × [4096, 4096] multiply is 4× the work of a [256, 4096] × [4096, 4096] multiply, because it has 4× as many token rows and everything else is equal. When reasoning about performance, start with shapes.
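The 4× figure follows from a standard back-of-the-envelope count: an [m, k] × [k, n] multiply performs about m·k·n multiply-adds. The sketch below counts each multiply-add as 2 floating-point operations, which is a common convention, and the function name is hypothetical.

```python
def matmul_flops(a_shape, b_shape):
    """Approximate FLOPs for [m, k] x [k, n], counting a multiply-add as 2."""
    m, k = a_shape
    k2, n = b_shape
    if k != k2:
        raise ValueError("inner dimensions must match")
    return 2 * m * k * n


big = matmul_flops((1024, 4096), (4096, 4096))
small = matmul_flops((256, 4096), (4096, 4096))
print(big // small)   # 4
```

This estimate ignores memory traffic and hardware effects, so treat it as a first-order guide to relative cost, not a runtime prediction.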

Shapes also explain why attention is special: projections scale linearly with token count, but score matrices scale with token-count squared because they compare token positions against one another.
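The difference in scaling shows up when you double the token count in a rough operation estimate. The sizes below (d_model = 256, d_head = 64) are illustrative assumptions, and the estimate covers one projection and one head's score matrix only.

```python
def projection_flops(n_tokens, d_model, d_head):
    """[n_tokens, d_model] x [d_model, d_head]: linear in n_tokens."""
    return 2 * n_tokens * d_model * d_head


def score_flops(n_tokens, d_head):
    """[n_tokens, d_head] x [d_head, n_tokens]: quadratic in n_tokens."""
    return 2 * n_tokens * d_head * n_tokens


d_model, d_head = 256, 64   # illustrative sizes
for n in (256, 512, 1024):
    print(n, projection_flops(n, d_model, d_head), score_flops(n, d_head))
```

Each doubling of n doubles the projection cost and quadruples the score cost.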

In other words, shape reasoning is already performance reasoning in disguise. Before you open a profiler, the shapes often tell you which operations are likely to dominate.

Shape Trace Exercise

Trace the shapes through a small pipeline. Fill in every dimension. Think before you type — narrate each step like you practiced above.

Embedding → Projection → Attention Scores

Token IDs (5 tokens): [ _ ]

→ After embedding lookup (d_model = 256): [ _, _ ]

→ Q projection (d_head = 64): [ _, _ ]

→ K projection (d_head = 64): [ _, _ ]

→ Attention scores Q·Kᵀ: [ _, _ ]

Check Yourself

Q1 (shape)

For a sequence of hidden states shaped [32, 2048], which axis represents the token positions?

Axis 0 (32) = tokens, Axis 1 (2048) = features
Axis 0 (32) = features, Axis 1 (2048) = tokens
Cannot tell without more context

Q2 (shape)

If Q has shape [32, 64] and K has shape [32, 64], what is the attention score shape?

[32, 64]
[64, 64]
[32, 32]

Q3 (math)

If the score matrix shape is [32, 32], how many pairwise token scores does it contain?

Q4 (conceptual)

A tensor with shape [8, 32, 64] represents multi-head Q vectors. What does each axis mean?

8 heads, 32 token positions, 64 features per head
8 tokens, 32 heads, 64 features per head
8 batches, 32 features, 64 tokens