M2/Linear Algebra
L05

Scalars and Vectors

What is a vector?

A scalar is a single number. A temperature reading, a count, a score — one number.

A vector is an ordered list of numbers. Not a set — the order matters. A vector of length 4 has exactly 4 numbers in a fixed sequence.

In LLMs, vectors represent many things. Throughout this course you will see them used for token representations and model weights. The "dimension" of a vector is how many numbers it contains. When someone says "the model has dimension 2048," they mean each hidden-state vector has 2048 numbers.

The important jump is this: a token is no longer "the word cat." Inside the model, it becomes a vector with hundreds or thousands of coordinates. Each coordinate is just a number, but together they form the representation the rest of the network can transform.

That last sentence is easy to read too quickly, so pause on it. The model does not store one clean, human-readable field like "animalness = 0.82" or "past tense = 0.17" in a single coordinate. The representation is usually distributed: many coordinates jointly participate in whatever the model later needs to detect or compute.

This is the first big mental shift of the course. A vector is not just "a bunch of numbers." It is the basic container the model uses for everything it knows about a token at a given stage of computation.

If you are a software engineer, you already know what a vector is: a fixed-size array of floating-point numbers, such as a float[4] in C, an np.ndarray in NumPy, or a torch.Tensor in PyTorch. The linear algebra term "vector" adds one thing: the assumption that mathematical operations (addition, scaling, dot products) are meaningful on these arrays, not just indexing and iteration.

When we say "a vector of dimension 4096," we mean an array of 4096 floats that every operation in the model treats as coordinates in a 4096-dimensional space. The operations are simple: add two vectors (element by element), multiply by a scalar (scale every element), or compute a dot product (multiply corresponding elements and sum the products into a single number). You already know how to do all of these on arrays. Linear algebra just gives them geometric names and guarantees.
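The three operations can be sketched in plain Python. This is only an illustration: the function names add, scale, and dot are chosen here for readability and are not taken from any library, and real model code would use NumPy or PyTorch rather than Python lists.

```python
# Plain-Python sketch of the three vector operations.
# Function names are illustrative, not from any library.

def add(u, v):
    """Add two vectors position by position; dimensions must match."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return [a + b for a, b in zip(u, v)]

def scale(c, v):
    """Multiply every element of v by the scalar c."""
    return [c * a for a in v]

def dot(u, v):
    """Multiply corresponding elements and sum the products into one scalar."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(u, v))

u = [1.0, 2.0, 3.0]
v = [4.0, 5.0, 6.0]

print(add(u, v))    # [5.0, 7.0, 9.0]
print(scale(2, u))  # [2.0, 4.0, 6.0]
print(dot(u, v))    # 32.0
```

Note that add and scale return a vector, while dot collapses two vectors into one scalar.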

When the model processes n tokens at once, it has n vectors — one per token. Stacking these vectors into rows gives a matrix: a 2D array with n rows and d_model columns. This is the fundamental data structure of inference.

In code terms: if each token's hidden state is float[d_model], then a sequence of n tokens is float[n][d_model] — or in tensor notation, shape [n, d_model]. Every operation in the model works on these matrices: projections multiply them by weight matrices, attention computes pairwise scores across rows, and the FFN transforms each row independently.
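As a rough sketch of that layout, here is a sequence of three token vectors stored as a list of lists. The values and the toy d_model of 4 are made up for illustration, and the shape helper is a hypothetical stand-in for the .shape attribute that tensor libraries provide.

```python
# Three token vectors with a toy d_model of 4 (values are made up).
token_vectors = [
    [0.5, -1.2, 3.0, 0.8],   # token 0
    [1.0, 0.4, -1.0, 0.2],   # token 1
    [0.0, 2.0, 0.5, -0.5],   # token 2
]

def shape(matrix):
    """Return (rows, columns) for a rectangular list of lists."""
    n_rows = len(matrix)
    n_cols = len(matrix[0])
    if any(len(row) != n_cols for row in matrix):
        raise ValueError("every row must have the same length")
    return (n_rows, n_cols)

n_tokens, d_model = shape(token_vectors)
print(n_tokens, d_model)     # 3 4
print(token_vectors[1])      # [1.0, 0.4, -1.0, 0.2], the whole vector for token 1
print(token_vectors[1][2])   # -1.0, coordinate 2 of token 1
```

The first index picks a token (a row); the second picks a coordinate within that token's vector.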

You do not need to think of matrices as abstract mathematical objects. Think of them as tables where each row is one token and each column is one coordinate of that token's vector. This table-reading skill is what tensor shape narration (L10) will formalize.

Scalar

3.7

One number. Dimension: none.

Vector (dim=4)

[0.5, -1.2, 3.0, 0.8]

Four numbers in order. Dimension: 4.

Elementwise operation: multiply each element by 2

[0.5, -1.2, 3.0, 0.8] × 2 = [1.0, -2.4, 6.0, 1.6]

Elementwise operation: add two vectors

[0.5, -1.2, 3.0, 0.8] + [1.0, 0.4, -1.0, 0.2] = [1.5, -0.8, 2.0, 1.0]

Add matching positions. Both vectors must have the same dimension. This operation is central to how transformers work: every attention and FFN sublayer adds its result back to its input (a residual connection), so vector addition is one of the most frequent operations in the whole model.
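The residual pattern itself is small enough to sketch directly. In the sketch below, toy_sublayer is a hypothetical stand-in that simply halves every element; a real attention or FFN sublayer is far more involved, but the final addition has the same form.

```python
def residual_add(x, sublayer):
    """Return x + sublayer(x), added position by position."""
    update = sublayer(x)
    if len(update) != len(x):
        raise ValueError("sublayer must preserve the vector dimension")
    return [a + b for a, b in zip(x, update)]

# Hypothetical stand-in for an attention or FFN sublayer: halves every element.
def toy_sublayer(x):
    return [0.5 * a for a in x]

x = [2.0, -4.0, 6.0, 8.0]
print(residual_add(x, toy_sublayer))  # [3.0, -6.0, 9.0, 12.0]
```

The dimension check is the code version of "both vectors must have the same dimension": the addition only makes sense if the sublayer returns a vector as wide as its input.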

Why order matters

[1, 2, 3] ≠ [3, 2, 1]

Same numbers, different positions, different vector. Later, every projection and dot product depends on those positions lining up correctly.
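A quick way to see the effect is to take a dot product with the same weights before and after reordering. The weights vector below is hypothetical and chosen only to make the difference obvious.

```python
def dot(u, v):
    """Multiply corresponding elements and sum the products."""
    return sum(a * b for a, b in zip(u, v))

weights = [1, 0, -1]   # hypothetical weights, for illustration only
v = [1, 2, 3]
v_reversed = [3, 2, 1]

print(sorted(v) == sorted(v_reversed))  # True: same numbers
print(v == v_reversed)                  # False: different vectors
print(dot(weights, v))                  # -2
print(dot(weights, v_reversed))         # 2
```

The weights did not change, yet the result flipped sign, because each weight is paired with whatever value sits at its position.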

Scalar: a single number (no dimensions)
Vector of dimension d: shape [d], e.g. [4] means 4 numbers
Indexing: v[0], v[1], ..., v[d-1] (zero-indexed)
Sequence of token vectors: shape [n_tokens, d_model], many vectors stacked into a matrix

In ordinary software, data structures usually have named fields. A request object might have user_id, timestamp, and payload. In neural networks, internal state is usually not stored that way. Instead, the model carries vectors and matrices whose coordinates are meaningful mainly because of how later computations use them.

That can feel alien at first. A senior engineer naturally wants to ask: what does coordinate 713 mean? Often there is no satisfying human-scale answer. The right question is usually not "what does this one coordinate mean?" but rather "what operations are applied to this whole vector next?" Once you adopt that lens, model code becomes much easier to follow.

This also explains why order inside a vector matters so much. Coordinate 0, coordinate 1, and coordinate 2 are not interchangeable. The learned weight matrices later in the model assume that each position in the vector means something different, even if that meaning is only implicit in the learned computation.

New notation:

v = [v₀, v₁, ..., vₙ₋₁]
dimension = n (the count of elements)

An elementwise operation applies the same function to every element independently: if v = [a, b, c], then 2v = [2a, 2b, 2c].
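Scaling by 2 is one instance of this pattern. A minimal sketch of the general case, using a hypothetical helper named elementwise and the example vector from earlier on this page:

```python
def elementwise(f, v):
    """Apply f to every element of v independently and return a new vector."""
    return [f(x) for x in v]

v = [0.5, -1.2, 3.0, 0.8]

doubled = elementwise(lambda x: 2 * x, v)
clipped = elementwise(lambda x: max(0.0, x), v)  # negatives become 0.0

print(doubled)  # [1.0, -2.4, 6.0, 1.6]
print(clipped)  # [0.5, 0.0, 3.0, 0.8]
```

In both cases the output has the same dimension as the input, and each output element depends only on the input element at the same position.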

This is the last page where a vector lives by itself. In the next lessons, many vectors will be stacked into matrices so the model can process whole token sequences at once.

For the next few lessons, keep a tiny mental picture in mind: a prompt with three token IDs becomes three vectors, and those three vectors are stacked into a matrix. That small example is enough to understand most of the math that follows.

On this page, the only part you need is the atomic unit: one token representation is one vector of length d_model. The next page turns token IDs into those vectors. After that, we will compare vectors, batch them into matrix multiplies, and project them into new spaces.

To make the sequence concrete, we will keep reusing this miniature prompt: token IDs [791, 2368, 3290]. Right now they are just three discrete symbols. Soon they will become three vectors and then a small sequence matrix.

Practice narrating, not just counting:

[4] means: one vector with 4 coordinates.
[3, 4] means: 3 vectors, each with 4 coordinates.
[32, 2048] means: 32 token positions, each represented by a 2048-dimensional vector.

If that narration feels natural, you are already learning to read model code the right way.
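To practice, the narration can be written as a small function. This is a sketch that only handles 1-D and 2-D shapes; the names narrate and count_scalars and the sentence wording are assumptions modeled on the examples above.

```python
def narrate(shape):
    """Turn a 1-D or 2-D shape into a plain-English reading."""
    if len(shape) == 1:
        return f"one vector with {shape[0]} coordinates"
    if len(shape) == 2:
        n, d = shape
        return f"{n} vectors, each with {d} coordinates"
    raise ValueError("this sketch only handles 1-D and 2-D shapes")

def count_scalars(shape):
    """Total number of scalar values: the product of all dimension sizes."""
    total = 1
    for size in shape:
        total *= size
    return total

print(narrate([4]))              # one vector with 4 coordinates
print(narrate([32, 2048]))       # 32 vectors, each with 2048 coordinates
print(count_scalars([32, 2048])) # 65536
```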

In model code, vectors are stored as contiguous arrays of floating-point numbers. The dimension (e.g., d_model = 2048) is a model hyperparameter. Every token in the model is represented as a vector of this size.

When you later see a tensor shaped [n_tokens, d_model], read it as "one vector per token." That single reading habit will unlock a large fraction of model code.

This is also why shape narration matters so much. If you can say "this object is a batch of token vectors" instead of merely reciting the raw dimensions, you are already reading the code semantically rather than syntactically.

Vector dimension directly affects memory and compute. A model with d_model = 4096 stores twice as many numbers per token as one with d_model = 2048, and that doubling applies to the hidden state at every layer of the model.

Bigger vectors do not just mean "more capacity." They also mean wider projections, larger KV caches, and more bytes moved through every layer.

So even this humble lesson has a direct systems consequence: choosing a larger d_model changes almost every major runtime cost in the network. Wider vectors ripple into attention, FFN width, memory traffic, cache footprint, and total parameter count.
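A back-of-the-envelope sketch makes the scaling concrete. It assumes 2 bytes per value (16-bit floats) and a single square d_model x d_model projection with no bias; both are simplifying assumptions, and real models use projections of several different shapes.

```python
BYTES_PER_VALUE = 2  # assumption: 16-bit floats

def hidden_state_bytes(n_tokens, d_model, bytes_per_value=BYTES_PER_VALUE):
    """Bytes needed to hold one [n_tokens, d_model] matrix of hidden states."""
    return n_tokens * d_model * bytes_per_value

def square_projection_params(d_model):
    """Parameter count of one d_model x d_model weight matrix (assumed square, no bias)."""
    return d_model * d_model

for d_model in (2048, 4096):
    print(d_model, hidden_state_bytes(1, d_model), square_projection_params(d_model))
# 2048 4096 4194304
# 4096 8192 16777216
```

Under these assumptions, doubling d_model doubles the bytes per token but quadruples the size of a square projection, which is why wider vectors ripple into parameter count as well as memory traffic.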

Check Yourself
Q1 (shape)

What is the dimension of the vector [3.1, -0.5, 2.0, 1.7, 0.0]?

Q2 (conceptual)

Is a single number (like 42.0) a scalar or a vector?

Q3 (conceptual)

Why are [1, 2, 3] and [3, 2, 1] different vectors?

Q4 (math)

What is [1, 2, 3] + [10, 20, 30]?

Q5 (math)

A tensor with shape [3, 4] contains how many scalar values in total?