Dot Product as Weighted Comparison
The dot product takes two vectors of the same length and produces a single number. It multiplies corresponding elements and adds them up.
The result tells you something about how similar or aligned the two vectors are. If both vectors have large positive values in the same positions, the dot product is large and positive. If they point in opposite directions, it is negative.
One subtlety matters later: a raw dot product mixes alignment and magnitude. Two very large vectors can have a bigger dot product than two smaller but similarly oriented vectors. This is why later lessons will care about scaling and normalization.
In later modules, you will see that the dot product is central to how the model compares tokens and transforms representations. Understanding this operation is essential.
The most useful engineering phrase for a dot product is not just "similarity." It is a weighted compatibility score. Every coordinate contributes according to both vectors. Matching large coordinates push the score up. Opposing coordinates pull it down. Coordinates near zero contribute little.
Dot product of a = [2, 3, -1] and b = [4, -1, 2]:
a · b = 2×4 + 3×(-1) + (-1)×2 = 8 - 3 - 2 = 3
Result: 3. A positive number, so the vectors are somewhat aligned.
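This example can be checked with a minimal Python sketch. The helper name `dot` is illustrative and not taken from any library.

```python
def dot(a, b):
    """Multiply corresponding elements and sum the products."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

a = [2, 3, -1]
b = [4, -1, 2]

# 2*4 + 3*(-1) + (-1)*2 = 8 - 3 - 2
print(dot(a, b))  # 3
```

The length check mirrors the rule that both vectors must have the same dimension; without it, `zip` would silently ignore the extra elements of the longer vector.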
Magnitude matters too
[1, 1] · [1, 1] = 2
[10, 10] · [1, 1] = 20
Both pairs point in the same direction, but the larger vector produces a larger score. Keep that in mind when attention later divides scores by √d.
A dot product is best understood as a sum of local agreements and disagreements. One coordinate says, effectively, "I vote for these vectors going together." Another coordinate may say the opposite. The final scalar is the total result after all of those coordinate-level votes are aggregated.
This is why the operation fits attention so well. A query vector can be thought of as saying what kinds of features it is looking for. A key vector says what kinds of features are present. The dot product is then a score for how strongly those needs and offerings line up.
That is more specific and more useful than vague "similarity." Two vectors can be compatible for one task because the coordinates that matter to the current computation line up strongly, even if a human would not describe them as semantically similar in ordinary language.
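The query/key reading can be sketched in a few lines of Python. The vectors below and the idea that each coordinate stands for one feature are invented for illustration; real learned coordinates are not this tidy.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypothetical 3-feature vectors; the numbers are made up for this sketch.
query = [1.0, 0.0, 2.0]        # "looking for" features 0 and 2, ignores feature 1
keys = {
    "k1": [0.5, 3.0, 1.5],     # offers features 0 and 2
    "k2": [-1.0, 3.0, 0.0],    # opposes feature 0, offers nothing on feature 2
    "k3": [0.0, 0.0, 0.0],     # offers nothing at all
}

scores = {name: dot(query, k) for name, k in keys.items()}
print(scores)  # {'k1': 3.5, 'k2': -1.0, 'k3': 0.0}
```

Both k1 and k2 carry a large value in feature 1, but the query has a zero there, so that coordinate never affects the score. Compatibility is decided only by the coordinates the query actually weights.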
In general, a · b = a₁b₁ + a₂b₂ + ... + aₙbₙ: multiply each pair, sum the results. Both vectors must have the same dimension.
You can think of the final score as a ledger of contributions. Every coordinate votes for or against the final result. Coordinates where both vectors are large and same-signed dominate the sum.
| Pattern | Effect on score |
|---|---|
| large positive × large positive | Strong positive contribution |
| large positive × large negative | Strong negative contribution |
| small value × anything | Small effect, because that coordinate barely votes |
| many medium matches | Can outweigh one dramatic mismatch because the sum aggregates all coordinates |
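The last row of the table is easy to check numerically. The following sketch uses invented vectors with one dramatic mismatch and five medium matches; `contributions` is a hypothetical helper name.

```python
def contributions(a, b):
    """Per-coordinate products: the individual votes in the ledger."""
    return [x * y for x, y in zip(a, b)]

# Invented vectors: coordinate 0 disagrees strongly, the other five agree mildly.
a = [5, 2, 2, 2, 2, 2]
b = [-3, 2, 2, 2, 2, 2]

votes = contributions(a, b)
print(votes)       # [-15, 4, 4, 4, 4, 4]
print(sum(votes))  # 5
```

The single -15 is the largest entry in the ledger, yet the five +4 votes together outweigh it and the final score is positive.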
The dot product has a geometric meaning that is worth understanding deeply, because it is the reason this operation is used in attention:
a · b = ||a|| ||b|| cos(θ)
This formula separates the dot product into three factors:
- ||a|| — the magnitude of vector a. Larger vectors produce larger dot products.
- ||b|| — the magnitude of vector b. Same effect.
- cos(θ) — the cosine of the angle between them. This is the alignment factor. It ranges from -1 (pointing in opposite directions) through 0 (perpendicular / no relationship) to +1 (pointing in the same direction).
This explains the sign of the result:
- Positive dot product: vectors point roughly the same way (angle < 90°).
- Zero dot product: vectors are perpendicular / orthogonal (angle = 90°). Neither vector has any component along the other.
- Negative dot product: vectors point roughly opposite (angle > 90°).
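To see the three cases concretely, here is a small sketch that recovers the angle from the dot product and the two magnitudes. The function names are illustrative, and the 2-dimensional vectors are chosen only because their angles are easy to picture.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def angle_degrees(a, b):
    """Angle between a and b, from a . b = ||a|| ||b|| cos(theta)."""
    cos_theta = dot(a, b) / (norm(a) * norm(b))
    # Clamp to guard against tiny floating-point overshoot outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))

print(angle_degrees([1, 0], [1, 1]))   # about 45: positive dot product
print(angle_degrees([1, 0], [0, 1]))   # 90.0: zero dot product
print(angle_degrees([1, 0], [-1, 1]))  # about 135: negative dot product
```

The function is undefined for a zero vector, because a zero magnitude makes the division meaningless; a zero vector has no direction to compare.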
When two vectors are orthogonal (perpendicular), their dot product is exactly 0. This is a critical concept for understanding learned representations: orthogonal vectors contribute zero to each other's dot products.
In a d_model-dimensional space, you can have up to d_model mutually orthogonal directions. Learned embeddings and projections can exploit this: different features (syntax, semantics, position, topic) can live in orthogonal, or nearly orthogonal, subspaces, so they barely interact in dot-product computations. When a later dot product computes a score, features in exactly orthogonal subspaces contribute zero: they are invisible to that particular comparison.
This is why high-dimensional representations are powerful: 4096 dimensions can encode many non-interfering features simultaneously. A 4-dimensional toy example cannot. Keep this in mind when the toy examples seem too simple — the real power comes from dimensionality.
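A toy sketch of non-interfering features, under a deliberately artificial assumption: coordinates 0-1 hold one feature and coordinates 2-3 hold another. Real models do not align features with individual axes this neatly.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy 4-d layout (an assumption for illustration only):
# coordinates 0-1 carry "feature A", coordinates 2-3 carry "feature B".
only_a = [0.8, -0.5, 0.0, 0.0]
only_b = [0.0, 0.0, 0.3, 0.9]
mixed = [0.8, -0.5, 0.3, 0.9]   # carries both features at once

print(dot(only_a, only_b))                        # 0.0: the subspaces never interact
print(dot(mixed, only_a) == dot(only_a, only_a))  # True: feature B is invisible here
```

A comparison against `only_a` gives the same score whether or not feature B is present in the other vector. With only 4 dimensions there is room for very few such independent features, which is the limitation of toy examples.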
The dot product of a vector with itself gives the squared length of that vector. The magnitude (or norm) is the square root of that value:
||a|| = √(a · a) = √(a₁² + a₂² + ... + aₙ²)
For example: ||[3, 4]|| = √(9 + 16) = √25 = 5.
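The same calculation as a short Python sketch (`norm` is an illustrative helper name):

```python
import math

def norm(v):
    """Magnitude: square root of the vector's dot product with itself."""
    return math.sqrt(sum(x * x for x in v))

print(norm([3, 4]))  # 5.0
```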
Magnitude matters because a raw dot product conflates two things: how well-aligned the vectors are and how large they are. Two huge vectors pointing roughly the same way can outscore two small vectors pointing exactly the same way.
Cosine similarity strips out magnitude and measures only alignment:
cosine(a, b) = (a · b) / (||a|| ||b||)
Worked example: Take a = [3, 4] and b = [4, 3].
a · b = 3×4 + 4×3 = 12 + 12 = 24
||a|| = √(9 + 16) = 5
||b|| = √(16 + 9) = 5
cosine(a, b) = 24 / (5 × 5) = 24/25 = 0.96
A cosine of 0.96 means the vectors are nearly aligned (perfect alignment = 1.0). Compare with a = [3, 4] and c = [-4, 3]: a · c = -12 + 12 = 0, so cosine = 0 — these vectors are orthogonal.
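The worked example can be reproduced with a short sketch. The extra vector `big_b` is an addition for illustration: it is b scaled by 10, which shows that the raw score changes while the cosine does not.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Dot product divided by the product of the two magnitudes."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b, c = [3, 4], [4, 3], [-4, 3]
print(cosine(a, b))  # 0.96
print(cosine(a, c))  # 0.0

big_b = [40, 30]     # same direction as b, ten times the magnitude
print(dot(a, b), dot(a, big_b))  # 24 240
print(cosine(a, big_b))          # 0.96
```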
You will encounter cosine similarity in discussions about embedding comparisons. Inside the transformer itself, the model uses raw dot products and then controls magnitude through explicit scaling (dividing by √d) rather than normalizing to cosine. But the underlying concern is the same: magnitude can distort compatibility scores, so something must be done about it.
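A minimal sketch of that scaling step, assuming d is the length of the two vectors being compared. The vectors and the helper name `scaled_score` are invented for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scaled_score(q, k):
    """Raw dot product divided by sqrt(d), where d is the vector length."""
    return dot(q, k) / math.sqrt(len(q))

q = [1.0, 2.0, 0.0, -1.0]
k = [2.0, 1.0, 3.0, 1.0]

print(dot(q, k))           # 3.0
print(scaled_score(q, k))  # 1.5
```

Unlike cosine similarity, the divisor here depends only on the dimension, not on the contents of the vectors. A query with a larger magnitude still produces larger scores; the scaling only compensates for the growth that comes from summing over more coordinates.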
Reuse the three 4-dimensional embeddings from the previous page. Take the first two rows:
x₁ = [0.12, -0.34, 0.56, 0.78]
x₂ = [0.91, 0.23, -0.67, 0.45]
Their dot product is:
0.12×0.91 + (-0.34)×0.23 + 0.56×(-0.67) + 0.78×0.45 = 0.1092 - 0.0782 - 0.3752 + 0.3510 = 0.0068
That value is close to zero because the positive and negative coordinate contributions largely cancel out. This is a good reminder that the dot product is not looking for matching text labels. It is aggregating coordinate-level compatibility.
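If you instead take x₁ · x₁, the score is 0.0144 + 0.1156 + 0.3136 + 0.6084 = 1.052. It is much larger and positive because every coordinate agrees with itself: each term is a square, so no contribution can be negative.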
We will use this exact idea again when one projected token representation compares to another inside attention.
Use the ledger idea when you compute by hand:
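One way to lay out that ledger is sketched below, using the coordinates that appear in the x₁ · x₂ computation above. The helper name `ledger` is hypothetical.

```python
x1 = [0.12, -0.34, 0.56, 0.78]
x2 = [0.91, 0.23, -0.67, 0.45]

def ledger(a, b):
    """Return (per-coordinate products, their total)."""
    entries = [x * y for x, y in zip(a, b)]
    return entries, sum(entries)

entries, total = ledger(x1, x2)
for i, entry in enumerate(entries):
    print(f"coordinate {i}: {entry:+.4f}")
print(f"total: {total:+.4f}")

# Comparing a vector with itself: every entry is a square, so none is negative.
self_entries, self_total = ledger(x1, x1)
print(f"x1 . x1 total: {self_total:+.4f}")
```

Writing the signed entries in a column before adding them makes it obvious which coordinates dominate and which ones cancel.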
The dot product is the primitive behind matrix multiplication. In llama.cpp, many core model operations are built from dot products. The ggml_mul_mat operation performs many dot products at once.
In other words: once you understand one dot product, you already understand the inner loop of a huge amount of model code. The next page scales that idea up to full matrix multiplication.
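To make the connection concrete, here is a plain-Python sketch of matrix-vector and matrix-matrix products written as nothing but repeated dot products. It illustrates the pattern only; it says nothing about how ggml_mul_mat lays out memory or optimizes the loop, and the matrices are invented.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(rows, x):
    """Matrix-vector product: one dot product per matrix row."""
    return [dot(row, x) for row in rows]

def matmul(A, B):
    """Matrix product: entry (i, j) is row i of A dotted with column j of B."""
    cols = list(zip(*B))  # transpose B so its columns can be iterated directly
    return [[dot(row, col) for col in cols] for row in A]

W = [[1, 0, 2],
     [0, 1, -1]]
x = [3, 4, 5]

print(matvec(W, x))                                # [13, -1]
print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```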
This is a major code-reading milestone. Many low-level kernels look intimidating only because they optimize the mechanics of performing the same multiply-and-accumulate pattern at scale.
A dot product of dimension d requires d multiplications and d-1 additions. In attention, every query-key token pair requires a dot product, so n tokens need n² scores per head. This is why the cost of computing attention scores grows quadratically with sequence length and linearly with the vector dimension.
For one token pair this is tiny. For 1,024 tokens, the model needs over a million pairwise scores in one attention head. That is why such a simple primitive becomes a real runtime cost.
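A back-of-the-envelope count, as a sketch. It assumes every token is scored against every token (no masking) and uses a per-head dimension of 64, which is an assumed value chosen for illustration.

```python
def attention_score_cost(n_tokens, d):
    """Count the work for all pairwise scores in one head (no masking assumed)."""
    pairs = n_tokens * n_tokens   # one dot product per token pair
    mults = pairs * d             # d multiplications per dot product
    adds = pairs * (d - 1)        # d - 1 additions per dot product
    return pairs, mults, adds

pairs, mults, adds = attention_score_cost(1024, 64)
print(pairs)  # 1048576
print(mults)  # 67108864
print(adds)   # 66060288
```

Doubling the sequence length quadruples all three numbers, while doubling d roughly doubles the multiplications and additions.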
This is one of the recurring lessons of systems work on LLMs: a tiny primitive repeated often enough becomes the product. The mathematical idea is simple; the engineering challenge is making that simple idea run efficiently for enormous shapes.
What is the dot product of [1, 0, 3] and [2, 5, -1]?
If you double the magnitude of every query vector (multiply by 2), what happens to the raw Q·K scores?
Which statement about a raw dot product is most accurate?
What is the dot product of [1, 2] and [3, 4]?
What is the magnitude (norm) of the vector [5, 12]?