Embeddings Turn Token IDs into Vectors
The model cannot do math on bare integers. It needs each token to be a vector — a dense list of numbers it can transform.
The embedding table is a lookup table with one row per vocabulary entry. Given a token ID, the model looks up that row and gets a vector.
After embedding, a sequence of n tokens becomes a matrix: n rows, each with d_model columns. This is the model's actual input.
The token ID itself does not carry useful geometry. ID 791 is not "smaller than" or "closer to" ID 792 in any meaningful way. The embedding table gives each token a learned position in a high-dimensional space the model can actually use.
That is the heart of the lesson: embeddings are the model's first meaningful numerical representation of discrete symbols. Before the lookup, the input is just a sequence of labels. After the lookup, the model has something it can compare, project, normalize, and transform with linear algebra.
Embedding lookup for 3 tokens with d_model = 4:
Three integers in, three vectors out. Stacked together, they form a [3, 4] matrix.
If the same token ID appears twice, the same row is looked up twice. The embedding table is shared across the whole sequence.
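A minimal sketch of that lookup in plain Python, assuming a hypothetical 6-entry vocabulary and made-up table values (a real table is learned and far larger). The names `embedding_table` and `embed` are illustrative and do not come from any particular library.

```python
# Toy embedding table: 6 vocabulary entries, d_model = 4.
# The values are made up for illustration; real ones are learned.
embedding_table = [
    [0.10, -0.20, 0.30, 0.40],   # row for token ID 0
    [0.50, 0.60, -0.70, 0.80],   # row for token ID 1
    [-0.90, 0.11, 0.12, -0.13],  # row for token ID 2
    [0.14, 0.15, 0.16, 0.17],    # row for token ID 3
    [-0.18, -0.19, 0.20, 0.21],  # row for token ID 4
    [0.22, -0.23, -0.24, 0.25],  # row for token ID 5
]


def embed(token_ids, table):
    """Return one row of `table` per token ID, in sequence order."""
    return [list(table[i]) for i in token_ids]


X = embed([4, 1, 4], embedding_table)   # ID 4 appears twice
shape = (len(X), len(X[0]))
print(shape)            # (3, 4)
print(X[0] == X[2])     # True: the repeated ID reads the same row
```

Positions 0 and 2 hold identical vectors because both came from row 4 of the one shared table.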
We can now instantiate the tiny prompt introduced on the previous page.
The three token IDs [791, 2368, 3290] become three rows:
Stacked together, these become a hidden-state matrix of shape [3, 4].
We will keep reusing this same tiny matrix when we talk about dot products, matrix multiplication, projections, and shape traces.
Imagine trying to do model computation directly on token IDs. The numbers themselves would be misleading. Token ID 100 is not "twice as meaningful" as token ID 50, and ID 791 is not inherently similar to ID 792.
The integers are labels, not coordinates in a useful space.
The embedding table fixes that by assigning each vocabulary item a learned vector. Now the model can place frequently co-occurring or functionally similar pieces into regions of representation space that later layers can exploit. You do not need to imagine the full geometry precisely. The key point is simpler: the model replaces arbitrary IDs with a representation that supports meaningful algebra.
This is also why embeddings are learned rather than hand-written. Nobody manually specifies what each coordinate should mean. Training discovers an arrangement of vectors that makes later prediction easier.
You might think: why not just feed the integer ID into the model? The number 791 is already a number. But integer IDs are categorical labels, not measurements. The model needs a representation where arithmetic operations produce meaningful results. Consider three alternatives and why they fail:
- Raw integers: If "cat" is ID 2368 and "dog" is ID 5765, the model would infer that "dog" is 2.4× "cat" — meaningless. Arithmetic on IDs is nonsense.
- One-hot vectors: A vector of length |V| with a 1 at the token's position and 0 elsewhere. This avoids the ordering problem, but the vectors are all the same distance apart — and the dimension is |V| (e.g. 128,000), making every subsequent matrix multiply enormous. (Mathematically, multiplying a one-hot by a weight matrix is equivalent to looking up a row — which is exactly what an embedding table does. But one-hot makes the representation |V|-dimensional instead of d_model-dimensional.)
- Random vectors: Assign each token a random d_model-dimensional vector. This gives each token a unique representation, but the geometry is random — the model would have to learn to undo the randomness before doing anything useful.
Learned embeddings solve all three problems. Each token gets a d_model-dimensional vector that is adjusted during training so that the geometry is useful: tokens that play similar roles end up in similar regions of the space, and the model's later operations can exploit this structure directly.
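The first two failure modes in the list above can be checked numerically. The sketch below reuses the "cat" and "dog" IDs from the text and assumes a hypothetical 6-entry vocabulary for the one-hot part so the vectors stay small; `one_hot` is an illustrative helper, not a library function.

```python
import math

# IDs taken from the example above: "cat" -> 2368, "dog" -> 5765.
cat_id, dog_id = 2368, 5765
ratio = dog_id / cat_id   # about 2.43, a number with no linguistic meaning


def one_hot(index, size):
    """Vector of length `size` with a single 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


# Hypothetical tiny vocabulary so the vectors stay readable.
vocab_size = 6
vectors = [one_hot(i, vocab_size) for i in range(vocab_size)]

# Distance between every pair of distinct one-hot vectors.
distances = {
    round(math.dist(vectors[a], vectors[b]), 12)
    for a in range(vocab_size)
    for b in range(a + 1, vocab_size)
}
print(ratio)
print(distances)   # a single value, sqrt(2): no token is closer to any other
```

Every pair of one-hot vectors is exactly the same distance apart, so that representation cannot express "cat is more like dog than like a semicolon" without further learned weights.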
The embedding table starts with random values (or some other initialization scheme). During training, each step computes a loss that measures how far the model's prediction was from the actual next token, and the gradients of that loss flow back through the network, including back to the embedding table. The gradient for token 791 says "this embedding vector should have been slightly different to make the prediction better." Over billions of training examples, each row gets nudged into a position that helps the rest of the model.
The result: tokens that appear in similar contexts (like "cat" and "dog") develop similar embedding vectors, because similar vectors lead to similar model behavior, which is what the training objective rewards. This is not hand-designed — it emerges from the training signal.
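A toy sketch of one such nudge, under heavy simplifying assumptions: the loss is a made-up squared distance between the looked-up row and a hypothetical `target` vector rather than a real language-model loss, and the update is plain gradient descent with an arbitrary learning rate. It is meant only to show that the row that was looked up is the one that moves.

```python
# Toy table: 3 vocabulary entries, d_model = 2 (values are arbitrary).
table = [
    [0.5, -0.5],
    [0.1, 0.9],
    [-0.3, 0.2],
]
before = [list(row) for row in table]   # copy for comparison


def loss(row, target):
    """Toy loss: squared distance between a row and a target vector."""
    return sum((r - t) ** 2 for r, t in zip(row, target))


def sgd_step(table, token_id, target, lr=0.1):
    """One gradient-descent step on the toy loss for a single token.

    The gradient with respect to the looked-up row is 2 * (row - target).
    Rows that were not looked up receive no gradient from this loss.
    """
    row = table[token_id]
    grad = [2.0 * (r - t) for r, t in zip(row, target)]
    table[token_id] = [r - lr * g for r, g in zip(row, grad)]


target = [1.0, 0.0]   # stand-in for "what would have predicted better"
old_loss = loss(table[1], target)
sgd_step(table, token_id=1, target=target)
new_loss = loss(table[1], target)

print(new_loss < old_loss)                             # True
print(table[0] == before[0], table[2] == before[2])    # True True
```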
At inference time, the embedding table is frozen. The values do not change. Embedding lookup is a pure table read.
When we say tokens are "nearby" in embedding space, we mean their vectors have similar coordinates — they point in roughly the same direction. (The next lesson will introduce tools to measure this precisely.) For example, if "cat" has embedding [0.12, -0.34, 0.56, 0.78] and "dog" has embedding [0.15, -0.31, 0.52, 0.81], these vectors are close: every coordinate is similar. But "cat" and ";" might have very different vectors — say [0.12, -0.34, 0.56, 0.78] versus [-0.88, 0.42, -0.15, 0.03] — they are far apart in the space.
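As a quick check of the example vectors above, the sketch below measures straight-line (Euclidean) distance between them. Euclidean distance is just one simple choice made here for illustration; it is not necessarily the tool the next lesson introduces.

```python
import math

# Vectors copied from the paragraph above.
cat = [0.12, -0.34, 0.56, 0.78]
dog = [0.15, -0.31, 0.52, 0.81]
semicolon = [-0.88, 0.42, -0.15, 0.03]

d_cat_dog = math.dist(cat, dog)
d_cat_semicolon = math.dist(cat, semicolon)

print(f"cat to dog: {d_cat_dog:.3f}")
print(f"cat to ';': {d_cat_semicolon:.3f}")
```

The first distance comes out well under 0.1 and the second above 1.5, which is the "close" versus "far apart" contrast the text describes.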
This geometric structure is what makes the rest of the model work. When a later layer projects these vectors or computes dot products between them, the similarity structure propagates: similar embeddings produce similar projections and similar attention scores.
Embedding is a table lookup. Operationally: X = embedding_table[token_ids]
Read that expression literally: "take the list of token IDs, and use each ID to index one row from the embedding table." If token_ids = [791, 2368, 3290], then X = [table[791], table[2368], table[3290]] — three rows stacked into a matrix.
Mathematically, this is equivalent to multiplying a one-hot matrix by the embedding table: X = one_hot(token_ids) × embedding_table. The one-hot matrix has shape [n_tokens, |V|], and the multiply selects the correct rows. But in practice, the lookup is implemented as array indexing — much faster than materializing a sparse [n_tokens, 128000] matrix.
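The equivalence can be verified on a toy table. This sketch assumes a hypothetical vocabulary of 5 with d_model = 3 and arbitrary values, and writes the matrix multiply out by hand in plain Python so nothing is hidden behind a library call.

```python
# Toy table: vocabulary of 5, d_model = 3 (values are arbitrary).
embedding_table = [
    [0.1, 0.2, 0.3],
    [-0.4, 0.5, -0.6],
    [0.7, -0.8, 0.9],
    [1.0, 1.1, -1.2],
    [-1.3, 1.4, 1.5],
]
token_ids = [3, 0, 3]
vocab_size = len(embedding_table)
d_model = len(embedding_table[0])

# Path 1: direct indexing, one row per token ID.
X_lookup = [embedding_table[i] for i in token_ids]

# Path 2: build a [n_tokens, vocab_size] one-hot matrix and multiply.
one_hot = [[1.0 if j == i else 0.0 for j in range(vocab_size)]
           for i in token_ids]
X_matmul = [
    [sum(one_hot[t][v] * embedding_table[v][c] for v in range(vocab_size))
     for c in range(d_model)]
    for t in range(len(token_ids))
]

print(X_lookup == X_matmul)   # True
```

Path 2 performs vocab_size multiplications for every output value, almost all of them by zero, while path 1 performs none. That gap is why real implementations index instead of multiplying.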
The embedding dimension d_model is a design choice with direct consequences:
- Larger d_model (e.g., 4096 → 8192): More capacity to encode token distinctions. But the embedding table doubles in size (|V| × 8192 instead of |V| × 4096), every hidden-state vector in the model doubles in width, and any weight matrix with d_model on both sides roughly quadruples. The model becomes more capable but significantly more expensive.
- Smaller d_model (e.g., 4096 → 1024): The embedding table is 4× smaller. But each token has fewer dimensions to encode its identity, which means the model must compress more information into fewer numbers. Quality degrades.
For reference: GPT-2 Small uses d_model = 768. Llama 3 70B uses d_model = 8192. The choice cascades through every layer of the model — it is one of the most consequential architecture decisions.
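To make the cascade concrete, the sketch below counts parameters as d_model changes. It assumes the 128,000-entry vocabulary used as an example in this lesson, and it assumes later layers contain at least one weight matrix of shape [d_model, d_model], which is a common but not universal layer shape. Both helper names are illustrative.

```python
def embedding_params(vocab_size, d_model):
    """Parameters in a [vocab_size, d_model] embedding table."""
    return vocab_size * d_model


def square_projection_params(d_model):
    """Parameters in one [d_model, d_model] weight matrix (assumed shape)."""
    return d_model * d_model


vocab_size = 128_000   # example vocabulary size used in this lesson
for d_model in (1024, 4096, 8192):
    print(d_model,
          embedding_params(vocab_size, d_model),
          square_projection_params(d_model))
```

The embedding table grows linearly with d_model, while a square projection grows with its square, so doubling d_model costs more downstream than it does in the table itself.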
It is important not to over-interpret embeddings. They are the initial representation of each token, not the final one. In some transformer architectures (GPT-2, for example), positional information is added immediately after the token embedding, before the first attention block even runs. Others, including Llama models that use rotary position embeddings, apply position inside the attention layers instead. Either way, layers compare tokens with one another and repeatedly transform these vectors. By the time the model is deep into the network, the hidden state for a token carries much more than the original embedding row.
That distinction matters when reading code. The embedding table is static model data — its values are fixed once training is complete. The hidden states are dynamic per-request tensors. Confusing those two leads to a lot of conceptual bugs when people first read transformer implementations.
A useful informal phrase is: "the embedding tells the model what token this is before context starts to matter; the hidden state tells the model what this token means here, in this sequence, after many layers of processing."
If the same token ID appears twice, the same embedding row is looked up twice. At the embedding step, those two positions start from the same vector. That part is simple.
But that does not mean those positions remain identical forever. Later, position handling and contextual processing will cause the two token positions to diverge. The same word in two different places can end up with very different hidden states.
So the embedding table gives you a shared starting point, not a permanent identity.
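A small sketch of that divergence, assuming one particular position scheme: a hypothetical per-position vector added to each embedding, in the style of GPT-2. Other architectures handle position differently, and all values here are invented for illustration.

```python
# Toy table: vocabulary of 3, d_model = 2 (values are arbitrary).
embedding_table = [
    [0.2, -0.1],
    [0.7, 0.4],
    [-0.5, 0.3],
]
# Hypothetical per-position vectors, one per sequence position.
position_vectors = [
    [0.00, 0.10],
    [0.05, -0.05],
    [-0.10, 0.20],
]

token_ids = [1, 2, 1]   # token 1 appears at positions 0 and 2

embedded = [embedding_table[i] for i in token_ids]
with_position = [
    [e + p for e, p in zip(row, pos)]
    for row, pos in zip(embedded, position_vectors)
]

print(embedded[0] == embedded[2])             # True: same starting row
print(with_position[0] == with_position[2])   # False: positions differ
```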
Try narrating the running example in plain English:
- [791, 2368, 3290] means: 3 token IDs.
- [3, 4] after embedding means: 3 token positions, each with a 4-dimensional representation.
- [|V|, 4] means: one learned 4-dimensional row for each vocabulary entry.
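The same shape narration can be traced in code. This sketch assumes a hypothetical vocabulary size of 4000, chosen only so that ID 3290 is a valid row index, and fills the table with seeded random numbers as a stand-in for learned values.

```python
import random

random.seed(0)          # fixed seed so the sketch is repeatable
vocab_size = 4000       # hypothetical; just large enough to contain ID 3290
d_model = 4

# Stand-in for a learned table: random values, shape [|V|, 4].
embedding_table = [[random.uniform(-1.0, 1.0) for _ in range(d_model)]
                   for _ in range(vocab_size)]

token_ids = [791, 2368, 3290]                  # 3 token IDs
X = [embedding_table[i] for i in token_ids]    # 3 positions, 4 numbers each

print("token_ids:", (len(token_ids),))
print("table:    ", (len(embedding_table), len(embedding_table[0])))
print("X:        ", (len(X), len(X[0])))
```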
The embedding table is a learned weight matrix stored in the model file. In llama.cpp, this is the first operation after tokenization. The table is loaded once and used for every token lookup.
This is the first place where symbolic input becomes numeric state. From here onward, the model only sees floating-point tensors, never strings.
That is why model-loading bugs around vocabulary order are so dangerous. If token IDs point at the wrong embedding rows, the rest of the model is operating on the wrong starting vectors even though all later tensor code may still look numerically valid.
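A toy reproduction of that failure, using a hypothetical four-token vocabulary and invented row values. The helper `embed_text` is illustrative and is not how llama.cpp or any specific runtime is structured.

```python
# Vocabulary order the table was trained with (hypothetical toy vocabulary).
trained_vocab = ["the", "cat", "sat", ";"]
embedding_table = [
    [0.9, 0.1],    # learned row for "the"
    [0.1, 0.8],    # learned row for "cat"
    [0.4, 0.4],    # learned row for "sat"
    [-0.7, -0.6],  # learned row for ";"
]

# A loader that reads the vocabulary in a different order (the bug).
loaded_vocab = ["the", ";", "cat", "sat"]


def embed_text(tokens, vocab, table):
    """Map token strings to IDs with `vocab`, then look up rows in `table`."""
    return [table[vocab.index(tok)] for tok in tokens]


correct = embed_text(["cat"], trained_vocab, embedding_table)
buggy = embed_text(["cat"], loaded_vocab, embedding_table)

print(correct[0])   # [0.1, 0.8], the row learned for "cat"
print(buggy[0])     # [0.4, 0.4], the row learned for "sat"
```

Both results have the right shape and plausible-looking values, so nothing downstream raises an error. Only the model's output quality reveals the mismatch.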
The embedding table size is |V| × d_model parameters. For a 128K vocabulary with d_model=4096, that is 500M+ numbers. At FP16 (2 bytes each) that is roughly 1 GB; at FP32 it doubles. This table must fit in memory, and its footprint depends directly on the numerical precision the runtime uses. The lookup itself is fast — the cost is in storing it.
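The arithmetic behind those figures, as a short sketch. It reads "128K" as exactly 128,000 entries, which is an assumption, and reports decimal gigabytes; `table_bytes` is an illustrative helper.

```python
def table_bytes(vocab_size, d_model, bytes_per_param):
    """Storage for a [vocab_size, d_model] table at a given precision."""
    return vocab_size * d_model * bytes_per_param


vocab_size = 128_000   # "128K" read as 128,000, an assumption
d_model = 4096

params = vocab_size * d_model
fp16 = table_bytes(vocab_size, d_model, 2)   # 2 bytes per FP16 value
fp32 = table_bytes(vocab_size, d_model, 4)   # 4 bytes per FP32 value

print(f"{params:,} parameters")
print(f"FP16: {fp16 / 1e9:.2f} GB")
print(f"FP32: {fp32 / 1e9:.2f} GB")
```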
This is an important systems pattern: the embedding lookup is not compute-heavy like matrix multiply. It is more about memory footprint and memory access.
That makes embeddings a good example of why "hot" code is not always "math-heavy" code. Some inference costs are dominated by arithmetic, while others are dominated by storing and fetching large parameter tables efficiently.
If you embed 5 tokens with d_model = 768, what is the shape of the result?
If the same token ID appears twice in a prompt, what happens at the embedding step?
If 3 tokens are embedded with d_model = 4, how many scalar values are in the resulting matrix?