Vocabulary and Token IDs
The model's vocabulary is a finite table. Every possible token piece has a unique integer ID.
When text is tokenized, each piece is looked up in this table to find its ID. The model's actual input is a sequence of these integers — not strings.
The vocabulary size (often written |V|) is fixed when the model is trained. A typical vocabulary has 32,000 to 256,000 entries.
That fixed vocabulary is the model's discrete interface to language. It does not invent new IDs at runtime. If text is unusual, the tokenizer must express it using smaller pieces that already exist in the table.
Look at the Token IDs section below. Each token piece maps to a specific integer, and those IDs are what the model actually consumes. The widget is a teaching approximation, so the lesson is about the mapping pattern, not about exact production token counts.
Vocabulary lookup for "The cat sat":
| Token piece | Token ID |
|---|---|
| "The" | 791 |
| " cat" | 2368 |
| " sat" | 3290 |
The model receives [791, 2368, 3290] — three integers, nothing else.
The number 2368 does not inherently mean " cat". It only means that because the tokenizer vocabulary, embedding table, and output head all agree on that mapping.
If we permuted every vocabulary ID and permuted the corresponding rows and columns of the model consistently, the model could behave the same.
So the ID values are arbitrary in one sense, but the mapping is critically important in another. They are arbitrary labels that become meaningful only because the entire inference stack is built around them. This is why a tokenizer mismatch is catastrophic even though the integers themselves look innocent.
The vocabulary is fixed at training time because the model's embedding table and output projection are weight matrices with a fixed number of rows — one per vocabulary entry. Adding a new token would require adding a new row to both matrices, which means changing the model's parameters. At inference time, the model's parameters are frozen, so the vocabulary is frozen too.
- There is no runtime step that creates a brand-new token ID for a novel string.
- Rare or awkward text must be decomposed into smaller pieces that already exist in the vocabulary.
- Special tokens like end-of-sequence or control markers also occupy vocabulary IDs.
- The same ID namespace appears at both ends of inference: as input IDs and as output logits over the vocabulary.
A useful first-principles view is that the vocabulary appears twice in inference. At the beginning, token IDs index into the embedding table to fetch learned vectors. At the end, the model produces one logit per vocabulary entry, and decoding chooses one of those IDs as the next token.
That symmetry explains a lot of architecture decisions. The model never predicts raw strings directly. It predicts a distribution over vocabulary IDs. Detokenization is a separate post-processing step that turns the chosen IDs back into visible text pieces.
This also explains why vocabulary size affects both memory and compute. Bigger vocabularies mean a larger embedding table on the input side and a larger logits vector on the output side.
The new concept is indexing into a finite table:
The vocabulary is just a lookup table. Given an ID, you get the corresponding text piece. Given a text piece, you get the ID.
The token ID is not the meaning. It is only the row index used to fetch a learned vector. Later, the model produces |V| output scores so one of those IDs can be chosen again.
Not every vocabulary entry corresponds to visible user text. Many tokenizers include special markers for things like beginning-of-sequence, end-of-sequence, turn boundaries, fill-in-the-middle markers, or tool-call structure.
These tokens matter because the model was trained to react to them as part of its protocol. To a systems engineer, this is a reminder that the vocabulary is not just a dictionary of word pieces. It is part of the model's control surface.
This is also the first place hidden prompt overhead appears. A chat runtime may wrap a user-visible message in extra structure such as BOS tokens, role markers, separators, or assistant-prefix tokens before the model sees anything. The user may think they sent one short sentence, while the actual tokenized request is larger because protocol tokens were added around it.
Focus on stages 2, 3, and 5 below: token IDs first select embedding rows, then the final hidden state is scored against the same vocabulary space.
Raw Text
A string of characters. This is what the user types.
"The cat sat"The model cannot process raw text directly. It must first be converted into numbers.
A larger vocabulary can help because common patterns may collapse into fewer tokens. That can shorten prompts and reduce the amount of sequence work the model performs. But the larger vocabulary also increases the size of the embedding table and the number of logits produced at each decoding step.
A smaller vocabulary does the opposite. It reduces table sizes, but it forces more text to be expressed as multiple pieces. That can increase prompt length and sometimes make generation less efficient because the model has to reason over more tokens.
This is why vocabulary design is not a cosmetic tokenizer choice. It is part of the end-to-end serving tradeoff between model size, prompt compression, and output-head cost.
Open src/llama-vocab.cpp and look for llama_tokenize() and llama_detokenize().
What to notice: the vocabulary object owns both directions of the discrete interface. The same ID space is used to go from text pieces to integers and back again.
Then open src/llama-batch.cpp and read the validation loop in llama_batch_allocr::init().
What to notice: the runtime explicitly rejects any token ID outside [0, vocab.n_tokens()).
That is the discrete contract made visible in code: IDs are not arbitrary runtime data, they must belong to the model's fixed vocabulary.
Finally, notice how this connects to the model boundary you already learned: an input token ID acts like a row selector into the embedding matrix, and the output head later produces one score per vocabulary entry. This is why model conversion and runtime integration have to preserve tokenizer metadata carefully. If vocabulary order or special-token IDs shift, the model may still run but speak a different discrete language than the one it was trained on.
Vocabulary size |V| determines the size of the output layer (one score per entry). Larger vocabularies mean more work at the output head. This is why vocabulary size is a design decision, not just a data detail.
There is a tradeoff: a larger vocabulary can reduce token count for common patterns, but it also increases the embedding table and the final logits vector. Vocabulary design affects both model quality and serving cost.
During decode, the output head produces one score per vocabulary entry for every generated position.
That makes |V| a real systems number, not just a tokenizer fact hidden in metadata.
Worked Solution Exercise 1
Why could a model behave the same if every vocabulary ID were permuted, but only if the embedding table and output head were permuted too?
Because the integer itself does not carry intrinsic meaning. The meaning comes from the global agreement between tokenizer, embedding lookup, and output vocabulary. If you permute the IDs and permute every matching row and column consistently, the model is still speaking the same discrete language under renamed labels.
If you change only the tokenizer IDs and not the learned weights, the agreement breaks and the model reads the wrong symbols.
Worked Solution Exercise 2
A model has vocabulary size |V| = 32,000. What does that number influence at runtime besides tokenization?
It determines the size of both the input-side embedding table and the output-side logits vector. The embedding table has one learned row per vocabulary entry, and the output head produces one score per vocabulary entry for each predicted position.
That is why |V| is not just tokenizer metadata. It is also a memory and compute number.
What does the model see as input?
What happens when text contains a string that is not represented as one whole token in the vocabulary?
If the vocabulary has 32,000 entries, what is the range of valid token IDs?
If |V| = 50,000, how many logits does the model produce for one next-token prediction?