Discrete Input Synthesis: Text, IDs, and Budget
The previous three lessons introduced the whole discrete interface one piece at a time: how text becomes tokens, how tokens become IDs, and how those IDs consume a finite context budget. A self-learner also needs one place where those ideas are put back together before the course moves into vectors and matrix math.
This page is that checkpoint. The goal is to make one durable sentence feel completely natural: before the model does any linear algebra, the runtime has already committed to a specific token sequence drawn from a fixed vocabulary and a specific budget for how long that sequence may become.
The running example on this page is intentionally tiny. A small example makes it easier to see the contracts clearly before the math module starts talking about embeddings and hidden states.
Keep this picture in mind as you leave the token module. It is the whole front door of inference before any embedding lookup or transformer math begins.
The input begins as a string that humans can read. The tokenizer rewrites that string as pieces from a fixed vocabulary. Those pieces do not have to align with words. Leading spaces may fuse into the next piece, punctuation may stand alone, and awkward text may fragment heavily.
That rewrite is already operationally meaningful. Sequence length, later cost, and eventual context fit all depend on the tokenized sequence, not on the original character string.
In the wild: when tokenization surprises you
"cat" at the start of a prompt may tokenize differently from " cat" mid-sentence because the leading space is part of the token piece.
Each token piece is then mapped to a vocabulary ID. The numbers themselves are arbitrary labels, but the mapping is not arbitrary: the tokenizer, embedding table, and output vocabulary all have to agree on it.
At this point the runtime already has the model's discrete input: an ordered sequence of integer token IDs.
The model has still not "seen text" in any human sense. It will only see this ID sequence and whatever special tokens or system-formatting tokens the runtime added around it.
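To make the text-to-pieces-to-IDs path concrete, here is a minimal sketch. The vocabulary, the ID values, and the example string are all hypothetical, and the greedy longest-match rule is a stand-in chosen for brevity; real tokenizers typically apply learned merge rules over a vocabulary of tens of thousands of pieces.

```python
# Hypothetical toy vocabulary. Note that " cat" (with a leading space)
# and "cat" are different pieces with different IDs.
VOCAB = {"The": 0, " cat": 1, " sat": 2, "cat": 3}


def tokenize(text, vocab=VOCAB):
    """Split text into known pieces using greedy longest match (an assumption)."""
    pieces = []
    i = 0
    max_len = max(len(piece) for piece in vocab)
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in vocab:
                pieces.append(candidate)
                i += length
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return pieces


def encode(text, vocab=VOCAB):
    """Map each piece to its integer ID."""
    return [vocab[piece] for piece in tokenize(text, vocab)]


print(tokenize("The cat sat"))  # pieces the tokenizer chose
print(encode("The cat sat"))    # the ID sequence the model would receive
print(encode("cat"), encode(" cat"))  # same word, different IDs
```

In this toy setup the three-word string becomes three pieces, and the last line shows why a leading space matters: the two spellings of "cat" map to different IDs.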
Suppose the serving process is configured with a runtime context of 8 tokens. The prompt already costs 3 tokens, so only 5 remain for everything else: additional prompt formatting, generation, and any reserved answer budget.
If the product wants to guarantee up to 4 generated tokens, the request fits. If the product wants to guarantee up to 7 generated tokens, it does not fit. This decision happens before any attention scores, FFN activations, or logits are computed.
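The fit decision above is plain integer arithmetic, which a short sketch makes explicit. The function names are illustrative, and the sketch assumes that a request which exactly fills the context still counts as fitting.

```python
def remaining_budget(n_prompt: int, n_ctx: int) -> int:
    """Tokens left in the context after the prompt is counted."""
    return n_ctx - n_prompt


def fits(n_prompt: int, n_reserved_output: int, n_ctx: int) -> bool:
    """True if the prompt plus the guaranteed output fits the runtime context."""
    return n_prompt + n_reserved_output <= n_ctx


n_ctx = 8     # runtime context from the running example
n_prompt = 3  # prompt length in tokens

print(remaining_budget(n_prompt, n_ctx))  # tokens left for everything else
print(fits(n_prompt, 4, n_ctx))           # guarantee of 4 generated tokens
print(fits(n_prompt, 7, n_ctx))           # guarantee of 7 generated tokens
```

Nothing in this check touches model weights, which is the point: the runtime can accept or reject the request before any computation on the model begins.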
In real systems, the text a user thinks they sent is often not the whole request the model sees. A chat runtime may add a BOS token, role markers, separators, assistant-prefix tokens, or other control structure around the visible text.
The exact token pieces depend on the real tokenizer and chat template, but the systems lesson is stable: budgeting based only on visible user text is often wrong. Production token counts must include hidden formatting and control tokens as well as the text the user typed.
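A small sketch shows how the visible count and the real count can diverge. The control-token IDs and the shape of the template below are hypothetical; an actual chat template is defined by the model and runtime you deploy.

```python
# Hypothetical control-token IDs. Real values come from the model's tokenizer.
BOS = 1
USER_START = 2
TURN_END = 3
ASSISTANT_START = 4


def format_chat(user_ids):
    """Wrap the user's token IDs in a made-up chat template."""
    return [BOS, USER_START, *user_ids, TURN_END, ASSISTANT_START]


user_ids = [101, 102, 103]  # what the user thinks they sent
full_ids = format_chat(user_ids)  # what the model actually receives

visible = len(user_ids)
total = len(full_ids)
print(f"visible={visible} total={total} hidden={total - visible}")
```

Even this minimal template more than doubles a three-token message, so a budget check based on `visible` alone would undercount.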
| Contract | What is fixed | What can vary per request or deployment |
|---|---|---|
| tokenization contract | the tokenizer vocabulary and merge rules | the input text that gets split |
| ID contract | the mapping from token pieces to integer IDs | which specific IDs appear in this request |
| budget contract | the runtime context configured for this process | prompt length, reserved output, and truncation policy |
If you want to see the three contracts in real code, use this map:
Open common/common.cpp and look at common_tokenize(). What to notice: user-visible text is converted into a vector of token IDs before model evaluation begins.
Open src/llama-batch.cpp and read llama_batch_allocr::init(). What to notice: the runtime validates that every token ID actually belongs to the vocabulary range.
Open tools/completion/completion.cpp. What to notice: the runtime tokenizes the prompt, compares prompt length against n_ctx, and refuses or trims requests that do not fit the configured budget.
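The ID range validation mentioned above can be sketched in a few lines. This is an illustration of the idea in Python, not a transcription of the C++ source; the function name and the vocabulary size of 32,000 are assumptions.

```python
def find_invalid_ids(token_ids, n_vocab):
    """Return (position, id) pairs for IDs outside the range [0, n_vocab)."""
    return [
        (pos, tid)
        for pos, tid in enumerate(token_ids)
        if not 0 <= tid < n_vocab
    ]


N_VOCAB = 32000  # hypothetical vocabulary size

print(find_invalid_ids([0, 5, 31999], N_VOCAB))   # every ID is in range
print(find_invalid_ids([0, 32000, -1], N_VOCAB))  # two IDs are out of range
```

A runtime that finds any out-of-range ID should reject the batch rather than evaluate it, since such an ID does not correspond to any piece in the vocabulary.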
The teaching widget in this module is for intuition, not for production counting. When you need the real number, measure with the exact tokenizer that ships with the model.
For a concrete pattern, see examples/simple/simple.cpp. It first calls llama_tokenize(..., NULL, 0, ...) to ask how many tokens the prompt will require, then allocates exactly that much space and tokenizes again.
Worked Solution Exercise 1
A prompt tokenizes to 2,800 tokens. The runtime context is 4,096, and the product wants to reserve 512 tokens for the answer. Does the request fit?
Yes. The request fits because 2,800 + 512 = 3,312, which is below the runtime budget of 4,096.
There are 4,096 - 3,312 = 784 tokens of slack left.
This is the production-style question you want to get used to asking: not "does the prompt fit?" but "does the prompt fit while preserving the answer budget we promised?"
Worked Solution Exercise 2
Why can't the runtime simply invent a new token ID for a novel string it has never seen before?
Because the model's discrete interface is fixed. The embedding table only has learned rows for the vocabulary IDs that already exist, and the output head only produces logits over that same fixed ID set.
A brand-new runtime ID would have no trained embedding row, no learned meaning, and no matching output vocabulary entry. The tokenizer has to decompose novel strings into known pieces instead.
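The same argument can be seen in a toy model. The sizes and weight values below are made up, and reusing the embedding rows as output weights is an assumption made only to keep the sketch short.

```python
N_VOCAB = 4  # hypothetical vocabulary size
D_MODEL = 2  # hypothetical vector width

# One "learned" row per vocabulary ID. The values are placeholders.
embedding_table = [[float(i), float(i) * 0.5] for i in range(N_VOCAB)]


def output_logits(hidden):
    """One logit per vocabulary ID: dot product of hidden with each row."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in embedding_table]


row = embedding_table[2]             # a known ID has a row to look up
logits = output_logits([1.0, 1.0])   # the output covers exactly N_VOCAB IDs

try:
    embedding_table[N_VOCAB]         # an invented ID, one past the last row
    invented_id_has_row = True
except IndexError:
    invented_id_has_row = False

print(len(logits), invented_id_has_row)
```

Both ends of the model are sized by the same vocabulary, so an ID invented at runtime has nowhere to enter and nowhere to come out.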
Worked Solution Exercise 3
Why are words the wrong unit for capacity planning?
Because the runtime pays for tokenized sequence length, not for visual word count. The same number of words can produce very different token counts depending on punctuation, whitespace, code fragments, Unicode, or unusual text.
This is why tokenization is the real front door to inference cost. Words are what users see; tokens are what the model budgets.
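A rough sketch can show the gap between word count and piece count. The regular expression below is a crude stand-in for a tokenizer, chosen only to make the effect visible; it is not how any real model splits text, and the two example strings are invented.

```python
import re

# Crude stand-in: a run of word characters (optionally with one leading
# space) is one piece, and each punctuation character is its own piece.
PIECE_PATTERN = re.compile(r" ?\w+|[^\w\s]")


def count_words(text):
    return len(text.split())


def count_pieces(text):
    return len(PIECE_PATTERN.findall(text))


plain = "the cat sat down"
awkward = "don't re-run foo(bar); please!"

for text in (plain, awkward):
    print(f"{text!r}: words={count_words(text)} pieces={count_pieces(text)}")
```

Both strings contain four whitespace-separated words, but the second one fragments into many more pieces because of its punctuation. Real tokenizers differ in the details, yet the direction of the effect is the same.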
Worked Solution Exercise 4
A chat request has these real token counts after formatting: system prompt 180, user message 920, hidden role/control tokens 64, retrieved context 1,900, reserved answer budget 512. The runtime context is 4,096. Does the request fit, and how much slack remains?
First compute the full budget, not just the visible user text:
180 + 920 + 64 + 1,900 + 512 = 3,576.
The request fits because 3,576 < 4,096. The slack is 4,096 - 3,576 = 520 tokens.
This is the production habit the module wants to teach: always count the formatted request, including hidden protocol tokens and reserved output budget, not just the text the user can see.
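The calculation in Exercise 4 generalizes to a small helper that sums named components and reports the slack. The function and the component names are illustrative, not part of any particular runtime.

```python
def budget_report(components, n_ctx):
    """Sum every token component and compare the total with the context size."""
    total = sum(components.values())
    return {"total": total, "fits": total <= n_ctx, "slack": n_ctx - total}


# Token counts from Exercise 4.
request = {
    "system_prompt": 180,
    "user_message": 920,
    "control_tokens": 64,
    "retrieved_context": 1900,
    "reserved_answer": 512,
}

report = budget_report(request, n_ctx=4096)
print(report)
```

Keeping the components in a named mapping makes it harder to forget one of them, which is the usual way budget checks go wrong.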
| Aspect | Stable for a given model/runtime | Varies per request or deployment |
|---|---|---|
| tokenizer / vocabulary | piece-to-ID mapping, special-token meanings | which pieces and IDs appear in this prompt |
| request formatting | chat template / protocol used by this runtime | actual system prompt, user content, retrieved context, hidden control tokens |
| context budget | configured n_ctx for this process | prompt length, reserved output, truncation or rejection outcome |
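At the end of this module, the runtime has done three crucial things: it has rewritten the input text as pieces from a fixed vocabulary, mapped those pieces to integer IDs, and checked the resulting sequence against the configured context budget.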
The next module finally answers the natural next question: once the runtime has a valid token-ID sequence, how do those integers become vectors the model can compute on?
What is the most accurate summary of the discrete-input stage?
A runtime has context 8,192 tokens. Your prompt tokenizes to 7,500 tokens. How many tokens remain for generation if you do no truncation?
Why is a tokenizer mismatch catastrophic rather than mildly inconvenient?