Discrete Input Synthesis: Text, IDs, and Budget

15 min
Can you trace one prompt from raw text to token budget?

The previous three lessons introduced the whole discrete interface one piece at a time: how text becomes tokens, how tokens become IDs, and how those IDs consume a finite context budget. A self-learner also needs one place where those ideas are put back together before the course moves into vectors and matrix math.

This page is that checkpoint. The goal is to make one durable sentence feel completely natural: before the model does any linear algebra, the runtime has already committed to a specific token sequence drawn from a fixed vocabulary and a specific budget for how long that sequence may become.

raw text = "The cat sat"
token pieces = ["The", " cat", " sat"]
token IDs = [791, 2368, 3290]
predicted next token later = " on" (ID 373)

This is intentionally small. A tiny running example makes it easier to see the contracts clearly before the math module starts talking about embeddings and hidden states.

Keep this picture in mind as you leave the token module. It is the whole front door of inference before any embedding lookup or transformer math begins.

Raw Text: "The cat sat"
Token Pieces: ["The", " cat", " sat"]
Token IDs: [791, 2368, 3290]
Budget Check: n_prompt + n_generated ≤ L

The input begins as a string that humans can read. The tokenizer rewrites that string as pieces from a fixed vocabulary. Those pieces do not have to align with words. Leading spaces may fuse into the next piece, punctuation may stand alone, and awkward text may fragment heavily.

"The cat sat" → ["The", " cat", " sat"]

That rewrite is already operationally meaningful. Sequence length, later cost, and eventual context fit all depend on the tokenized sequence, not on the original character string.
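The rewrite from string to pieces can be sketched in a few lines. This is a toy, not a production tokenizer: the vocabulary below is hypothetical and chosen only to match the running example, and the splitting rule is greedy longest-match, which is simpler than the merge rules real BPE tokenizers apply.

```python
# Hypothetical toy vocabulary, chosen to match the running example.
TOY_VOCAB = ["The", " cat", " sat", " on", " ", "c", "a", "t", "s"]


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Split text into pieces by greedy longest-match against a fixed vocabulary."""
    by_length = sorted(vocab, key=len, reverse=True)  # try longer pieces first
    pieces = []
    pos = 0
    while pos < len(text):
        for piece in by_length:
            if text.startswith(piece, pos):
                pieces.append(piece)
                pos += len(piece)
                break
        else:
            # No piece matched: this toy vocabulary cannot represent the text.
            raise ValueError(f"no vocabulary piece matches at position {pos}")
    return pieces


print(toy_tokenize("The cat sat"))  # ['The', ' cat', ' sat']
print(toy_tokenize("cat"))          # ['c', 'a', 't']
```

In this toy vocabulary, " cat" with its leading space is one piece, while "cat" without it falls apart into three. The specific split is an artifact of the made-up vocabulary, but it shows why sequence length is a property of the tokenized text rather than of the characters.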

In the wild: when tokenization surprises you

Leading space matters: "cat" at the start of a prompt may tokenize differently from " cat" mid-sentence because the leading space is part of the token piece.
Unicode splits: Emoji or CJK characters can fragment into multiple tokens. A single emoji may cost 2-4 tokens, consuming more budget than expected.
Code is expensive: Source code often fragments heavily. Variable names, operators, and whitespace can each become separate tokens, so a short code snippet can consume far more tokens than prose of the same character count.

Each token piece is then mapped to a vocabulary ID. The numbers themselves carry no meaning, but the mapping must be consistent: the tokenizer, embedding table, and output vocabulary all have to agree on it.

"The" → 791
" cat" → 2368
" sat" → 3290

At this point the runtime already has the model's discrete input:

[791, 2368, 3290]

The model has still not "seen text" in any human sense. It will only see this ID sequence and whatever special tokens or system-formatting tokens the runtime added around it.
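As a minimal sketch, the ID contract is a lookup table that has to work in both directions. The IDs below are copied from the running example; a real tokenizer would assign its own, and the names PIECE_TO_ID, encode, and decode are illustrative rather than taken from any library.

```python
# IDs copied from the running example; real vocabularies assign their own.
PIECE_TO_ID = {"The": 791, " cat": 2368, " sat": 3290, " on": 373}
ID_TO_PIECE = {token_id: piece for piece, token_id in PIECE_TO_ID.items()}


def encode(pieces):
    """Map token pieces to integer IDs (input side)."""
    return [PIECE_TO_ID[piece] for piece in pieces]


def decode(ids):
    """Map integer IDs back to text (output side)."""
    return "".join(ID_TO_PIECE[token_id] for token_id in ids)


ids = encode(["The", " cat", " sat"])
print(ids)                  # [791, 2368, 3290]
print(decode(ids))          # The cat sat
print(decode(ids + [373]))  # The cat sat on
```

The same table serves both ends: it produces the input IDs, and it turns a predicted ID such as 373 back into " on". If the two directions used different tables, the output would no longer correspond to what the model predicted.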

Suppose the serving process is configured with a runtime context of 8 tokens. The prompt already costs 3 tokens, so only 5 remain for everything else: additional prompt formatting, generation, and any reserved answer budget.

L = 8
n_prompt = 3
remaining = L - n_prompt = 5

If the product wants to guarantee up to 4 generated tokens, the request fits (3 + 4 = 7 ≤ 8). If the product wants to guarantee up to 7 generated tokens, it does not fit (3 + 7 = 10 > 8). This decision happens before any attention scores, FFN activations, or logits are computed.
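The budget check is plain integer arithmetic, which a short sketch makes explicit. The function names here are illustrative; a real runtime may also trim or reject according to its own policy.

```python
def remaining(n_prompt, context_len):
    """Tokens left in the context after the prompt is counted."""
    return context_len - n_prompt


def fits(n_prompt, n_reserved, context_len):
    """True if the prompt plus the reserved output fits the context."""
    return n_prompt + n_reserved <= context_len


L = 8
n_prompt = 3
print(remaining(n_prompt, L))  # 5
print(fits(n_prompt, 4, L))    # True
print(fits(n_prompt, 7, L))    # False
```

Nothing in this check depends on model weights, which is why it can run before any model computation starts.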

In real systems, the text a user thinks they sent is often not the whole request the model sees. A chat runtime may add a BOS token, role markers, separators, assistant-prefix tokens, or other control structure around the visible text.

visible user text = "Summarize this log"
actual request might look like:
[BOS] [SYSTEM] "You are concise." [USER] "Summarize this log" [ASSISTANT]

The exact token pieces depend on the real tokenizer and chat template, but the systems lesson is stable: budgeting based only on visible user text is often wrong. Production token counts must include hidden formatting and control tokens as well as the text the user typed.
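To make the gap between visible text and the full request concrete, here is a sketch of a hypothetical chat template. The marker names and their order follow the illustration above and are assumptions, not the format of any particular model.

```python
def render_chat(system_text, user_text):
    """Return (kind, content) segments in the order the model would receive them.

    The markers are hypothetical; real chat templates differ per model.
    """
    return [
        ("control", "[BOS]"),
        ("control", "[SYSTEM]"),
        ("hidden_text", system_text),
        ("control", "[USER]"),
        ("user_text", user_text),
        ("control", "[ASSISTANT]"),
    ]


segments = render_chat("You are concise.", "Summarize this log")
not_typed_by_user = [content for kind, content in segments if kind != "user_text"]

print(len(segments), "segments in the request")
print(len(not_typed_by_user), "were never typed by the user:", not_typed_by_user)
```

Five of the six segments in this sketch come from the runtime rather than the user, and every one of them would be tokenized and counted against the budget.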

Contract | What is fixed | What can vary per request or deployment
tokenization contract | the tokenizer vocabulary and merge rules | the input text that gets split
ID contract | the mapping from token pieces to integer IDs | which specific IDs appear in this request
budget contract | the runtime context configured for this process | prompt length, reserved output, and truncation policy

If you want to see the three contracts in real code, use this map:

Tokenization
Open common/common.cpp and look at common_tokenize(). What to notice: user-visible text is converted into a vector of token IDs before model evaluation begins.
Vocabulary / ID validity
Open src/llama-batch.cpp and read llama_batch_allocr::init(). What to notice: the runtime validates that every token ID actually belongs to the vocabulary range.
Context budget
Open tools/completion/completion.cpp. What to notice: the runtime tokenizes the prompt, compares prompt length against n_ctx, and refuses or trims requests that do not fit the configured budget.

The teaching widget in this module is for intuition, not for production counting. When you need the real number, measure with the exact tokenizer that ships with the model.

Concrete example
Open examples/simple/simple.cpp. It first calls llama_tokenize(..., NULL, 0, ...) to ask how many tokens the prompt will require, then allocates exactly that much space and tokenizes again.
Operational rule
If latency, context fit, or cost matters, never estimate from words and never trust a toy tokenizer. Run the real tokenizer against the real formatted prompt.

Worked Solution Exercise 1

A prompt tokenizes to 2,800 tokens. The runtime context is 4,096, and the product wants to reserve 512 tokens for the answer. Does the request fit?

Yes. The request fits because 2,800 + 512 = 3,312, which is below the runtime budget of 4,096. There are 4,096 - 3,312 = 784 tokens of slack left.

This is the production-style question you want to get used to asking: not "does the prompt fit?" but "does the prompt fit while preserving the answer budget we promised?"

Worked Solution Exercise 2

Why can't the runtime simply invent a new token ID for a novel string it has never seen before?

Because the model's discrete interface is fixed. The embedding table only has learned rows for the vocabulary IDs that already exist, and the output head only produces logits over that same fixed ID set.

A brand-new runtime ID would have no trained embedding row, no learned meaning, and no matching output vocabulary entry. The tokenizer has to decompose novel strings into known pieces instead.
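A toy embedding table shows why an invented ID has nowhere to land. The sizes and values below are hypothetical placeholders; in a trained model each row holds learned numbers, and the range check stands in for the kind of validation a runtime performs on incoming token IDs.

```python
VOCAB_SIZE = 4  # hypothetical toy vocabulary size
EMBED_DIM = 3   # hypothetical embedding width

# One row per vocabulary ID. Placeholder values stand in for learned weights.
embedding_table = [[0.1 * (row + 1)] * EMBED_DIM for row in range(VOCAB_SIZE)]


def lookup(token_id):
    """Return the embedding row for a token ID, rejecting IDs outside the vocabulary."""
    if not 0 <= token_id < VOCAB_SIZE:
        raise ValueError(
            f"token ID {token_id} is outside the vocabulary range [0, {VOCAB_SIZE})"
        )
    return embedding_table[token_id]


print(len(lookup(2)))  # a valid ID returns a row of EMBED_DIM values

try:
    lookup(VOCAB_SIZE)  # an "invented" ID one past the last row
except ValueError as err:
    print("rejected:", err)
```

The table has exactly VOCAB_SIZE rows, so an ID outside that range has no row to read. Adding one at runtime would mean adding an untrained row, which is the situation the exercise describes.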

Worked Solution Exercise 3

Why are words the wrong unit for capacity planning?

Because the runtime pays for tokenized sequence length, not for visual word count. The same number of words can produce very different token counts depending on punctuation, whitespace, code fragments, Unicode, or unusual text.

This is why tokenization is the real front door to inference cost. Words are what users see; tokens are what the model budgets.

Worked Solution Exercise 4

A chat request has these real token counts after formatting: system prompt 180, user message 920, hidden role/control tokens 64, retrieved context 1,900, reserved answer budget 512. The runtime context is 4,096. Does the request fit, and how much slack remains?

First compute the full budget, not just the visible user text: 180 + 920 + 64 + 1,900 + 512 = 3,576.

The request fits because 3,576 < 4,096. The slack is 4,096 - 3,576 = 520 tokens.

This is the production habit the module wants to teach: always count the formatted request, including hidden protocol tokens and reserved output budget, not just the text the user can see.
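The Exercise 4 arithmetic can be sketched as an itemized budget. The component names are illustrative labels for the counts given in the exercise; in practice each count would come from running the real tokenizer on the formatted request.

```python
def budget_report(components, context_len):
    """Sum itemized token counts and compare the total against the context length."""
    total = sum(components.values())
    return {
        "total": total,
        "fits": total <= context_len,
        "slack": context_len - total,
    }


exercise_4 = {
    "system_prompt": 180,
    "user_message": 920,
    "role_control_tokens": 64,
    "retrieved_context": 1900,
    "reserved_answer": 512,
}

report = budget_report(exercise_4, context_len=4096)
print(report)  # {'total': 3576, 'fits': True, 'slack': 520}
```

Keeping the components itemized makes it easy to see which part of the request to shrink when the slack goes negative.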

Memorize
text becomes token pieces, token pieces become IDs, IDs come from a fixed vocabulary, and prompt plus output share one runtime token budget.
Measure
the exact token count of any real prompt, especially when chat formatting, system prompts, RAG context, or special tokens may add hidden overhead.
Do not assume
that words predict token count, that visible user text equals the full request, or that a model's advertised context automatically matches the runtime budget of a deployed process.

Aspect | Stable for a given model/runtime | Varies per request or deployment
tokenizer / vocabulary | piece-to-ID mapping, special-token meanings | which pieces and IDs appear in this prompt
request formatting | chat template / protocol used by this runtime | actual system prompt, user content, retrieved context, hidden control tokens
context budget | configured n_ctx for this process | prompt length, reserved output, truncation or rejection outcome

At the end of this module, the runtime has done three crucial things:

1. rewritten text as token pieces
2. rewritten token pieces as token IDs
3. checked that the ID sequence fits the deployed token budget

The next module finally answers the natural next question: once the runtime has a valid token-ID sequence, how do those integers become vectors the model can compute on?

Check Yourself
Q1 (conceptual)

What is the most accurate summary of the discrete-input stage?

Q2 (math)

A runtime has a context of 8,192 tokens. Your prompt tokenizes to 7,500 tokens. How many tokens remain for generation if you do no truncation?

Q3 (shape)

Why is a tokenizer mismatch catastrophic rather than mildly inconvenient?