L04A

Discrete Input Synthesis: Text, IDs, and Budget

15 min

Question

Can you trace one prompt from raw text to token budget?

Why This Recap Exists

The previous three lessons introduced the whole discrete interface one piece at a time: how text becomes tokens, how tokens become IDs, and how those IDs consume a finite context budget. A self-learner also needs one place where those ideas are put back together before the course moves into vectors and matrix math.

This page is that checkpoint. The goal is to make one durable sentence feel completely natural: before the model does any linear algebra, the runtime has already committed to a specific token sequence drawn from a fixed vocabulary and a specific budget for how long that sequence may become.

The Running Prompt

raw text = "The cat sat"

token pieces = ["The", " cat", " sat"]

token IDs = [791, 2368, 3290]

next token predicted later = " on" (ID 373)

This is intentionally small. A tiny running example makes it easier to see the contracts clearly before the math module starts talking about embeddings and hidden states.

Diagram: The Discrete Interface

Keep this picture in mind as you leave the token module. It is the whole front door of inference before any embedding lookup or transformer math begins.

Raw Text

"The cat sat"

→

Token Pieces

["The", " cat", " sat"]

→

Token IDs

[791, 2368, 3290]

→

Budget Check

n_prompt + n_generated ≤ L

Step 1: Raw Text Becomes Token Pieces

The input begins as a string that humans can read. The tokenizer rewrites that string as pieces from a fixed vocabulary. Those pieces do not have to align with words. Leading spaces may fuse into the next piece, punctuation may stand alone, and awkward text may fragment heavily.

"The cat sat" → ["The", " cat", " sat"]

That rewrite is already operationally meaningful. Sequence length, later cost, and eventual context fit all depend on the tokenized sequence, not on the original character string.
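To make the piece-splitting step concrete, here is a minimal sketch of a greedy longest-match splitter. It is a teaching toy, not the algorithm any particular model uses (real tokenizers typically apply learned merge rules), and `TOY_VOCAB` and `toy_tokenize` are hypothetical names with a vocabulary invented for this example.

```python
# Toy vocabulary: an assumption for illustration, not a real model's vocabulary.
# Note that it contains " cat" (with a leading space) but no bare "cat".
TOY_VOCAB = ["The", " cat", " sat", " on", " ",
             "T", "h", "e", "c", "a", "t", "s", "o", "n"]


def toy_tokenize(text: str, vocab: list[str]) -> list[str]:
    """Split text into vocabulary pieces, always taking the longest match."""
    # Try longer pieces first so " cat" wins over " " followed by "c".
    by_length = sorted(vocab, key=len, reverse=True)
    pieces = []
    pos = 0
    while pos < len(text):
        for piece in by_length:
            if text.startswith(piece, pos):
                pieces.append(piece)
                pos += len(piece)
                break
        else:
            # No piece matched: the toy vocabulary cannot cover this text.
            raise ValueError(f"no vocabulary piece matches at position {pos}")
    return pieces


print(toy_tokenize("The cat sat", TOY_VOCAB))  # ['The', ' cat', ' sat']
print(toy_tokenize("cat sat", TOY_VOCAB))      # ['c', 'a', 't', ' sat']
```

The second call shows why sequence length depends on the tokenized form: the same three letters cost one piece mid-sentence and three pieces at the start of the string, purely because of what this toy vocabulary happens to contain.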

In the wild: when tokenization surprises you

Leading space matters: "cat" at the start of a prompt may tokenize differently from " cat" mid-sentence because the leading space is part of the token piece.

Unicode splits: Emoji or CJK characters can fragment into multiple tokens. A single emoji may cost 2-4 tokens, consuming more budget than expected.

Code is expensive: Source code often fragments heavily. Variable names, operators, and whitespace can each become separate tokens, so a short code snippet can consume far more tokens than the same character count in prose.
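One way to build intuition for the Unicode case is to look at UTF-8 byte lengths. The sketch below assumes a tokenizer that works on bytes or falls back to bytes for characters without a dedicated vocabulary entry; under that assumption, the byte count is a rough upper bound on what a single character can cost. The helper name `utf8_len` is made up for this example, and the real count still has to come from the model's tokenizer.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


for sample in ["cat", "猫", "😀"]:
    # len(sample) counts Unicode code points; utf8_len counts encoded bytes.
    print(repr(sample), "code points:", len(sample), "UTF-8 bytes:", utf8_len(sample))

# "cat" is 3 code points and 3 bytes.
# "猫" is 1 code point and 3 bytes.
# "😀" is 1 code point and 4 bytes.
```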

Step 2: Token Pieces Become Token IDs

Each token piece is then mapped to a vocabulary ID. The numbers themselves are arbitrary labels, but the mapping is not arbitrary: the tokenizer, embedding table, and output vocabulary all have to agree on it.

"The" → 791

" cat" → 2368

" sat" → 3290

At this point the runtime already has the model's discrete input:

[791, 2368, 3290]

The model has still not "seen text" in any human sense. It will only see this ID sequence and whatever special tokens or system-formatting tokens the runtime added around it.
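The piece-to-ID step can be sketched as a plain dictionary lookup. The IDs below are the ones used in this lesson's running example; the names `PIECE_TO_ID`, `encode`, and `decode` are hypothetical, and a real vocabulary holds tens of thousands of entries rather than four.

```python
# Piece-to-ID mapping taken from the running example in this lesson.
PIECE_TO_ID = {"The": 791, " cat": 2368, " sat": 3290, " on": 373}

# The reverse mapping must be derived from the same table, never maintained
# separately, so that encoding and decoding always agree.
ID_TO_PIECE = {token_id: piece for piece, token_id in PIECE_TO_ID.items()}


def encode(pieces: list[str]) -> list[int]:
    """Map token pieces to IDs; an unknown piece raises KeyError."""
    return [PIECE_TO_ID[piece] for piece in pieces]


def decode(ids: list[int]) -> str:
    """Map IDs back to pieces and join them into text."""
    return "".join(ID_TO_PIECE[token_id] for token_id in ids)


ids = encode(["The", " cat", " sat"])
print(ids)                   # [791, 2368, 3290]
print(decode(ids))           # The cat sat
print(decode(ids + [373]))   # The cat sat on
```

Because the numbers are only labels, swapping in a different table would change every ID while the text stayed the same. That is why the tokenizer, embedding table, and output vocabulary have to share one mapping.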

Step 3: The Runtime Checks Budget Before the Math Starts

Suppose the serving process is configured with a runtime context of 8 tokens. The prompt already costs 3 tokens, so only 5 remain for everything else: additional prompt formatting, generation, and any reserved answer budget.

L = 8

n_prompt = 3

remaining = L - n_prompt = 5

If the product wants to guarantee up to 4 generated tokens, the request fits. If the product wants to guarantee up to 7 generated tokens, it does not fit. This decision happens before any attention scores, FFN activations, or logits are computed.
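The same check, written as a small function. This is a sketch of the inequality n_prompt + n_generated ≤ L from the diagram, with a hypothetical helper name `fits`; it is not taken from any runtime's source.

```python
def fits(n_prompt: int, n_generated: int, n_ctx: int) -> bool:
    """True if the prompt plus the guaranteed output fits in the context."""
    return n_prompt + n_generated <= n_ctx


L = 8         # configured runtime context
n_prompt = 3  # "The cat sat" costs three tokens

print(L - n_prompt)          # 5 tokens remain
print(fits(n_prompt, 4, L))  # True: 3 + 4 = 7
print(fits(n_prompt, 7, L))  # False: 3 + 7 = 10
```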

Visible Prompt vs Actual Request

In real systems, the text a user thinks they sent is often not the whole request the model sees. A chat runtime may add a BOS token, role markers, separators, assistant-prefix tokens, or other control structure around the visible text.

visible user text = "Summarize this log"

actual request might look like:

[BOS] [SYSTEM] "You are concise." [USER] "Summarize this log" [ASSISTANT]

The exact token pieces depend on the real tokenizer and chat template, but the systems lesson is stable: budgeting based only on visible user text is often wrong. Production token counts must include hidden formatting and control tokens as well as the text the user typed.
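A minimal sketch of that gap follows. The template layout mirrors the example above, but everything else is an assumption: `build_request` is a hypothetical helper, each control marker is counted as exactly one token, and `whitespace_stand_in` is a deliberately crude placeholder for the real tokenizer so the example stays self-contained.

```python
from typing import Callable


def whitespace_stand_in(text: str) -> list[str]:
    """Placeholder tokenizer for this sketch only; real counts will differ."""
    return text.split()


def build_request(system: str, user: str,
                  tokenize: Callable[[str], list[str]]) -> list[str]:
    """Wrap system and user text in control tokens (hypothetical template)."""
    return ["[BOS]",
            "[SYSTEM]", *tokenize(system),
            "[USER]", *tokenize(user),
            "[ASSISTANT]"]


visible = whitespace_stand_in("Summarize this log")
request = build_request("You are concise.", "Summarize this log",
                        whitespace_stand_in)

print(len(visible))  # 3: what the user thinks they sent
print(len(request))  # 10: what the budget is actually charged
```

In production, pass the model's real tokenizer and the runtime's real chat template through the same kind of function, then budget against the length of the full request.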

One Table, Three Contracts

Contract	What is fixed	What can vary per request or deployment
tokenization contract	the tokenizer vocabulary and merge rules	the input text that gets split
ID contract	the mapping from token pieces to integer IDs	which specific IDs appear in this request
budget contract	the runtime context configured for this process	prompt length, reserved output, and truncation policy

Code Reading Map

If you want to see the three contracts in real code, use this map:

Tokenization

Open common/common.cpp and look at common_tokenize(). What to notice: user-visible text is converted into a vector of token IDs before model evaluation begins.

Vocabulary / ID validity

Open src/llama-batch.cpp and read llama_batch_allocr::init(). What to notice: the runtime validates that every token ID actually belongs to the vocabulary range.
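The kind of range check described here can be sketched in a few lines. This is an illustration of the idea, not a transcription of the C++ code; `validate_token_ids` is a hypothetical name and the vocabulary size of 32,000 is an assumed value.

```python
def validate_token_ids(ids: list[int], n_vocab: int) -> None:
    """Raise ValueError if any ID falls outside the range [0, n_vocab)."""
    for position, token_id in enumerate(ids):
        if not 0 <= token_id < n_vocab:
            raise ValueError(
                f"token ID {token_id} at position {position} "
                f"is outside the vocabulary range [0, {n_vocab})"
            )


N_VOCAB = 32_000  # assumed vocabulary size for this sketch

validate_token_ids([791, 2368, 3290], N_VOCAB)  # passes silently

try:
    validate_token_ids([791, 32_000], N_VOCAB)
except ValueError as err:
    print("rejected:", err)
```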

Context budget

Open tools/completion/completion.cpp. What to notice: the runtime tokenizes the prompt, compares prompt length against n_ctx, and refuses or trims requests that do not fit the configured budget.
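As a rough illustration of "refuse or trim", the sketch below applies one of two policies to an over-long prompt. Both the function `apply_budget` and the policy names are invented for this example, and keeping the most recent tokens is only one possible trimming rule; the actual tool's behavior should be read from its source.

```python
def apply_budget(prompt_ids: list[int], n_ctx: int, n_reserved: int,
                 policy: str = "reject") -> list[int]:
    """Return a prompt that fits n_ctx - n_reserved, or raise ValueError."""
    max_prompt = n_ctx - n_reserved
    if max_prompt <= 0:
        raise ValueError("reserved output leaves no room for a prompt")
    if len(prompt_ids) <= max_prompt:
        return list(prompt_ids)
    if policy == "reject":
        raise ValueError(
            f"prompt has {len(prompt_ids)} tokens, limit is {max_prompt}")
    if policy == "keep_tail":
        # Hypothetical trimming rule: keep only the most recent tokens.
        return list(prompt_ids[-max_prompt:])
    raise ValueError(f"unknown policy: {policy}")


ids = [791, 2368, 3290]
print(apply_budget(ids, n_ctx=8, n_reserved=4))                      # [791, 2368, 3290]
print(apply_budget(ids, n_ctx=8, n_reserved=6, policy="keep_tail"))  # [2368, 3290]
```

Whichever policy a deployment chooses, the decision is made on token counts before model evaluation, which is the point of the budget contract.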

How to Measure with the Real Tokenizer

The teaching widget in this module is for intuition, not for production counting. When you need the real number, measure with the exact tokenizer that ships with the model.

Concrete example

Open examples/simple/simple.cpp. It first calls llama_tokenize(..., NULL, 0, ...) to ask how many tokens the prompt will require, then allocates exactly that much space and tokenizes again.
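The two-pass pattern can be mimicked in Python to see its shape. In this sketch `toy_tokenize_into` is a hypothetical stand-in, not the library call: it emits one ID per UTF-8 byte, and the convention of returning a negative count when the buffer is too small is an assumption of the sketch, chosen so that a zero-length buffer can be used to ask for the required size.

```python
def toy_tokenize_into(text: str, out: list[int]) -> int:
    """Write token IDs into out; if out is too small, return -(needed size)."""
    ids = list(text.encode("utf-8"))  # toy rule: one ID per UTF-8 byte
    if len(out) < len(ids):
        return -len(ids)              # signal the required size, write nothing
    out[:len(ids)] = ids
    return len(ids)


prompt = "The cat sat"

# Pass 1: an empty buffer turns the call into a size query.
n_prompt = -toy_tokenize_into(prompt, [])

# Pass 2: allocate exactly that much space and tokenize for real.
buffer = [0] * n_prompt
written = toy_tokenize_into(prompt, buffer)

print(n_prompt, written)  # 11 11
```

The design point is that the count comes from the tokenizer itself, so the number used for allocation and budgeting cannot drift from the sequence the model will actually receive.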

Operational rule

If latency, context fit, or cost matters, never estimate from words and never trust a toy tokenizer. Run the real tokenizer against the real formatted prompt.

Worked Practice with Solutions

Worked Solution Exercise 1

A prompt tokenizes to 2,800 tokens. The runtime context is 4,096, and the product wants to reserve 512 tokens for the answer. Does the request fit?

Yes. The request fits because 2,800 + 512 = 3,312, which is below the runtime budget of 4,096. There are 4,096 - 3,312 = 784 tokens of slack left.

This is the production-style question you want to get used to asking: not "does the prompt fit?" but "does the prompt fit while preserving the answer budget we promised?"

Worked Solution Exercise 2

Why can't the runtime simply invent a new token ID for a novel string it has never seen before?

Because the model's discrete interface is fixed. The embedding table only has learned rows for the vocabulary IDs that already exist, and the output head only produces logits over that same fixed ID set.

A brand-new runtime ID would have no trained embedding row, no learned meaning, and no matching output vocabulary entry. The tokenizer has to decompose novel strings into known pieces instead.
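A tiny sketch makes the "no trained row" point tangible. The table below is filled with random numbers purely as a placeholder for learned weights, and the sizes and the helper `embed` are assumptions for illustration.

```python
import random

N_VOCAB = 5   # toy vocabulary size (assumption)
D_MODEL = 4   # toy embedding width (assumption)

rng = random.Random(0)
# One row per vocabulary ID; random values stand in for trained weights.
embedding_table = [[rng.uniform(-1.0, 1.0) for _ in range(D_MODEL)]
                   for _ in range(N_VOCAB)]


def embed(ids: list[int]) -> list[list[float]]:
    """Look up one embedding row per token ID."""
    rows = []
    for token_id in ids:
        if not 0 <= token_id < len(embedding_table):
            raise IndexError(f"token ID {token_id} has no embedding row")
        rows.append(embedding_table[token_id])
    return rows


print(len(embed([0, 4])))  # 2 rows, one per known ID

try:
    embed([N_VOCAB])       # an invented ID just past the end of the table
except IndexError as err:
    print("lookup failed:", err)
```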

Worked Solution Exercise 3

Why are words the wrong unit for capacity planning?

Because the runtime pays for tokenized sequence length, not for visual word count. The same number of words can produce very different token counts depending on punctuation, whitespace, code fragments, Unicode, or unusual text.

This is why tokenization is the real front door to inference cost. Words are what users see; tokens are what the model budgets.

Worked Solution Exercise 4

A chat request has these real token counts after formatting: system prompt 180, user message 920, hidden role/control tokens 64, retrieved context 1,900, reserved answer budget 512. The runtime context is 4,096. Does the request fit, and how much slack remains?

First compute the full budget, not just the visible user text: 180 + 920 + 64 + 1,900 + 512 = 3,576.

The request fits because 3,576 < 4,096. The slack is 4,096 - 3,576 = 520 tokens.

This is the production habit the module wants to teach: always count the formatted request, including hidden protocol tokens and reserved output budget, not just the text the user can see.
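The same accounting can be kept as a labeled breakdown so that no component is silently dropped. The numbers are the ones from this exercise; the helper name `budget_report` is hypothetical.

```python
def budget_report(parts: dict[str, int], n_ctx: int) -> tuple[int, int, bool]:
    """Return (total tokens, slack, fits) for a labeled token breakdown."""
    total = sum(parts.values())
    slack = n_ctx - total
    return total, slack, slack >= 0


request_tokens = {
    "system prompt": 180,
    "user message": 920,
    "hidden role/control tokens": 64,
    "retrieved context": 1_900,
    "reserved answer budget": 512,
}

total, slack, ok = budget_report(request_tokens, n_ctx=4_096)
print(total, slack, ok)  # 3576 520 True
```

A negative slack means the request does not fit and some component has to shrink, usually the retrieved context or the reserved answer budget.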

What to Memorize vs What to Measure

Memorize

text becomes token pieces, token pieces become IDs, IDs come from a fixed vocabulary, and prompt plus output share one runtime token budget.

Measure

the exact token count of any real prompt, especially when chat formatting, system prompts, RAG context, or special tokens may add hidden overhead.

Do not assume

that words predict token count, that visible user text equals the full request, or that a model's advertised context automatically matches the runtime budget of a deployed process.

What Stayed Fixed, What Varied

Aspect	Stable for a given model/runtime	Varies per request or deployment
tokenizer / vocabulary	piece-to-ID mapping, special-token meanings	which pieces and IDs appear in this prompt
request formatting	chat template / protocol used by this runtime	actual system prompt, user content, retrieved context, hidden control tokens
context budget	configured n_ctx for this process	prompt length, reserved output, truncation or rejection outcome

Bridge to Linear Algebra

At the end of this module, the runtime has done three crucial things:

1. rewritten text as token pieces

2. rewritten token pieces as token IDs

3. checked that the ID sequence fits the deployed token budget

The next module finally answers the natural next question: once the runtime has a valid token-ID sequence, how do those integers become vectors the model can compute on?

Check Yourself

Q1 (conceptual)

What is the most accurate summary of the discrete-input stage?

The runtime turns text into token IDs and checks that the resulting sequence fits the configured token budget before model math begins

The runtime turns text directly into embeddings and then counts how many words were used

The runtime lets the model invent new token IDs if the prompt is unusual

Q2 (math)

A runtime has context 8,192 tokens. Your prompt tokenizes to 7,500 tokens. How many tokens remain for generation if you do no truncation?

Q3 (shape)

Why is a tokenizer mismatch catastrophic rather than mildly inconvenient?

Because the model will immediately refuse to run whenever the wrong tokenizer is used

Because the wrong tokenizer changes the discrete ID sequence, so the embedding lookup and downstream computation start from the wrong symbols

Because it only affects the final detokenization step after generation