L02

What Is a Token?

10 min

Question

What is a token?

Intuition

The model does not see characters or words. It sees tokens — pieces of text that come from a fixed vocabulary.

A token might be a whole word, a word fragment, a punctuation mark, or even a single character. The tokenizer decides how to split input text into these pieces, and different tokenizers make different choices.

The critical insight: token count and word count are not the same. "tokenization" is one word, but it might be two or three tokens. A single space character might be its own token.

When people say a model has a "context length of 128K tokens," they mean 128,000 of these pieces — not 128,000 words.

This matters operationally: English prose, source code, JSON, log lines, emojis, and oddly formatted text can all produce very different token counts even when the character count looks similar. Engineers who reason in characters or words usually underestimate prompt cost.

Interactive — Tokenizer Explorer

Type text below and see how it splits into tokens. Try prose, punctuation-heavy text, repeated spaces, and code-like inputs.

Teaching approximation only. This widget demonstrates how token pieces and IDs behave, but it is not the real tokenizer for any production model. For cost, context, or accuracy work, always measure token counts with the exact tokenizer shipped with the model you serve.

Example input: The cat sat on the mat.

23 characters, 6 words, 7 tokens

Token Pieces

The ·cat ·sat ·on ·the ·mat .

Token IDs

Piece	ID
The	791
·cat	2368
·sat	3290
·on	373
·the	279
·mat	1744
.	13

Try It Live — Real GPT-2 Tokenizer

The explorer above is a teaching tokenizer. The widget below runs GPT-2's actual tokenizer in your browser. Try the same inputs and compare — notice how the real tokenizer handles spaces, punctuation, and emoji differently.

Toy Example

The sentence "The cat sat on the mat." tokenizes as:

The ·cat ·sat ·on ·the ·mat .

6 words, 7 tokens. The · represents a leading space — it is part of the token, not separate.
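The split above can be reproduced with a minimal sketch. It assumes a tiny hand-written vocabulary (the pieces and IDs are copied from the teaching example, not from any production tokenizer) and uses greedy longest-match as a stand-in for a real subword algorithm. The names `VOCAB` and `tokenize` are illustrative only.

```python
# Hypothetical teaching vocabulary: pieces and IDs copied from the example above.
# A leading space is stored inside the piece itself.
VOCAB = {
    "The": 791,
    " cat": 2368,
    " sat": 3290,
    " on": 373,
    " the": 279,
    " mat": 1744,
    ".": 13,
}


def tokenize(text: str) -> list[str]:
    """Greedy longest-match split. A teaching stand-in, not real BPE."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a piece matches.
        for end in range(len(text), i, -1):
            if text[i:end] in VOCAB:
                pieces.append(text[i:end])
                i = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches at position {i}")
    return pieces


text = "The cat sat on the mat."
pieces = tokenize(text)
ids = [VOCAB[p] for p in pieces]

print(len(text), "characters")
print(len(text.split()), "words")
print(len(pieces), "tokens")
print([p.replace(" ", "·") for p in pieces])
print(ids)
```

Joining the pieces gives back the original string, which is why the leading space has to live inside the token rather than being dropped.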

Why Models Do Not Read Raw Text

Neural networks operate on numbers, not on strings. That sounds obvious, but it does not yet explain why we need tokens instead of just using raw character codes. In principle, we could feed the model one Unicode code point at a time. In practice, that would make sequences much longer, which would make every later stage of inference more expensive.

At the other extreme, we could try to store every whole word as its own symbol. That sounds convenient for English prose, but it breaks down quickly. Natural language contains inflections, typos, names, punctuation variants, code fragments, URLs, numbers, and multilingual text. A pure word-level vocabulary would either be enormous or would fail constantly on novel strings.

Subword tokenization is the compromise. Common patterns like " the", "ing", or "tion" can be single pieces, while rarer text can still be expressed by smaller fragments. This keeps the vocabulary finite and reusable without forcing every input down to individual characters.
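To make the three options concrete, here is a small sketch comparing character-level, word-level, and subword splits of the same string. The vocabularies `WORD_VOCAB` and `SUBWORDS` are invented for illustration, and the greedy longest-match splitter is an assumption standing in for a trained subword algorithm.

```python
# Hypothetical vocabularies, chosen only to illustrate the trade-off.
WORD_VOCAB = {"the", "token"}
SUBWORDS = {" the", "ing", "tion", " token", "iz", "a"}


def subword_split(text: str) -> list[str]:
    """Greedy longest match over SUBWORDS, falling back to single characters."""
    pieces, i = [], 0
    while i < len(text):
        for end in range(len(text), i, -1):
            # A single character is always accepted as the last resort.
            if text[i:end] in SUBWORDS or end == i + 1:
                pieces.append(text[i:end])
                i = end
                break
    return pieces


text = " the tokenization"

char_level = list(text)                 # one symbol per character
word_level = text.split()               # one symbol per whitespace-separated word
subword_level = subword_split(text)     # reusable fragments

# Words the word-level vocabulary cannot represent at all.
unknown = [w for w in word_level if w not in WORD_VOCAB]

print(len(char_level), "character symbols")
print(len(word_level), "word symbols, unknown:", unknown)
print(len(subword_level), "subword pieces:", subword_level)
```

The character split is the longest sequence, the word split fails on the novel word, and the subword split covers the whole string with a handful of pieces.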

Where Engineers Get Surprised

Input pattern	Why token count can jump
hello, world!!!	Punctuation often becomes its own pieces or breaks merges that would work in plain prose.
foo(bar)==42	Code mixes identifiers, punctuation, digits, and operators, which usually fragment more than natural language.
The··cat···sat	Whitespace is not free. Repeated spaces can create extra pieces or change token boundaries.
naïve🙂	Rare Unicode combinations often split into smaller known pieces, so character count and token count drift apart.
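One way to see why the last row is expensive: byte-level BPE tokenizers, such as GPT-2's, start from UTF-8 bytes, so a character with no learned merge can cost up to one token per byte. The sketch below only compares character counts with UTF-8 byte counts for the inputs in the table. It does not run a real tokenizer, and the helper name `char_and_byte_counts` is illustrative.

```python
def char_and_byte_counts(text: str) -> tuple[int, int]:
    """Return (code points, UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


samples = [
    "hello, world!!!",
    "foo(bar)==42",
    "The  cat   sat",             # two spaces, then three spaces
    "na\u00efve\U0001F642",       # "naïve" with precomposed ï, then the 🙂 emoji
]

for s in samples:
    n_chars, n_bytes = char_and_byte_counts(s)
    print(f"{s!r}: {n_chars} characters, {n_bytes} UTF-8 bytes")
```

For the ASCII rows the two counts match, so any extra tokens come from how merges break around punctuation and spaces. For the last row the byte count is larger than the character count, which gives a rough upper bound on how far a byte-level tokenizer could fragment it.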

Concrete Token Costs

To build intuition for token counts in practice, here are rough numbers for common input types (using a typical BPE tokenizer with ~128K vocabulary):

1 page of English prose (~500 words) ≈ 650 tokens

A 10-line Python function ≈ 80-120 tokens

A JSON object with 5 key-value pairs ≈ 40-60 tokens

A URL like https://example.com/path ≈ 10-15 tokens

A single emoji (🎉) ≈ 2-4 tokens

The rule of thumb "1 token ≈ 0.75 words" works for English prose but breaks down for code, structured data, and non-Latin scripts, which typically cost more tokens per information unit.
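The rule of thumb turns into simple arithmetic. The sketch below assumes the 0.75 words-per-token ratio quoted above and applies only to English prose. The function names are illustrative, and the results are estimates, not measurements.

```python
WORDS_PER_TOKEN = 0.75  # rule of thumb from the text; English prose only


def estimate_tokens(n_words: int) -> int:
    """Rough token count for English prose with n_words words."""
    return round(n_words / WORDS_PER_TOKEN)


def estimate_words(n_tokens: int) -> int:
    """Rough number of English words that fit in n_tokens tokens."""
    return round(n_tokens * WORDS_PER_TOKEN)


print(estimate_tokens(500))      # one page of prose
print(estimate_words(128_000))   # a 128K-token context window
```

For a 500-word page this lands in the same range as the ~650 tokens listed above, and it suggests a 128K-token context holds roughly 96,000 words of prose. For code, JSON, or non-Latin scripts, expect the real count to be higher than this estimate.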

A Better Mental Model

From the engineer's perspective, treat tokenization as a view that is lossy about text structure. The tokenizer is not trying to preserve words, syntax trees, or user intent. It is trying to rewrite text as a sequence of entries from a finite vocabulary the model already knows how to embed.

That means the boundaries you care about as a human often disappear. A word may become several pieces. Several characters of punctuation may each become separate tokens. A leading space may be fused into the next token piece. The model does not know that one token "is a word" and another token "is punctuation." It only knows that each ID leads to a learned vector.

This is the right place to form a durable habit: when you estimate inference cost, think in tokenized sequence length, not in words, lines, or bytes. The raw string is what users see. The token sequence is what the model pays for.

Shapes

Input text: a string of characters

After tokenization: a sequence of token pieces

Sequence length (n_tokens): the number of tokens in the sequence

n_tokens ≠ n_words (usually n_tokens > n_words)

Two strings with similar character counts can still have very different n_tokens.

Math

The only math concept here is a discrete sequence. A tokenized input is an ordered list of symbols from a finite set:

tokens = [t₁, t₂, ..., tₙ]

where each tᵢ is a symbol from the vocabulary.

From a systems perspective, the most important number is n. Every later stage of the model scales with the token sequence length, not with the original number of characters.

Practical Consequences

Once text has been tokenized, everything downstream becomes sequence processing over n discrete items. The embedding layer receives n IDs. Attention later runs over positions in that token sequence. Context limits are measured in tokens. Sampling predicts one new token ID at a time.

This is why "prompt engineering" and "systems engineering" meet at tokenization. A reformatted prompt, an extra JSON wrapper, a long stack trace, or a pasted source file may change token count much more than the raw visual length suggests. If you are optimizing latency or cost, tokenization is not an implementation detail you can ignore.
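The effect of reformatting and wrapping can be sketched with the standard library alone. The piece counter below is a crude proxy (letter runs, digit runs, whitespace runs, and single symbols), not a real tokenizer, so treat the numbers as directional only. `crude_piece_count` and the sample `record` are made up for illustration.

```python
import json
import re


def crude_piece_count(text: str) -> int:
    """Count letter runs, digit runs, whitespace runs, and single symbols.

    A rough proxy only; a real tokenizer will give different numbers.
    """
    return len(re.findall(r"[A-Za-z]+|\d+|\s+|[^A-Za-z\d\s]", text))


record = {"user": "ada", "id": 42, "tags": ["a", "b"]}

compact = json.dumps(record, separators=(",", ":"))
pretty = json.dumps(record, indent=2)
wrapped = json.dumps({"status": "ok", "data": record}, indent=2)

for name, text in [("compact", compact), ("pretty", pretty), ("wrapped", wrapped)]:
    print(f"{name}: {len(text)} characters, ~{crude_piece_count(text)} pieces")
```

All three strings carry the same record, yet the piece count grows with indentation and with the extra wrapper. For real budgeting, replace the proxy with the exact tokenizer of the model you serve.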

Implementation Hook

If you want one concrete runtime anchor, open common/common.cpp and find common_tokenize(). That helper is the bridge from user-visible text to std::vector<llama_token>. It calls into the vocabulary implementation and returns integers, not strings.

Then open tools/completion/completion.cpp and find the line embd_inp = common_tokenize(ctx, prompt, true, true). What to notice: tokenization is done before any model evaluation begins. Once that line runs, the rest of the runtime works with token IDs stored in embd_inp, not with the original text string.

This is also why tokenizer mismatch is fatal. The tokenizer and the embedding table are a paired contract. The embedding table is the subject of the very next module, where token IDs finally become vectors, but the systems point matters even before you know that math: the tokenizer decides which integer represents each piece of text, and the embedding table later decides which vector that integer fetches. Break the mapping and the model is reading the wrong discrete symbols before any transformer layer has a chance to help.

common/common.cpp — common_tokenize() (L1533)
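The mismatch problem can be illustrated without any model at all. The sketch below uses two hypothetical vocabularies that contain the same pieces but assign them different IDs. It is plain Python for illustration and does not use the llama.cpp API.

```python
# Two hypothetical vocabularies: same pieces, different ID assignments.
VOCAB_A = {"The": 0, " cat": 1, " sat": 2, ".": 3}
VOCAB_B = {".": 0, " sat": 1, "The": 2, " cat": 3}


def encode(pieces: list[str], vocab: dict[str, int]) -> list[int]:
    """Map token pieces to integer IDs."""
    return [vocab[p] for p in pieces]


def decode(ids: list[int], vocab: dict[str, int]) -> str:
    """Map integer IDs back to text using the given vocabulary."""
    id_to_piece = {i: p for p, i in vocab.items()}
    return "".join(id_to_piece[i] for i in ids)


pieces = ["The", " cat", " sat", "."]
ids = encode(pieces, VOCAB_A)

print(ids)
print(repr(decode(ids, VOCAB_A)))  # matching vocabulary
print(repr(decode(ids, VOCAB_B)))  # mismatched vocabulary
```

The IDs are valid in both vocabularies, so nothing crashes. The mismatched side simply reads different symbols, which is the same failure an embedding table would hit when it is indexed with IDs from the wrong tokenizer.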

Performance Hook

Tokenization is fast relative to model computation. But the token count determines how much work the model must do. More tokens means more computation in every layer. This is why "context length" is measured in tokens — it directly controls the cost.

It is also why code, JSON, logs, and chat transcripts can feel "expensive" even when they are not visually long: they often need more tokens per character than ordinary prose.

A useful rule of thumb is that tokenization is rarely the latency bottleneck, but it is often the hidden reason the bottleneck is larger than expected. Engineers usually profile the model kernels and forget that the input format determined how much work those kernels had to do.

Worked Practice with Solutions

Worked Solution Exercise 1

Why might `foo(bar)==42` consume more tokens than a plain-English sentence with a similar number of characters?

Because tokenizers are built from reusable text pieces, not from a perfect understanding of source code. Natural-language phrases often contain common merges such as " the" or "ing", while code mixes identifiers, punctuation, digits, and operators that fragment into smaller pieces more often.

The durable engineering habit is: do not estimate prompt cost from visual length. Measure with the exact tokenizer.

Worked Solution Exercise 2

Why is the sentence “the model never sees raw text” more than a slogan?

Because the runtime really does cross a boundary. Before tokenization, the input is a character string. After tokenization, the runtime works with integer token IDs. The model graph, embedding lookup, batching, and later decoding logic all operate on those IDs.

That means mistakes made at tokenization time are not cosmetic. They change the discrete sequence the whole model conditions on.

Check Yourself

Q1 (reasoning)

A user types "I can't believe it's not butter!" Predict: will this have more tokens or fewer tokens than words?

Fewer tokens: the tokenizer merges common phrases
More tokens: contractions and punctuation likely create extra token pieces
Exactly the same: one word equals one token

Q2 (reasoning)

If you change one character in the middle of a word (e.g., "running" → "runnxng"), what happens to the tokenization?

Only the changed character's token changes
The whole word may re-tokenize into completely different pieces because token merges depend on character sequences
Nothing changes: the tokenizer works on whole words

Q3 (shape)

Which input is most likely to consume more tokens than a similarly sized plain-English sentence?

A short paragraph of ordinary prose with common words
A punctuation-heavy code snippet like foo(bar)==42
A sentence with the same common word repeated