What Is a Token?
The model does not see characters or words. It sees tokens — pieces of text that come from a fixed vocabulary.
A token might be a whole word, a word fragment, a punctuation mark, or even a single character. The tokenizer decides how to split input text into these pieces, and different tokenizers make different choices.
The critical insight: token count and word count are not the same. "tokenization" is one word, but it might be two or three tokens. A single space character might be its own token.
When people say a model has a "context length of 128K tokens," they mean 128,000 of these pieces — not 128,000 words.
This matters operationally: English prose, source code, JSON, log lines, emojis, and oddly formatted text can all produce very different token counts even when the character count looks similar. Engineers who reason in characters or words usually underestimate prompt cost.
Type text below and see how it splits into tokens. Try prose, punctuation-heavy text, repeated spaces, and code-like inputs. This widget is intentionally a teaching tokenizer, not the real tokenizer for any deployed model.
The explorer above is a teaching tokenizer. The widget below runs GPT-2's actual tokenizer in your browser. Try the same inputs and compare — notice how the real tokenizer handles spaces, punctuation, and emoji differently.
The sentence "The cat sat on the mat." tokenizes as:
The | ·cat | ·sat | ·on | ·the | ·mat | .
6 words, 7 tokens. The · represents a leading space: it is part of the token, not a separate piece.
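A minimal Python sketch of that kind of split, assuming a simplified regular expression rather than any real tokenizer. The pattern `PIECE` and the helper `toy_pieces` are illustrative names, not part of a library. The pattern fuses one leading space into the following word and keeps punctuation as its own piece, which is enough to reproduce the word and token counts for this sentence.

```python
import re

# Simplified pattern (an assumption for teaching, not a real tokenizer):
# an optional leading space fused to a run of letters or digits,
# runs of other symbols kept separate, leftover whitespace as its own piece.
PIECE = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def toy_pieces(text):
    """Split text into word-like pieces; joining the pieces restores the text."""
    return PIECE.findall(text)

sentence = "The cat sat on the mat."
pieces = toy_pieces(sentence)

print(len(sentence.split()), "words")   # 6 words
print(len(pieces), "pieces")            # 7 pieces
print([p.replace(" ", "·") for p in pieces])
# ['The', '·cat', '·sat', '·on', '·the', '·mat', '.']
```

A real tokenizer would go one step further and break rare pieces into smaller vocabulary entries, so treat these counts as a lower bound on intuition, not a measurement.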
Neural networks operate on numbers, not on strings. That sounds obvious, but it does not yet explain why we need tokens instead of just using raw character codes. In principle, we could feed the model one Unicode code point at a time. In practice, that would make sequences much longer, which would make every later stage of inference more expensive.
At the other extreme, we could try to store every whole word as its own symbol. That sounds convenient for English prose, but it breaks down quickly. Natural language contains inflections, typos, names, punctuation variants, code fragments, URLs, numbers, and multilingual text. A pure word-level vocabulary would either be enormous or would fail constantly on novel strings.
Subword tokenization is the compromise. Common patterns like " the", "ing", or "tion" can be single pieces, while rarer text can still be expressed by smaller fragments.
This keeps the vocabulary finite and reusable without forcing every input down to individual characters.
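The compromise can be sketched in a few lines of Python. This is a greedy longest-match segmenter over a tiny hand-written vocabulary, which is an assumption made for illustration: real BPE tokenizers apply learned merge rules instead, and `TOY_VOCAB` and `segment` are hypothetical names. The point it shows is that known fragments stay whole while unknown text falls back to single characters.

```python
# Hypothetical toy vocabulary; a real one has tens of thousands of entries.
TOY_VOCAB = {"token", "ization", "ing", "tion", "the"}

def segment(word, vocab):
    """Greedy longest-match segmentation with single-character fallback."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, shrinking one char at a time.
        for j in range(len(word), i, -1):
            # Accept a vocabulary hit, or a single character as the last resort.
            if word[i:j] in vocab or j == i + 1:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

print(segment("tokenization", TOY_VOCAB))  # ['token', 'ization']
print(segment("tokenizing", TOY_VOCAB))    # ['token', 'i', 'z', 'ing']
```

The second word is the same length as many common words but costs four pieces, because only part of it matches a known fragment.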
| Input pattern | Why token count can jump |
|---|---|
| hello, world!!! | Punctuation often becomes its own pieces or breaks merges that would work in plain prose. |
| foo(bar)==42 | Code mixes identifiers, punctuation, digits, and operators, which usually fragment more than natural language. |
| The··cat···sat | Whitespace is not free. Repeated spaces can create extra pieces or change token boundaries. |
| naïve🙂 | Rare Unicode combinations often split into smaller known pieces, so character count and token count drift apart. |
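One way to make the last row concrete, assuming a byte-level tokenizer (GPT-2's is one) that can fall back to raw UTF-8 bytes for text it has no larger piece for. The sketch below only compares character counts with UTF-8 byte counts; it does not tokenize anything, and `char_and_byte_counts` is an illustrative helper name.

```python
def char_and_byte_counts(text):
    """Return (number of characters, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))

samples = [
    "hello, world!!!",
    "foo(bar)==42",
    "The  cat   sat",              # two spaces, then three spaces
    "na\u00efve\U0001F642",        # "naïve" followed by a smiley emoji
]

for s in samples:
    chars, nbytes = char_and_byte_counts(s)
    print(f"{s!r}: {chars} chars, {nbytes} UTF-8 bytes")
```

The first three inputs are pure ASCII, so characters and bytes match. The last one is 6 characters but 10 bytes, which is why rare Unicode can cost more pieces than its visual length suggests when a tokenizer has to fall back toward bytes.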
To build intuition for token counts in practice, here are rough numbers for common input types (using a typical BPE tokenizer with ~128K vocabulary):
The rule of thumb "1 token ≈ 0.75 words" works for English prose but breaks down for code, structured data, and non-Latin scripts, which typically cost more tokens per information unit.
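The arithmetic behind that rule of thumb, as a small Python sketch. The function names are illustrative, and the 0.75 ratio is only the English-prose approximation quoted above, so the result is a rough estimate and not a substitute for running the actual tokenizer.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb for English prose only

def estimate_tokens(word_count, words_per_token=WORDS_PER_TOKEN):
    """Rough token estimate for English prose, rounded up."""
    return math.ceil(word_count / words_per_token)

def estimate_words(token_count, words_per_token=WORDS_PER_TOKEN):
    """Rough number of English words that fit in a token budget."""
    return math.floor(token_count * words_per_token)

print(estimate_tokens(1500))     # 2000
print(estimate_words(128_000))   # 96000
```

Under this approximation a 128K-token context holds roughly 96,000 words of English prose, and noticeably less code or JSON.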
From the engineer's perspective, treat tokenization as a lossy view of text structure. The tokenizer is not trying to preserve words, syntax trees, or user intent. It is trying to rewrite text as a sequence of entries from a finite vocabulary that the model already knows how to embed.
That means the boundaries you care about as a human often disappear. A word may become several pieces. Several characters of punctuation may each become separate tokens. A leading space may be fused into the next token piece. The model does not know that one token "is a word" and another token "is punctuation." It only knows that each ID leads to a learned vector.
This is the right place to form a durable habit: when you estimate inference cost, think in tokenized sequence length, not in words, lines, or bytes. The raw string is what users see. The token sequence is what the model pays for.
The only math concept here is a discrete sequence. A tokenized input is an ordered list of symbols from a finite set:
(t₁, t₂, …, tₙ)
where each tᵢ is a symbol from the vocabulary and n is the number of tokens.
From a systems perspective, the most important number is n. Every later stage of the model scales with the token sequence length, not with the original number of characters.
Once text has been tokenized, everything downstream becomes sequence processing over n discrete items.
The embedding layer receives n IDs. Attention later runs over positions in that token sequence. Context limits are measured in tokens.
Sampling predicts one new token ID at a time.
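A sketch of what "one new token ID at a time" looks like as a loop, in Python. The function `fake_next_token` is a stand-in invented for this example: it returns an arbitrary valid ID and has no learned behavior. The only thing the sketch is meant to show is that the runtime state is a list of integers whose length n grows by exactly one per step.

```python
def fake_next_token(ids, vocab_size):
    """Stand-in for a model: returns some valid ID, with no learned behavior."""
    return (sum(ids) + len(ids)) % vocab_size

def generate(prompt_ids, max_new_tokens, vocab_size=50):
    ids = list(prompt_ids)              # n starts at the prompt length
    for _ in range(max_new_tokens):
        next_id = fake_next_token(ids, vocab_size)
        ids.append(next_id)             # the sequence grows by one ID per step
    return ids

prompt_ids = [3, 7, 1]
out = generate(prompt_ids, max_new_tokens=4)
print(len(prompt_ids), "->", len(out))  # 3 -> 7
```

Nothing in the loop touches a string. Text only reappears when the IDs are decoded for display.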
This is why "prompt engineering" and "systems engineering" meet at tokenization. A reformatted prompt, an extra JSON wrapper, a long stack trace, or a pasted source file may change token count much more than the raw visual length suggests. If you are optimizing latency or cost, tokenization is not an implementation detail you can ignore.
If you want one concrete runtime anchor, open `common/common.cpp` and find `common_tokenize()`.
That helper is the bridge from user-visible text to `std::vector<llama_token>`. It calls into the vocabulary implementation and returns integers, not strings.
Then open `tools/completion/completion.cpp` and find the line `embd_inp = common_tokenize(ctx, prompt, true, true)`.
What to notice: tokenization is done before any model evaluation begins. Once that line runs, the rest of the runtime works with token IDs stored in `embd_inp`, not with the original text string.
This is also why tokenizer mismatch is fatal. The tokenizer and the embedding table are a paired contract. The embedding table belongs to the very next module, where token IDs finally become vectors, but the systems point already matters before you know that math: the tokenizer decides which integer represents each piece of text, and the embedding table later decides which vector that integer fetches. Break the mapping and the model is reading the wrong discrete symbols before any transformer layer has a chance to help.
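A toy Python illustration of the broken contract. Both vocabularies here are hypothetical four-entry dictionaries, and `as_seen_by_model` simply inverts the trained vocabulary to show which piece each ID would mean to a model whose embedding rows were learned under that numbering.

```python
pieces = ["The", " cat", " sat"]

# Numbering the (hypothetical) model was trained with: ID i selects embedding row i.
vocab_trained = {"The": 0, " cat": 1, " sat": 2, " dog": 3}
# A different tokenizer that knows the same pieces but numbers them differently.
vocab_wrong = {" dog": 0, "The": 1, " sat": 2, " cat": 3}

def encode(pieces, vocab):
    """Map text pieces to integer IDs under a given vocabulary."""
    return [vocab[p] for p in pieces]

def as_seen_by_model(ids, trained_vocab):
    """Show which piece each ID stands for under the trained numbering."""
    id_to_piece = {i: p for p, i in trained_vocab.items()}
    return [id_to_piece[i] for i in ids]

good_ids = encode(pieces, vocab_trained)
bad_ids = encode(pieces, vocab_wrong)

print(good_ids, as_seen_by_model(good_ids, vocab_trained))
# [0, 1, 2] ['The', ' cat', ' sat']
print(bad_ids, as_seen_by_model(bad_ids, vocab_trained))
# [1, 3, 2] [' cat', ' dog', ' sat']
```

The mismatched tokenizer raises no error. It produces valid integers that select the wrong rows, so the failure shows up only as degraded output.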
Tokenization is fast relative to model computation. But the token count determines how much work the model must do. More tokens means more computation in every layer. This is why "context length" is measured in tokens — it directly controls the cost.
It also explains why code, JSON, logs, and chat transcripts can feel "expensive" even when they are not visually long: they often split into more tokens per character than ordinary prose.
A useful rule of thumb is that tokenization is rarely the latency bottleneck, but it is often the hidden reason the bottleneck is larger than expected. Engineers usually profile the model kernels and forget that the input format determined how much work those kernels had to do.
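A back-of-the-envelope Python sketch of how input format feeds into model work. It assumes a deliberately rough cost model: work applied once per token grows linearly with n, and full self-attention compares every position with every other, so the number of position pairs grows with n². Caching and kernel details are ignored, and the token counts are hypothetical, not measurements.

```python
def relative_cost(n_base, n_new):
    """Compare two token counts under a rough linear / quadratic cost model."""
    ratio = n_new / n_base
    return {
        "per_token_work": ratio,        # grows linearly with n
        "attention_pairs": ratio ** 2,  # grows with n squared
    }

# Hypothetical: the same payload costs 1000 tokens as prose, 1500 as wrapped JSON.
print(relative_cost(1000, 1500))
# {'per_token_work': 1.5, 'attention_pairs': 2.25}
```

Under these assumptions, a formatting choice that adds 50% more tokens more than doubles the pairwise attention work, without any change to the model itself.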
Worked Solution Exercise 1
Why might `foo(bar)==42` consume more tokens than a plain-English sentence with a similar number of characters?
Because tokenizers are built from reusable text pieces, not from a perfect understanding of source code.
Natural-language phrases often contain common merges such as " the" or "ing", while code mixes identifiers, punctuation, digits, and operators that fragment into smaller pieces more often.
The durable engineering habit is: do not estimate prompt cost from visual length. Measure with the exact tokenizer.
Worked Solution Exercise 2
Why is the sentence “the model never sees raw text” more than a slogan?
Because the runtime really does cross a boundary. Before tokenization, the input is a character string. After tokenization, the runtime works with integer token IDs. The model graph, embedding lookup, batching, and later decoding logic all operate on those IDs.
That means mistakes made at tokenization time are not cosmetic. They change the discrete sequence the whole model conditions on.
A user types "I can't believe it's not butter!" — predict: will this have more tokens or fewer tokens than words?
If you change one character in the middle of a word (e.g., "running" → "runnxng"), what happens to the tokenization?
Which input is most likely to consume more tokens than a similarly sized plain-English sentence?