
Context Length Means Token Capacity

What does context length measure?

The model has a maximum number of tokens it can process at once. This is the context length (or context window).

Both the prompt and the generated output share this budget. If the context length is 8,192 tokens and your prompt uses 6,000, there are only 2,192 tokens left for generation.

Context length is not measured in characters or words. It is measured in tokens, because that is the unit the model operates on.

In practice, context length is not just a spec-sheet number. It determines whether a prompt fits at all, how much room remains for the answer, and how expensive long prompts become at runtime.

For serving work, keep one extra distinction in mind from the start: there is the model's trained or advertised context capability, and there is the context size actually configured for a running process. The deployed budget users experience is the runtime-configured one.

The scenario below shows how the prompt and the requested output share one window. A request can fail even before generation starts if the prompt already consumes most of the context window.

Example scenario: context window 8,192 tokens, prompt 6,200 tokens, 800 output tokens requested. Result: request fits.
1,992 tokens left after prompt
800 tokens the model can actually generate
0 tokens that do not fit
Prefill cost: all 6,200 prompt tokens must be processed before generation starts. Longer prompts delay the first output token even if you only want a short answer.
Generation budget: the requested answer is capped by the remaining 1,992 tokens. If you ask for more, the runtime must truncate, reject, or evict earlier context.
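
The scenario above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function and field names are hypothetical, and the 8,192-token window is an assumption inferred from the 6,200 and 1,992 figures.

```python
from dataclasses import dataclass


@dataclass
class BudgetReport:
    left_after_prompt: int  # tokens remaining once the prompt is placed
    can_generate: int       # requested output tokens that actually fit
    overflow: int           # prompt or output tokens that do not fit
    fits: bool              # True when nothing overflows


def budget_report(n_ctx: int, n_prompt: int, n_requested: int) -> BudgetReport:
    """Split one context window between a prompt and a requested output length."""
    left = max(n_ctx - n_prompt, 0)
    prompt_overflow = max(n_prompt - n_ctx, 0)
    can_generate = min(n_requested, left)
    overflow = prompt_overflow + (n_requested - can_generate)
    return BudgetReport(left, can_generate, overflow, overflow == 0)


# Assumed window of 8,192 tokens, prompt of 6,200, request for 800 output tokens.
print(budget_report(n_ctx=8192, n_prompt=6200, n_requested=800))
```

Raising `n_prompt` toward `n_ctx` shrinks `can_generate` first and then pushes tokens into `overflow`, which is the failure case the text describes.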

A model with context length = 10 tokens:

[The] [·cat] [·sat] [·on] [·the] [·mat] [.]
■ prompt: 7 tokens □ available: 3 tokens
Scenario | Why context matters
Short chat turn | Fits easily, so there is plenty of room for generation and the prompt is cheap to prefill.
RAG prompt with many retrieved passages | The context window is shared by instructions, retrieved text, citations, and the answer budget.
Long source file or transcript | Even if it fits, the model must process every prompt token first, so time-to-first-token gets much worse.

Context length is not a promise that the model will perfectly reason over every token in a huge prompt. It is a capacity limit: the maximum sequence length the runtime and model architecture are prepared to handle.

It is also not the same thing as "memory" in the ordinary English sense. A model with a longer context window can carry more prior tokens forward, but that does not guarantee good retrieval, faithful summarization, or equal attention to every earlier detail. Those are quality questions layered on top of the raw capacity limit.

Engineers often say "the model has an 8K context" as if that were one number. In practice there are at least two related numbers. The model metadata records the context length the model was trained for or expects architecturally. The runtime then chooses a context size for this serving process.

Those can differ. A server may configure a smaller runtime context to save memory, or it may try a larger one with scaling tricks and warnings. So the effective user-facing budget is not just a property of the checkpoint. It is also a deployment choice.

This is the systems distinction to keep: trained capability describes what the model was built for, while runtime context describes what this process will actually allocate and accept.

Trained context: 8,192 tokens
Runtime at deploy A: n_ctx = 4,096 — saves memory, limits user prompts
Runtime at deploy B: n_ctx = 8,192 — uses full trained range
Runtime at deploy C: n_ctx = 16,384 — extended via RoPE scaling, quality may degrade
runtime context budget L (or n_ctx): maximum tokens this serving process is configured to handle
n_prompt + n_generated ≤ L
remaining_budget = L - n_prompt
active sequence length grows from n_prompt to n_prompt + n_generated during generation
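
The three deployments listed above differ only in how n_ctx compares with the trained context. A minimal sketch of that comparison follows; the helper name and the three labels are hypothetical and chosen for illustration, not taken from any runtime.

```python
def describe_runtime_context(n_ctx_train: int, n_ctx: int) -> str:
    """Label a runtime context size relative to the trained context length."""
    if n_ctx < n_ctx_train:
        return "reduced"   # saves memory, limits user prompts
    if n_ctx == n_ctx_train:
        return "full"      # uses the full trained range
    return "extended"      # beyond training; quality may degrade


N_CTX_TRAIN = 8192
for name, n_ctx in [("A", 4096), ("B", 8192), ("C", 16384)]:
    print(f"deploy {name}: n_ctx={n_ctx} -> {describe_runtime_context(N_CTX_TRAIN, n_ctx)}")
```

In every case, the budget a request is checked against is `n_ctx`, not `N_CTX_TRAIN`.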

The only new math concept is a capacity constraint:

n_prompt + n_generated ≤ L

where L is the runtime context budget (n_ctx). Both the input and the output compete for the same budget.

remaining = L - n_prompt

If you request more output tokens than remaining allows, the runtime has to refuse, truncate the request, or drop earlier context, depending on its policy.

There are two distinct moments to keep in mind. First, the runtime must process the entire prompt before generation can begin. This is often called prefill. A 20,000-token prompt is expensive even if the answer will be only 10 tokens long, because all 20,000 prompt tokens still have to go through the model first.

Second, once generation starts, each new token is produced while carrying the accumulated context forward. That means later generated tokens are not independent of prompt length — the model must reference all prior tokens when computing attention, and the memory needed to store that history grows with each new token. A long active context can slow generation even if the per-token decode path is cheaper than prefill. You will learn the exact mechanisms (attention scaling and KV cache growth) in later modules.

This is the systems reason long prompts can feel bad twice: slow time-to-first-token and slower per-token generation afterward.
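
To make the "carried context" idea concrete, the sketch below counts how many earlier tokens are in context when each new token is produced. It is only a counting illustration with a hypothetical helper name, not a latency model; the actual cost mechanisms are covered in later modules.

```python
from typing import List


def prior_tokens_per_step(n_prompt: int, n_generated: int) -> List[int]:
    """Number of earlier tokens in context when each new token is produced."""
    return [n_prompt + step for step in range(n_generated)]


short = prior_tokens_per_step(n_prompt=100, n_generated=10)
long = prior_tokens_per_step(n_prompt=20_000, n_generated=10)

print(short[0], short[-1])    # 100 109
print(long[0], long[-1])      # 20000 20009
print(sum(short), sum(long))  # 1045 200045
```

Both runs generate the same 10 tokens, but every decode step in the second run carries about 20,000 prior tokens with it, on top of a prefill that is 200 times longer.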

Real serving systems have to choose a policy when a request exceeds the context budget. The simplest option is to reject the request. Another common option is to truncate the prompt or cap the allowed output length. Some systems also drop older tokens from a conversation and keep only the most recent span.

None of these are "free." Rejection hurts usability. Truncation may silently remove important instructions. Sliding windows or eviction policies may keep the request alive but change what information the model can still condition on.

This is why context budgeting is a product decision as much as a systems one. The runtime needs a clear policy for what to protect: instructions, recent turns, retrieved documents, or answer budget.
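
The policies described above can be compared side by side in a small sketch. The policy names and the function are hypothetical; real runtimes differ in which options they offer and in what they choose to protect.

```python
from typing import List, Tuple


def apply_policy(tokens: List[int], n_requested: int, n_ctx: int,
                 policy: str) -> Tuple[List[int], int]:
    """Return (prompt tokens kept, output tokens allowed) under one overflow policy."""
    if len(tokens) + n_requested <= n_ctx:
        return tokens, n_requested  # fits; no policy needed
    if policy == "reject":
        raise ValueError("request exceeds the context budget")
    if policy == "cap_output":
        left = n_ctx - len(tokens)
        if left <= 0:
            raise ValueError("prompt alone fills the context")
        return tokens, left  # keep the whole prompt, shorten the answer
    keep = n_ctx - n_requested
    if keep <= 0:
        raise ValueError("requested output alone fills the context")
    if policy == "truncate_prompt":
        return tokens[:keep], n_requested   # drops the end of the prompt
    if policy == "keep_recent":
        return tokens[-keep:], n_requested  # drops the oldest tokens
    raise ValueError(f"unknown policy: {policy}")


prompt = list(range(7900))  # stand-in token ids
for policy in ("cap_output", "truncate_prompt", "keep_recent"):
    kept, allowed = apply_policy(prompt, 600, 8192, policy)
    print(policy, len(kept), allowed)
```

Note what each branch silently gives up: `cap_output` shortens the answer, `truncate_prompt` loses whatever was at the end of the prompt, and `keep_recent` loses whatever was at the start, which is often where the instructions are.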

Open src/llama-cparams.h and notice the field uint32_t n_ctx. That is the context size used during inference for a specific runtime instance. It is the deployment-side budget.

Then open tools/completion/completion.cpp and look at two places. First, the runtime compares llama_model_n_ctx_train(model) with llama_n_ctx(ctx) and warns if the configured runtime context exceeds training context. Second, the code checks if ((int) embd_inp.size() > n_ctx - 4) and rejects prompts that do not fit.

What to notice: the code is modeling two different ideas at once. The checkpoint carries a training-time context expectation, but the runtime enforces the actual configured budget for this process. That is why inference servers expose knobs like context size and max generated tokens separately, and why production systems reserve answer budget explicitly instead of letting the prompt consume the whole window by accident.
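
The two checks described above can be paraphrased in a few lines. This is a Python sketch of the logic only, not the C++ source: the function name and messages are hypothetical, and the margin of 4 is copied from the quoted condition without assuming anything about its purpose.

```python
from typing import List


def check_context(n_ctx_train: int, n_ctx: int, n_prompt_tokens: int,
                  margin: int = 4) -> List[str]:
    """Return warnings; raise if the prompt does not fit the runtime context."""
    warnings: List[str] = []
    # Check 1: runtime context larger than the training context is allowed but flagged.
    if n_ctx > n_ctx_train:
        warnings.append(
            f"runtime context {n_ctx} exceeds training context {n_ctx_train}"
        )
    # Check 2: the prompt must fit the runtime budget, mirroring size > n_ctx - 4.
    if n_prompt_tokens > n_ctx - margin:
        raise ValueError(
            f"prompt is too long ({n_prompt_tokens} tokens, max {n_ctx - margin})"
        )
    return warnings


print(check_context(n_ctx_train=8192, n_ctx=4096, n_prompt_tokens=3500))   # []
print(check_context(n_ctx_train=8192, n_ctx=16384, n_prompt_tokens=3500))  # one warning
```

The first check only reads model metadata and warns; the second is the one that actually rejects a request, and it uses the runtime value.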

Longer context means more tokens for the model to process — more computation in every layer. The cost grows with the number of tokens. You will learn exactly how later.

Two costs matter in practice. Prefill cost depends on how many prompt tokens you send up front. Decode cost depends on how many tokens the model has to carry while generating each new token. This is why a long prompt can make both the first token and later tokens slower.

The practical lesson is simple: context length is a budget you spend, not a badge you own. If you spend it carelessly on verbose prompts, oversized retrieval payloads, or raw logs, you will pay in latency, memory pressure, and reduced room for useful output.
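
One way to spend the budget deliberately is to reserve the instruction and answer budgets first and only then admit retrieved passages. The sketch below assumes passage lengths are already known in tokens and arrive in priority order; the function name and the greedy strategy are illustrative assumptions, not a prescribed method.

```python
from typing import List


def pack_passages(passage_lengths: List[int], n_ctx: int,
                  n_instructions: int, n_answer_reserved: int) -> List[int]:
    """Return indices of passages that fit after reserving instructions and answer."""
    budget = n_ctx - n_instructions - n_answer_reserved
    chosen: List[int] = []
    for i, n_tokens in enumerate(passage_lengths):
        if n_tokens <= budget:  # skip passages that no longer fit
            chosen.append(i)
            budget -= n_tokens
    return chosen


# Four retrieved passages, a 4,096-token runtime context,
# 300 instruction tokens, and 596 tokens reserved for the answer.
print(pack_passages([1200, 900, 2500, 700], 4096, 300, 596))  # [0, 1, 3]
```

Because the answer reservation is subtracted up front, an oversized retrieval payload can only crowd out other passages, never the room left for output.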

Exercise 1 (Worked Solution)

A model was trained for 8,192 tokens, but your server is configured for n_ctx = 4,096. A prompt tokenizes to 3,500 tokens. How much room remains for the answer, and which number actually governs the request?

The remaining answer budget is 4,096 - 3,500 = 596 tokens. The governing number for this request is the runtime-configured context, 4,096, because that is what the serving process actually allocates and enforces.

The training-time 8,192-token capability still matters as background model metadata, but it does not magically give this process a larger working budget.

Exercise 2 (Worked Solution)

A model was trained for 8,192 tokens, but the runtime is configured for 16,384 and logs a warning. What changed, and what did not change?

The deployment configuration changed: this process is attempting to run with a larger effective context than the model was originally trained for. What did not change is the model's training history or architecture metadata.

This is why "supports 16K" and "was trained for 8K" are different claims. One is a runtime choice; the other is a property of the checkpoint and its training regime.

Check Yourself
Q1 (math)

A model has context length 4,096. Your prompt tokenizes to 3,500 tokens. How many tokens can the model generate?

Q2 (conceptual)

Why can a long prompt make an answer slow even if you only ask for 20 new tokens?

Q3 (shape)

You have an 8,192-token context window, a 7,900-token prompt, and you request 600 output tokens. What has to happen?