L04

Context Length Means Token Capacity

8 min

Question

What does context length measure?

Intuition

The model has a maximum number of tokens it can process at once. This is the context length (or context window).

Both the prompt and the generated output share this budget. If the context length is 8,192 tokens and your prompt uses 6,000, there are only 2,192 tokens left for generation.

Context length is not measured in characters or words. It is measured in tokens, because that is the unit the model operates on.

In practice, context length is not just a spec-sheet number. It determines whether a prompt fits at all, how much room remains for the answer, and how expensive long prompts become at runtime.

For serving work, keep one extra distinction in mind from the start: there is the model's trained or advertised context capability, and there is the context size actually configured for a running process. The deployed budget users experience is the runtime-configured one.

Interactive — Budget Planner

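The planner takes three inputs: context length, prompt tokens, and requested output tokens. It reports how much room is left after the prompt and how much of the requested output actually fits. A request can fail before generation even starts if the prompt alone fills or exceeds the context window.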

Scenario: request fits

Input	Value
Context length	8,192 tokens
Prompt tokens	6,200 tokens
Requested output tokens	800 tokens

Result	Value
Tokens left after prompt	1,992
Tokens the model can actually generate	800
Tokens that do not fit	0

Prefill cost: all 6,200 prompt tokens must be processed before generation starts. Longer prompts delay the first output token even if you only want a short answer.

Generation budget: the requested answer is capped by the remaining 1,992 tokens. If you ask for more, the runtime must truncate, reject, or evict earlier context.
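The planner's arithmetic can be sketched in a few lines. This is a minimal illustration, assuming a policy that simply caps the output at whatever room is left. The names `Plan` and `plan_request` are hypothetical and do not come from any runtime.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    left_after_prompt: int  # tokens still free once the prompt is loaded
    can_generate: int       # requested output tokens that actually fit
    do_not_fit: int         # requested output tokens beyond the budget
    prompt_fits: bool       # False if the prompt alone exceeds the window


def plan_request(context_length: int, prompt_tokens: int, requested_output: int) -> Plan:
    """Split a request against a shared prompt + output token budget."""
    left = max(context_length - prompt_tokens, 0)
    can_generate = min(requested_output, left)
    return Plan(
        left_after_prompt=left,
        can_generate=can_generate,
        do_not_fit=requested_output - can_generate,
        prompt_fits=prompt_tokens <= context_length,
    )


# Planner scenario: 6,200 prompt tokens + 1,992 left implies an 8,192-token window.
print(plan_request(8192, 6200, 800))
```

Asking for 2,500 output tokens with the same prompt would leave `can_generate` at 1,992 and report 508 tokens that do not fit. What happens to those tokens is a runtime policy decision, not something this arithmetic settles.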

Toy Example

A model with context length = 10 tokens:

[The] [·cat] [·sat] [·on] [·the] [·mat]

■ prompt: 7 tokens □ available: 3 tokens

Operational Cases

Scenario	Why context matters
Short chat turn	Fits easily, so there is plenty of room for generation and the prompt is cheap to prefill.
RAG prompt with many retrieved passages	The context window is shared by instructions, retrieved text, citations, and the answer budget.
Long source file or transcript	Even if it fits, the model must process every prompt token first, so time-to-first-token gets much worse.
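For the RAG case, the shared budget can be made concrete with a small sketch. It assumes passages arrive already ranked and already tokenized, and that the answer budget is reserved before any retrieved text is admitted. The function `select_passages` and all token counts are hypothetical.

```python
from typing import List, Tuple


def select_passages(
    context_length: int,
    instruction_tokens: int,
    answer_budget: int,
    passage_token_counts: List[int],
) -> Tuple[List[int], int]:
    """Keep ranked passages until the budget is spent.

    Returns the indices of the kept passages and the tokens left unused.
    """
    available = context_length - instruction_tokens - answer_budget
    if available < 0:
        raise ValueError("instructions and answer budget already exceed the window")

    kept: List[int] = []
    used = 0
    for index, count in enumerate(passage_token_counts):
        if used + count > available:
            break  # stop at the first passage that does not fit, preserving rank order
        kept.append(index)
        used += count
    return kept, available - used


# 8,192-token window, 300 instruction tokens, 1,000 tokens reserved for the answer.
kept, unused = select_passages(8192, 300, 1000, [1500, 1400, 1600, 1700, 1200])
print(kept, unused)
```

Stopping at the first passage that does not fit is one design choice among several. A system could instead skip that passage and try smaller ones further down the ranking, at the cost of breaking rank order.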

What Context Length Is Not

Context length is not a promise that the model will perfectly reason over every token in a huge prompt. It is a capacity limit: the maximum sequence length the runtime and model architecture are prepared to handle.

It is also not the same thing as "memory" in the ordinary English sense. A model with a longer context window can carry more prior tokens forward, but that does not guarantee good retrieval, faithful summarization, or equal attention to every earlier detail. Those are quality questions layered on top of the raw capacity limit.

Architectural Context vs Runtime Context

Engineers often say "the model has an 8K context" as if that were one number. In practice there are at least two related numbers. The model metadata records what context length it was trained for or expects architecturally. The runtime then chooses a context size for this serving process.

Those can differ. A server may configure a smaller runtime context to save memory, or it may configure a larger one using position-scaling techniques such as RoPE scaling, usually with a logged warning and a risk of degraded quality. So the effective user-facing budget is not just a property of the checkpoint. It is also a deployment choice.

This is the systems distinction to keep: trained capability describes what the model was built for, while runtime context describes what this process will actually allocate and accept.

Trained context: 8,192 tokens

Runtime at deploy A: n_ctx = 4,096 — saves memory, limits user prompts

Runtime at deploy B: n_ctx = 8,192 — uses full trained range

Runtime at deploy C: n_ctx = 16,384 — extended via RoPE scaling, quality may degrade
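The three deployments can be compared side by side with a short sketch. It assumes a hypothetical 5,000-token prompt and uses illustrative helper names (`classify_runtime`, `output_room`). Only the context sizes come from the example above.

```python
from typing import Optional

N_CTX_TRAIN = 8192  # trained context from the example above


def classify_runtime(n_ctx: int, n_ctx_train: int = N_CTX_TRAIN) -> str:
    """Label a runtime context size relative to the trained context."""
    if n_ctx < n_ctx_train:
        return "reduced"
    if n_ctx == n_ctx_train:
        return "full"
    return "extended"


def output_room(n_ctx: int, prompt_tokens: int) -> Optional[int]:
    """Tokens left for generation, or None when the prompt does not fit."""
    if prompt_tokens > n_ctx:
        return None
    return n_ctx - prompt_tokens


deployments = {"A": 4096, "B": 8192, "C": 16384}
for name, n_ctx in deployments.items():
    print(name, classify_runtime(n_ctx), output_room(n_ctx, prompt_tokens=5000))
```

The same checkpoint and the same prompt give three different outcomes: deploy A cannot accept the prompt at all, deploy B leaves 3,192 tokens, and deploy C leaves 11,384 tokens but runs beyond the trained range.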

Shapes

runtime context budget L (or n_ctx): maximum tokens this serving process is configured to handle

n_prompt + n_generated ≤ L

remaining_budget = L - n_prompt

active sequence length grows from n_prompt to n_prompt + n_generated during generation
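A minimal sketch of how the active sequence length grows during generation, assuming a runtime that simply stops when the window is full. Real runtimes may also stop earlier on an end-of-sequence token, which is not modeled here. The function name `simulate_generation` is hypothetical.

```python
from typing import List


def simulate_generation(n_ctx: int, n_prompt: int, max_new_tokens: int) -> List[int]:
    """Return the active sequence length after each generated token.

    Generation stops when max_new_tokens is reached or the window is full.
    """
    if n_prompt > n_ctx:
        raise ValueError("prompt does not fit in the context window")

    lengths: List[int] = []
    seq_len = n_prompt
    for _ in range(max_new_tokens):
        if seq_len >= n_ctx:
            break  # n_prompt + n_generated has reached L
        seq_len += 1
        lengths.append(seq_len)
    return lengths


# Toy example from earlier: 10-token window, 7-token prompt, 5 tokens requested.
print(simulate_generation(n_ctx=10, n_prompt=7, max_new_tokens=5))  # [8, 9, 10]
```

Only three of the five requested tokens are produced, because the sequence length reaches the limit of 10.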

Math

The only new math concept is a capacity constraint:

n_prompt + n_generated ≤ L

where L is the context length. Both the input and the output compete for the same budget.

remaining = L - n_prompt

If you request more than remaining output tokens, the runtime has to refuse, truncate the request, or drop earlier context depending on its policy.

Why Long Context Becomes Expensive

There are two distinct moments to keep in mind. First, the runtime must process the entire prompt before generation can begin. This is often called prefill. A 20,000-token prompt is expensive even if the answer will be only 10 tokens long, because all 20,000 prompt tokens still have to go through the model first.

Second, once generation starts, each new token is produced while carrying the accumulated context forward. That means later generated tokens are not independent of prompt length — the model must reference all prior tokens when computing attention, and the memory needed to store that history grows with each new token. A long active context can slow generation even if the per-token decode path is cheaper than prefill. You will learn the exact mechanisms (attention scaling and KV cache growth) in later modules.

This is the systems reason long prompts can feel bad twice: slow time-to-first-token and slower per-token generation afterward.
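A rough way to see both effects is to count token positions rather than seconds. The sketch below is a toy cost model, not a latency predictor: it assumes each decode step conditions on every token already in the sequence, and it ignores hardware, batching, and kernel details. The function names are hypothetical.

```python
def prefill_positions(n_prompt: int) -> int:
    """Prompt tokens that must pass through the model before the first output token."""
    return n_prompt


def decode_history_positions(n_prompt: int, n_generated: int) -> int:
    """Total prior positions visible across all decode steps.

    Decode step i (0-based) produces a token while the sequence already
    holds n_prompt + i tokens.
    """
    return sum(n_prompt + i for i in range(n_generated))


for n_prompt in (200, 20_000):
    print(
        n_prompt,
        prefill_positions(n_prompt),
        decode_history_positions(n_prompt, n_generated=10),
    )
```

For the same 10-token answer, the 20,000-token prompt carries 200,045 prior positions through decoding, against 2,045 for the 200-token prompt. How that count translates into time depends on mechanisms covered in later modules.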

Runtime Policies When the Budget Does Not Fit

Real serving systems have to choose a policy when a request exceeds the context budget. The simplest option is to reject the request. Another common option is to truncate the prompt or cap the allowed output length. Some systems also drop older tokens from a conversation and keep only the most recent span.

None of these options is free. Rejection hurts usability. Truncation may silently remove important instructions. Sliding windows or eviction policies may keep the request alive but change what information the model can still condition on.

This is why context budgeting is a product decision as much as a systems one. The runtime needs a clear policy for what to protect: instructions, recent turns, retrieved documents, or answer budget.
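The policies above can be sketched as one function with a policy switch. This is an illustration under stated assumptions, not how any particular server implements them: the policy names, the `protected_prefix` parameter (the number of leading tokens to keep, such as instructions), and the function itself are all hypothetical.

```python
from typing import List, Tuple


def apply_policy(
    policy: str,
    prompt: List[int],
    requested_output: int,
    n_ctx: int,
    protected_prefix: int = 0,
) -> Tuple[List[int], int]:
    """Return (prompt tokens to use, output token cap) under the chosen policy."""
    overflow = len(prompt) + requested_output - n_ctx
    if overflow <= 0:
        return prompt, requested_output  # everything fits, nothing to decide

    if policy == "reject":
        raise ValueError(f"request exceeds context budget by {overflow} tokens")

    if policy == "cap_output":
        room = n_ctx - len(prompt)
        if room <= 0:
            raise ValueError("prompt alone fills the context window")
        return prompt, room

    if policy == "drop_oldest":
        # Keep the protected prefix plus the most recent tokens that still fit.
        keep_tail = n_ctx - requested_output - protected_prefix
        if keep_tail < 0:
            raise ValueError("protected prefix and output budget exceed the window")
        head = prompt[:protected_prefix]
        tail = prompt[len(prompt) - keep_tail:] if keep_tail > 0 else []
        return head + tail, requested_output

    raise ValueError(f"unknown policy: {policy}")


# 12-token prompt, 4 output tokens requested, 10-token window, 2 protected tokens.
print(apply_policy("drop_oldest", list(range(12)), 4, 10, protected_prefix=2))
```

In this run, tokens 2 through 7 are dropped silently. That is exactly the trade-off described above: the request survives, but the model no longer sees the middle of the conversation.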

Implementation Hook

Open src/llama-cparams.h and notice the field uint32_t n_ctx. That is the context size used during inference for a specific runtime instance. It is the deployment-side budget.

Then open tools/completion/completion.cpp and look at two places. First, the runtime compares llama_model_n_ctx_train(model) with llama_n_ctx(ctx) and warns if the configured runtime context exceeds training context. Second, the code checks if ((int) embd_inp.size() > n_ctx - 4) and rejects any prompt longer than n_ctx - 4 tokens, so a prompt must leave a few tokens of headroom rather than fill the window exactly.

What to notice: the code is modeling two different ideas at once. The checkpoint carries a training-time context expectation, but the runtime enforces the actual configured budget for this process. That is why inference servers expose knobs like context size and max generated tokens separately, and why production systems reserve answer budget explicitly instead of letting the prompt consume the whole window by accident.

src/llama-cparams.h — llama_cparams (L9)
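The prompt-fit check and the idea of reserving answer budget can be paraphrased in Python. This is not the llama.cpp source. It mirrors only the `n_ctx - 4` comparison quoted above, and the `reserved_output` parameter is a hypothetical policy layered on top of it.

```python
PROMPT_MARGIN = 4  # mirrors the "n_ctx - 4" comparison quoted above


def prompt_fits(n_prompt: int, n_ctx: int, margin: int = PROMPT_MARGIN) -> bool:
    """True when the prompt leaves at least `margin` tokens of headroom."""
    return n_prompt <= n_ctx - margin


def max_prompt_tokens(n_ctx: int, reserved_output: int, margin: int = PROMPT_MARGIN) -> int:
    """Largest prompt accepted when answer room is reserved up front.

    Hypothetical policy: the reservation is subtracted in addition to the margin.
    """
    return max(n_ctx - margin - reserved_output, 0)


print(prompt_fits(4092, 4096))       # True: exactly at the limit
print(prompt_fits(4093, 4096))       # False: one token over
print(max_prompt_tokens(4096, 512))  # 3580
```

Reserving the answer budget before admitting the prompt keeps the two knobs, context size and maximum generated tokens, from competing by accident.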

Performance Hook

Longer context means more tokens for the model to process — more computation in every layer. The cost grows with the number of tokens. You will learn exactly how later.

Two costs matter in practice. Prefill cost depends on how many prompt tokens you send up front. Decode cost depends on how many tokens the model has to carry while generating each new token. This is why a long prompt can make both the first token and later tokens slower.

The practical lesson is simple: context length is a budget you spend, not a badge you own. If you spend it carelessly on verbose prompts, oversized retrieval payloads, or raw logs, you will pay in latency, memory pressure, and reduced room for useful output.

Worked Practice with Solutions

Worked Solution Exercise 1

A model was trained for 8,192 tokens, but your server is configured for n_ctx = 4,096. A prompt tokenizes to 3,500 tokens. How much room remains for the answer, and which number actually governs the request?

The remaining answer budget is 4,096 - 3,500 = 596 tokens. The governing number for this request is the runtime-configured context, 4,096, because that is what the serving process actually allocates and enforces.

The training-time 8,192-token capability still matters as background model metadata, but it does not magically give this process a larger working budget.

Worked Solution Exercise 2

A model was trained for 8,192 tokens, but the runtime is configured for 16,384 and logs a warning. What changed, and what did not change?

The deployment configuration changed: this process is attempting to run with a larger effective context than the model was originally trained for. What did not change is the model's training history or architecture metadata.

This is why "supports 16K" and "was trained for 8K" are different claims. One is a runtime choice; the other is a property of the checkpoint and its training regime.

Check Yourself

Q1 (math)

A model has context length 4096. Your prompt tokenizes to 3500 tokens. How many tokens can the model generate?

4096
596
3500

Q2 (conceptual)

Why can a long prompt make an answer slow even if you only ask for 20 new tokens?

Because generation speed depends only on the requested output length
Because the model must process the full prompt first and then carry that context while decoding
Because the tokenizer becomes much slower for long prompts

Q3 (shape)

You have an 8,192-token context window, a 7,900-token prompt, and you request 600 output tokens. What has to happen?

The request fits exactly because generation is counted separately
The runtime must reduce output, reject the request, or remove context because only 292 tokens remain
The tokenizer compresses the prompt automatically to make room