Serving Adds Scheduling, Not Just Math
A benchmark runs one request at a time: prefill the prompt, decode tokens, measure speed, done. A server handles many requests arriving at different times, with different prompt lengths, finishing at different times.
Continuous batching is the key scheduling technique. Instead of waiting for all requests in a batch to finish before starting new ones (static batching), the server inserts new requests into the running batch as soon as slots open up. A request that finishes early frees its slot immediately for a new arrival.
This means that at any given moment the GPU might be doing prefill for one request and decode for three others, all in the same batch. The mix of prefill and decode tokens changes the compute profile. The server must balance throughput (total tokens per second across all users) against latency (how long each user waits for their tokens).
A setting that maximizes throughput in a benchmark (e.g., huge batch size, large ubatch) may cause unacceptable latency spikes in serving, because a long prefill can stall ongoing decode requests.
Example: a server with capacity for 4 concurrent requests. R2 finishes and R5 immediately takes its slot. Prefill and decode run in the same batch, with no idle waiting.
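The slot handoff described above can be traced with a small simulation. This is a minimal sketch, not how any particular server is implemented: the request names, the generation lengths, and the one-token-per-step model are assumptions chosen for illustration, and prefill cost is ignored entirely.

```python
from collections import deque

def simulate_continuous(requests, num_slots):
    """Trace slot occupancy under continuous batching.

    requests: list of (name, tokens_to_generate), in arrival order.
    Assumes every request is already queued at step 0 and each active
    request emits one token per step (prefill cost is ignored).
    Returns one entry per step: the request name held by each slot
    during that step (None for an empty slot).
    """
    waiting = deque(requests)
    slots = [None] * num_slots      # each slot holds [name, remaining] or None
    timeline = []
    while waiting or any(s is not None for s in slots):
        # Fill every free slot from the queue before running the step.
        for i in range(num_slots):
            if slots[i] is None and waiting:
                name, remaining = waiting.popleft()
                slots[i] = [name, remaining]
        timeline.append([s[0] if s else None for s in slots])
        # Run one decode step: every active request emits one token.
        for i, s in enumerate(slots):
            if s is not None:
                s[1] -= 1
                if s[1] <= 0:
                    slots[i] = None  # slot is free again for the next step
    return timeline

# Hypothetical workload: five requests, four slots.
requests = [("R1", 4), ("R2", 2), ("R3", 5), ("R4", 3), ("R5", 3)]
timeline = simulate_continuous(requests, num_slots=4)
for step, occupancy in enumerate(timeline):
    print(step, occupancy)
```

With these made-up lengths, R2 finishes after step 1 and R5 occupies its slot from step 2 onward, without waiting for R1, R3, or R4 to finish.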
Mixing prefill and decode in the same batch creates a tension. Prefill involves large matrix multiplies that keep the GPU busy for a long time. Decode tokens are tiny operations that complete quickly. When a new request arrives and needs prefill, the ongoing decode tokens for existing requests are delayed until the prefill computation finishes.
This is called prefill interference. From the user's perspective: they are mid-conversation, tokens are streaming smoothly, then suddenly there is a pause — because someone else's long prompt just started prefilling on the same GPU. The decode tokens had to wait.
Servers manage this by chunking prefill into smaller pieces (using the ubatch mechanism from L30). Instead of prefilling a 4,000-token prompt in one shot, the server might process 512 tokens at a time, interleaving decode steps for other requests between chunks. This caps the maximum stall time but extends total prefill duration.
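The effect of chunking can be put in rough numbers. The sketch below is a back-of-the-envelope model, not a measurement: the per-token prefill cost and the per-step decode cost are placeholder values, the function name is hypothetical, and real prefill time does not scale perfectly linearly with chunk size.

```python
import math

def chunked_prefill_profile(prompt_tokens, chunk_size,
                            prefill_ms_per_token, decode_step_ms):
    """Estimate stall and total prefill time when prefill is chunked.

    Assumes prefill cost is linear in tokens and that one decode step for
    the other requests runs between consecutive chunks. Both cost
    parameters are hypothetical inputs, not measured values.
    """
    num_chunks = math.ceil(prompt_tokens / chunk_size)
    # Longest time an ongoing decode waits: one full chunk of prefill.
    max_stall_ms = min(chunk_size, prompt_tokens) * prefill_ms_per_token
    # Decode steps interleaved between chunks lengthen the new request's prefill.
    interleaved_ms = (num_chunks - 1) * decode_step_ms
    total_prefill_ms = prompt_tokens * prefill_ms_per_token + interleaved_ms
    return num_chunks, max_stall_ms, total_prefill_ms

# 4,000-token prompt, assumed 0.5 ms per prefill token and 20 ms per decode step.
one_shot = chunked_prefill_profile(4000, 4000, 0.5, 20.0)
chunked = chunked_prefill_profile(4000, 512, 0.5, 20.0)
print(one_shot)
print(chunked)
```

Under these assumed costs, chunking into 512-token pieces cuts the worst-case decode stall from 2,000 ms to 256 ms, while the new request's prefill grows from 2,000 ms to 2,140 ms.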
The tradeoff is throughput versus latency fairness. Large prefill chunks maximize GPU utilization. Small chunks keep decode latency stable for all users. There is no single right answer; it depends on the serving scenario.
A benchmark runs one request: the GPU processes only that one sequence, with no contention. The optimal setting is maximum batch size, maximum ubatch, and all resources devoted to one computation.
In serving, the GPU handles many concurrent requests. If you use the benchmark-optimal settings, a single long prefill blocks all decode slots, causing latency spikes. The "fastest" setting in a benchmark can be the "most unfair" setting in production.
This is why performance reasoning requires knowing the workload, not just the model. The same model on the same hardware behaves very differently depending on whether you are running one request or fifty.
No new formulas. The core difference is a scheduling policy:
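One way to sketch the two policies side by side is to count how many decode steps each needs to finish the same queue. This is an illustrative model under simplifying assumptions: every request is queued at the start, each active request emits one token per step, and prefill is ignored. The function names and generation lengths are hypothetical.

```python
import heapq

def static_batch_steps(gen_lengths, num_slots):
    """Static batching: a batch runs until its longest request finishes."""
    total = 0
    for start in range(0, len(gen_lengths), num_slots):
        batch = gen_lengths[start:start + num_slots]
        total += max(batch)  # finished requests hold idle slots until then
    return total

def continuous_batch_steps(gen_lengths, num_slots):
    """Continuous batching: a freed slot goes to the next request at once."""
    # Min-heap of the step at which each slot becomes free.
    free_at = [0] * num_slots
    heapq.heapify(free_at)
    finish = 0
    for length in gen_lengths:
        start = heapq.heappop(free_at)  # earliest slot to open up
        end = start + length
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish

# Hypothetical workload: one long request among several short ones.
lengths = [20, 500, 30, 40, 25, 60, 35, 45]
static_steps = static_batch_steps(lengths, num_slots=4)
continuous_steps = continuous_batch_steps(lengths, num_slots=4)
print(static_steps, continuous_steps)
```

With these assumed lengths, static batching needs 560 steps because the first batch is held until its 500-token request finishes, while continuous batching finishes in 500 steps because the short requests cycle through the other three slots in the meantime.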
The llama.cpp server (llama-server) implements continuous batching. It maintains a pool of slots, each tracking one active request's position in the KV cache. When a request completes, its KV cache slot is freed and reassigned. The --parallel flag controls how many concurrent slots are available.
In benchmarks, you optimize for one request: large batch size, all compute for one sequence. In serving, you split resources across many concurrent requests. A large prefill for a new request can stall decode for existing requests ("prefill interference"). Servers may limit prefill chunk size to keep decode latency stable. The best benchmark setting (maximize single-request throughput) is often a poor serving setting (causes latency spikes for other users).
A server runs 8 requests in a static batch. Request 1 finishes after 20 tokens, but request 8 needs 500 tokens. What happens to request 1's GPU slot while request 8 is still generating?
A server chunks a 10,000-token prefill into 500-token pieces, interleaving decode steps for other users between chunks. Compared to prefilling all 10,000 tokens at once, what changes?