L30A

Serving Adds Scheduling, Not Just Math

16 min

Question

Why does serving behave differently from benchmarks?

Intuition

A benchmark runs one request at a time: prefill the prompt, decode tokens, measure speed, done. A server handles many requests arriving at different times, with different prompt lengths, finishing at different times.

Continuous batching is the key scheduling technique. Instead of waiting for all requests in a batch to finish before starting new ones (static batching), the server inserts new requests into the running batch as soon as slots open up. A request that finishes early frees its slot immediately for a new arrival.

This means at any given moment, the GPU might be doing prefill for one request and decode for three others — simultaneously. The mix of prefill and decode tokens in the same batch changes the compute profile. The server must balance throughput (total tokens per second across all users) against latency (time each user waits for their tokens).

A setting that maximizes throughput in a benchmark (e.g., huge batch size, large ubatch) may cause unacceptable latency spikes in serving, because a long prefill can stall ongoing decode requests.

Toy Example

Server with capacity for 4 concurrent requests:

t=0: [R1 prefill] [R2 prefill] [empty] [empty]

t=1: [R1 decode] [R2 decode] [R3 prefill] [empty]

t=2: [R1 decode] [R2 done] [R3 decode] [R4 prefill]

t=3: [R1 decode] [R5 prefill] [R3 decode] [R4 decode]

R2 finishes and R5 immediately takes its slot. Prefill and decode run in the same batch. No idle waiting.
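The slot behaviour in this timeline can be sketched as a small simulation. This is a toy model, not how any real server is implemented: the `Request` class, the `simulate` function, and the arrival and length values are all hypothetical, and each "step" is an abstract unit rather than real time. One simplification differs from the timeline above: a finished request releases its slot at the step boundary instead of showing a separate "done" step, so R2's slot is reused by R4 at t=2.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival: int        # step at which the request reaches the server
    prefill_steps: int  # steps spent in prefill (toy unit, not real time)
    decode_steps: int   # steps spent generating tokens
    progress: int = 0   # steps of work completed so far

    def phase(self) -> str:
        return "prefill" if self.progress < self.prefill_steps else "decode"

    def finished(self) -> bool:
        return self.progress >= self.prefill_steps + self.decode_steps


def simulate(requests, n_slots, n_steps):
    """Return one row of slot labels per step under continuous batching."""
    slots = [None] * n_slots
    waiting = sorted(requests, key=lambda r: r.arrival)
    timeline = []
    for t in range(n_steps):
        # 1. Release slots whose request completed on the previous step.
        for i, r in enumerate(slots):
            if r is not None and r.finished():
                slots[i] = None
        # 2. Fill free slots with requests that have already arrived.
        for i in range(n_slots):
            if slots[i] is None and waiting and waiting[0].arrival <= t:
                slots[i] = waiting.pop(0)
        # 3. Every occupied slot does one step of work in the same batch.
        row = []
        for r in slots:
            if r is None:
                row.append("empty")
            else:
                row.append(f"{r.name} {r.phase()}")
                r.progress += 1
        timeline.append(row)
    return timeline


# Hypothetical workload: R2 is short, the others keep decoding.
requests = [
    Request("R1", arrival=0, prefill_steps=1, decode_steps=10),
    Request("R2", arrival=0, prefill_steps=1, decode_steps=1),
    Request("R3", arrival=1, prefill_steps=1, decode_steps=10),
    Request("R4", arrival=2, prefill_steps=1, decode_steps=10),
    Request("R5", arrival=3, prefill_steps=1, decode_steps=10),
]
timeline = simulate(requests, n_slots=4, n_steps=4)
for t, row in enumerate(timeline):
    print(f"t={t}: " + " ".join(f"[{label}]" for label in row))
```

Each row mixes prefill and decode labels, which is the point of the toy example: the batch composition is decided step by step, not once per batch.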

The Prefill Interference Problem

Mixing prefill and decode in the same batch creates a tension. Prefill involves large matrix multiplies that keep the GPU busy for a long time. A decode step adds only one token per request, so on its own it completes quickly. When a new request arrives and needs prefill, the ongoing decode tokens for existing requests are delayed until the prefill computation finishes.

This is called prefill interference. From the user's perspective: they are mid-conversation, tokens are streaming smoothly, then suddenly there is a pause — because someone else's long prompt just started prefilling on the same GPU. The decode tokens had to wait.

Servers manage this by chunking prefill into smaller pieces (using the ubatch mechanism from L30). Instead of prefilling a 4,000-token prompt in one shot, the server might process 512 tokens at a time, interleaving decode steps for other requests between chunks. This caps the maximum stall time but extends total prefill duration.
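The effect of chunking on stall time and total prefill duration can be estimated with a rough calculation. The sketch below assumes a linear cost model (a fixed time per prefill token and a fixed time per interleaved decode step). Both timing constants are made-up illustrative values, and `chunked_prefill_profile` is a hypothetical helper, not part of any server. Real prefill cost also includes per-chunk overheads and attention cost that grows with context, so treat the numbers as directional only.

```python
import math


def chunked_prefill_profile(prompt_tokens, chunk_tokens,
                            ms_per_prefill_token, ms_per_decode_step):
    """Return (n_chunks, max_stall_ms, total_prefill_ms) under a linear cost model."""
    n_chunks = math.ceil(prompt_tokens / chunk_tokens)
    # The longest a pending decode step can be blocked is one full chunk.
    max_stall_ms = min(chunk_tokens, prompt_tokens) * ms_per_prefill_token
    # Prefill compute is unchanged; one decode step runs between consecutive
    # chunks, which stretches the wall-clock time until prefill completes.
    total_prefill_ms = (prompt_tokens * ms_per_prefill_token
                        + (n_chunks - 1) * ms_per_decode_step)
    return n_chunks, max_stall_ms, total_prefill_ms


# Assumed costs: 0.5 ms per prefill token, 20 ms per decode step.
one_shot = chunked_prefill_profile(4000, 4000, 0.5, 20)
chunked = chunked_prefill_profile(4000, 512, 0.5, 20)
print("one shot:", one_shot)  # (1, 2000.0, 2000.0)
print("chunked: ", chunked)   # (8, 256.0, 2140.0)
```

Under these assumed costs, chunking cuts the worst-case stall from 2000 ms to 256 ms while the prompt takes 140 ms longer to finish prefilling.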

The tradeoff is throughput versus latency fairness. Large prefill chunks maximize GPU utilization. Small chunks keep decode latency stable for all users. There is no single right answer: it depends on the serving scenario.

Why Benchmark Settings Mislead

A benchmark runs one request: the GPU processes only that one sequence, with no contention. The setting that looks best there is usually the largest batch size and ubatch that fit, with all resources devoted to one computation.

In serving, the GPU handles many concurrent requests. If you use the benchmark-optimal settings, a single long prefill blocks all decode slots, causing latency spikes. The "fastest" setting in a benchmark can be the "most unfair" setting in production.

This is why performance reasoning requires knowing the workload, not just the model. The same model on the same hardware behaves very differently depending on whether you are running one request or fifty.

Shapes

Mixed batch at time t: some tokens are prefill (many rows), some are decode (1 row each)

Effective batch shape: [n_prefill_tokens + n_decode_requests, d_model]

The shape of the combined batch changes every step as requests arrive and complete.
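As a small illustration of how the row count moves from step to step, the sketch below adds up prefill rows and decode rows for a few hypothetical steps. The helper name `mixed_batch_shape`, the chunk sizes, and `d_model = 4096` are assumptions chosen for the example.

```python
def mixed_batch_shape(prefill_chunk_tokens, n_decode_requests, d_model):
    """Rows = all prefill tokens in this step + one row per decoding request."""
    n_rows = sum(prefill_chunk_tokens) + n_decode_requests
    return (n_rows, d_model)


D_MODEL = 4096  # assumed hidden size, for illustration only

# Step with one 512-token prefill chunk and three decoding requests.
step_a = mixed_batch_shape([512], 3, D_MODEL)
# Step where prefill has finished and four requests are decoding.
step_b = mixed_batch_shape([], 4, D_MODEL)
# Step with two prefill chunks of different sizes and two decoding requests.
step_c = mixed_batch_shape([512, 128], 2, D_MODEL)

print(step_a, step_b, step_c)  # (515, 4096) (4, 4096) (642, 4096)
```

The row count swings by two orders of magnitude between a prefill-heavy step and a decode-only step, which is why the compute profile of a serving workload is not constant.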

Key Concept

No new formulas. The core difference is a scheduling policy:

// Static batching:

wait until all N requests arrive → process together → wait until all finish

// Continuous batching:

insert request as slot opens → evict when done → fill slot immediately

Throughput ↑ but scheduling overhead and latency variance also ↑
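The throughput side of this comparison can be sketched with a toy calculation. The model is deliberately simplified and all of it is assumed for illustration: every request is already queued at step 0, each request occupies a slot for a fixed number of steps, and prefill, arrival times, and scheduling overhead are ignored. The function names and the request lengths are hypothetical.

```python
import heapq


def static_batching_steps(lengths, n_slots):
    """Each batch of n_slots requests runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), n_slots):
        total += max(lengths[i:i + n_slots])
    return total


def continuous_batching_steps(lengths, n_slots):
    """Each request starts as soon as any slot is free."""
    free_at = [0] * n_slots          # step at which each slot becomes free
    heapq.heapify(free_at)
    for n in lengths:
        start = heapq.heappop(free_at)   # earliest available slot
        heapq.heappush(free_at, start + n)
    return max(free_at)


# Hypothetical generation lengths (in steps) for eight queued requests.
lengths = [20, 500, 30, 40, 500, 25, 35, 45]
static_steps = static_batching_steps(lengths, n_slots=4)
continuous_steps = continuous_batching_steps(lengths, n_slots=4)
print(static_steps, continuous_steps)  # 1000 520
```

In the static case, three slots in each batch sit idle while the 500-step request finishes. In the continuous case, the short requests drain through the freed slots, so the same work completes in roughly half the steps under these assumptions.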

Implementation Hook

The llama.cpp server (llama-server) implements continuous batching. It maintains a pool of slots, each tracking one active request's position in the KV cache. When a request completes, its KV cache slot is freed and reassigned. The --parallel flag controls how many concurrent slots are available.

tools/server/server.cpp — slot management

Performance Hook

In benchmarks, you optimize for one request: large batch size, all compute for one sequence. In serving, you split resources across many concurrent requests. A large prefill for a new request can stall decode for existing requests ("prefill interference"). Servers may limit prefill chunk size to keep decode latency stable. The best benchmark setting (maximize single-request throughput) is often a poor serving setting (causes latency spikes for other users).

Check Yourself

Q1 (reasoning)

A server runs 8 requests in a static batch. Request 1 finishes after 20 tokens, but request 8 needs 500 tokens. What happens to request 1's GPU slot while request 8 is still generating?

It is immediately filled by a new request

It sits idle until all 8 requests finish; the GPU wastes compute on empty slots

It is used to speed up request 8 by giving it extra compute

Q2 (reasoning)

A server chunks a 10,000-token prefill into 500-token pieces, interleaving decode steps for other users between chunks. Compared to prefilling all 10,000 tokens at once, what changes?

Total prefill compute decreases because smaller chunks are more efficient

Total prefill compute stays the same, but other users' decode latency improves because they get GPU time between chunks

Total prefill compute increases, and other users' decode latency also increases