L30

Batch Size and Ubatch Size Change Execution Policy

16 min

Question

What does batch size actually control?

Intuition

The model's math — attention, FFN, output head — is defined for a sequence of tokens. It does not specify how many tokens you must process in one call. That is an execution policy choice, not a model behavior choice.

Batch size (n_batch) is the maximum number of tokens accepted per call to the inference engine. If your prompt has 2,000 tokens and batch size is 512, the engine splits prefill into four calls: three of 512 tokens and one of 464.

Ubatch size (n_ubatch, for "micro-batch") is a further subdivision. Within each batch, the engine may split work into smaller physical chunks for the GPU. This controls memory usage and kernel scheduling without changing what the model computes.

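The critical insight: changing batch or ubatch size changes throughput and memory usage, but the mathematical output is identical. The model produces the same logits regardless of how you chunk the work, up to small floating-point rounding differences that can arise when kernels sum values in a different order.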

Why Not Always Use the Largest Batch?

If larger batches mean larger (faster) matrix multiplies, why not set batch = prompt length and do everything in one shot?

Memory. Larger batches mean larger intermediate tensors (attention scores, FFN activations). For a 100K-token prompt processed as one batch, the attention score matrix, if materialized in full, would be [100K, 100K] per head: about 20 GB in FP16 per head, and every head in the layer needs its own. Chunking into smaller ubatches caps the peak memory.
Latency fairness. In serving (covered next), a huge prefill batch blocks decode for other users. Smaller batches let the server interleave prefill and decode, keeping all users' streams flowing.
Diminishing returns. Past a certain size, the matrix multiplies are already large enough to saturate the hardware. Making them bigger does not help — it just uses more memory.
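
The memory figure above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the score matrix is fully materialized in FP16 (2 bytes per element); engines with fused attention kernels avoid storing it whole. The helper name `attn_score_bytes` and the 512-token chunk size are illustrative choices, not part of any engine's API.

```python
def attn_score_bytes(n_query: int, n_context: int, bytes_per_elem: int = 2) -> int:
    """Bytes for one head's attention score matrix of shape [n_query, n_context]."""
    return n_query * n_context * bytes_per_elem


n_prompt = 100_000

# Whole prompt in one shot: every token scores against every token.
full = attn_score_bytes(n_prompt, n_prompt)

# Worst-case 512-token chunk: the last chunk scores against the full context.
chunk = attn_score_bytes(512, n_prompt)

print(f"one shot  : {full / 1e9:.1f} GB per head")
print(f"512 chunk : {chunk / 1e6:.1f} MB per head")
```

Under these assumptions the one-shot matrix is 20.0 GB per head, while the largest 512-token chunk needs 102.4 MB per head.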

Toy Example

Prompt: 8 tokens. Batch size: 4. Ubatch size: 2.

Logical prompt: [t1, t2, t3, t4, t5, t6, t7, t8]

batch 1: [t1, t2, t3, t4]

ubatch 1a: [t1, t2]

ubatch 1b: [t3, t4]

batch 2: [t5, t6, t7, t8]

ubatch 2a: [t5, t6]

ubatch 2b: [t7, t8]

Same 8 tokens, same result. Different chunking changes how much GPU memory is used at once.
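
The toy schedule above can be reproduced with a short sketch. The helpers `split` and `plan` are hypothetical names used only for illustration; a real engine does the equivalent splitting internally.

```python
def split(tokens, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def plan(tokens, n_batch, n_ubatch):
    """Return a list of batches, each of which is a list of ubatches."""
    return [split(batch, n_ubatch) for batch in split(tokens, n_batch)]


prompt = [f"t{i}" for i in range(1, 9)]  # t1 .. t8
schedule = plan(prompt, n_batch=4, n_ubatch=2)

for b, ubatches in enumerate(schedule, start=1):
    print(f"batch {b}: {ubatches}")

# Flattening the schedule recovers the original prompt, in order.
flattened = [tok for batch in schedule for ub in batch for tok in ub]
```

Changing `n_batch` or `n_ubatch` changes only the grouping; `flattened` always equals `prompt`.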

Shapes

Per ubatch input: [n_ubatch, d_model]

Per batch input: [n_batch, d_model]

Full prompt: [n_prompt, d_model]

n_ubatch ≤ n_batch, and no chunk holds more tokens than remain in the prompt. Shapes change, math does not.

Math

The math is unchanged. Batching is a chunking of the same operation:

// Full prefill (conceptual):

H_out = layers(X_[1:n]) // all tokens at once

// Batched prefill (actual):

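H_1 = layers(X_[1:b]) // first chunk, writes K/V for positions 1..b

H_2 = layers(X_[b+1:2b]; KV_[1:b]) // second chunk, attends to cached K/V from the first

Same final result: concatenating H_1, H_2, ... gives H_out. The KV cache accumulates across chunks, so later chunks still attend to every earlier token.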
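
The equivalence can be checked on a toy model. The sketch below assumes a single layer of single-head causal self-attention with identity Q/K/V projections, which is far simpler than a real transformer; the function names are illustrative. It runs the same 8 token vectors once as a single chunk and once as four chunks of two, with a cache carrying keys and values between chunks.

```python
import math
import random


def attend(q, keys, values):
    """Softmax attention of one query vector over the given keys and values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(q))]


def prefill(xs, chunk):
    """Causal self-attention over xs, processed `chunk` tokens at a time."""
    k_cache, v_cache, out = [], [], []
    for start in range(0, len(xs), chunk):
        block = xs[start:start + chunk]
        # Append this chunk's keys and values to the cache before attending.
        k_cache.extend(block)
        v_cache.extend(block)
        for i, q in enumerate(block):
            visible = start + i + 1  # causal mask: positions 0..start+i
            out.append(attend(q, k_cache[:visible], v_cache[:visible]))
    return out


random.seed(0)
xs = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]

full = prefill(xs, chunk=8)     # all tokens at once
chunked = prefill(xs, chunk=2)  # four chunks of two

max_diff = max(abs(a - b) for r1, r2 in zip(full, chunked) for a, b in zip(r1, r2))
print("max abs difference:", max_diff)
```

In this toy the two runs perform the same operations in the same order, so the outputs match. A real engine may pick different kernels for different chunk shapes, which can introduce small rounding differences.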

Implementation Hook

In llama.cpp, n_batch and n_ubatch are set in the context parameters. The engine automatically splits large prompts into chunks of these sizes. Lowering n_ubatch reduces peak memory for activations; raising it improves GPU utilization during prefill.

common/arg.cpp — batch-size and ubatch-size flags

Performance Hook

Larger ubatch sizes give the GPU bigger matrix multiplies, improving compute utilization during prefill. But larger ubatch sizes also require more memory for intermediate activations. The sweet spot depends on your GPU's memory and compute capacity. During single-stream decode, batch size matters less because each step processes only one new token regardless.

Check Yourself

Q1 (reasoning)

You double n_ubatch from 256 to 512. Prefill now processes the same 1024-token prompt in 2 chunks instead of 4. What changes?

The output logits change because larger chunks see more context at once

Throughput may improve (larger matrices, better hardware utilization) but the output logits are identical

Memory usage stays the same because the total number of tokens is unchanged

Q2 (math)

If n_ubatch is set to 256 and the prompt has 1024 tokens, how many ubatch evaluations occur during prefill?

1: the engine processes everything at once

4: the prompt is split into 1024/256 chunks

256: one evaluation per ubatch-sized group