
Batch Size and Ubatch Size Change Execution Policy

What does batch size actually control?

The model's math — attention, FFN, output head — is defined for a sequence of tokens. It does not specify how many tokens you must process in one call. That is an execution policy choice, not a model behavior choice.

Batch size (n_batch) is the maximum number of tokens accepted per call to the inference engine. If your prompt has 2,000 tokens and batch size is 512, the engine splits prefill into multiple calls of up to 512 tokens each: here four calls, of 512, 512, 512, and 464 tokens.

Ubatch size (n_ubatch, for "micro-batch") is a further subdivision. Within each batch, the engine may split work into smaller physical chunks for the GPU. This controls memory usage and kernel scheduling without changing what the model computes.

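The critical insight: changing batch or ubatch size changes throughput and memory usage, but not what the model computes. The logits are mathematically the same however you chunk the work. In practice they can differ by tiny floating-point rounding amounts, because a different chunk size may change the order of arithmetic inside the kernels.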

If larger batches mean larger (faster) matrix multiplies, why not set batch = prompt length and do everything in one shot?

  • Memory. Larger batches mean larger intermediate tensors (attention scores, FFN activations). For a 100K-token prompt processed as one batch, a fully materialized attention score matrix would be [100K, 100K] per head: about 20 GB in FP16 per head, multiplied across the heads of a layer (layers run one after another, so the buffer can be reused between them). Chunking into smaller ubatches caps the peak memory.
  • Latency fairness. In serving (covered next), a huge prefill batch blocks decode for other users. Smaller batches let the server interleave prefill and decode, keeping all users' streams flowing.
  • Diminishing returns. Past a certain size, the matrix multiplies are already large enough to saturate the hardware. Making them bigger does not help — it just uses more memory.
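The memory figure in the first bullet can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the score matrix is fully materialized in FP16 (2 bytes per element); fused attention kernels avoid storing it at all. The function name and the ubatch size of 512 are illustrative choices, not values taken from any particular engine.

```python
def attention_score_bytes(n_query: int, n_key: int, bytes_per_element: int = 2) -> int:
    """Bytes needed for one head's [n_query, n_key] attention score matrix."""
    return n_query * n_key * bytes_per_element


n_prompt = 100_000

# One-shot prefill: every token is a query against every token.
full = attention_score_bytes(n_prompt, n_prompt)

# Chunked prefill: only 512 queries are live at once (illustrative ubatch size),
# each scored against at most the whole prompt.
chunked = attention_score_bytes(512, n_prompt)

print(f"one-shot:   {full / 1e9:.1f} GB per head")
print(f"ubatch 512: {chunked / 1e6:.1f} MB per head")
```

Under these assumptions the peak drops from about 20 GB to about 102 MB per head, which is the point of capping the ubatch size.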

Prompt: 8 tokens. Batch size: 4. Ubatch size: 2.

Logical prompt: [t1, t2, t3, t4, t5, t6, t7, t8]
batch 1: [t1, t2, t3, t4]
ubatch 1a: [t1, t2]
ubatch 1b: [t3, t4]
batch 2: [t5, t6, t7, t8]
ubatch 2a: [t5, t6]
ubatch 2b: [t7, t8]

Same 8 tokens, same result. Different chunking changes how much GPU memory is used at once.
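The two-level split in this example can be sketched in a few lines. The helper names `split` and `plan_prefill` are hypothetical and only illustrate the chunking; they are not how any particular engine schedules its work.

```python
from typing import List, Sequence


def split(tokens: Sequence[str], size: int) -> List[List[str]]:
    """Cut a sequence into consecutive chunks of at most `size` items."""
    if size <= 0:
        raise ValueError("chunk size must be positive")
    return [list(tokens[i:i + size]) for i in range(0, len(tokens), size)]


def plan_prefill(tokens: Sequence[str], n_batch: int, n_ubatch: int) -> List[List[List[str]]]:
    """Return a list of batches, each of which is a list of ubatches."""
    return [split(batch, n_ubatch) for batch in split(tokens, n_batch)]


prompt = [f"t{i}" for i in range(1, 9)]  # t1 .. t8
plan = plan_prefill(prompt, n_batch=4, n_ubatch=2)

for b, ubatches in enumerate(plan, start=1):
    print(f"batch {b}: {ubatches}")
```

Flattening `plan` gives back the original eight tokens in order, which is why the result does not depend on the chunk sizes.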

Per ubatch input: [n_ubatch, d_model]
Per batch input: [n_batch, d_model]
Full prompt: [n_prompt, d_model]
For a prompt longer than one batch, n_ubatch ≤ n_batch ≤ n_prompt. Shapes change, math does not.

The math is unchanged. Batching is a chunking of the same operation:

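// Full prefill (conceptual):
H_out = layers(X_[1:n])   // all tokens at once
// Batched prefill (actual), with chunk size b:
H_1 = layers(X_[1:b])   // first chunk; writes K/V for positions 1..b to the cache
H_2 = layers(X_[b+1:2b]) // second chunk; attends to cached K/V for 1..b plus its own
// ... and so on until all n tokens are processed
H_out = concat(H_1, H_2, ...)
Same final result. Each chunk reads the KV cache left by earlier chunks and appends its own keys and values.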
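The equivalence can be checked numerically on a toy case. The sketch below assumes a single causal attention head with random weights in NumPy, not a full transformer, and every name in it is illustrative. It computes attention once over all tokens and once chunk by chunk with a growing KV cache, then compares the two.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)  # subtract row max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def causal_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray, offset: int) -> np.ndarray:
    """q holds queries for positions offset..offset+m-1; k and v hold all positions so far."""
    m, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    q_pos = offset + np.arange(m)[:, None]
    k_pos = np.arange(k.shape[0])[None, :]
    scores = np.where(k_pos <= q_pos, scores, -np.inf)  # mask future positions
    return softmax(scores) @ v


rng = np.random.default_rng(0)
n, d, chunk = 8, 16, 2
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))

# Full prefill: all tokens in one call.
full = causal_attention(x @ wq, x @ wk, x @ wv, offset=0)

# Chunked prefill: the KV cache grows as each chunk is processed.
k_cache = np.empty((0, d))
v_cache = np.empty((0, d))
outputs = []
for start in range(0, n, chunk):
    xc = x[start:start + chunk]
    k_cache = np.vstack([k_cache, xc @ wk])
    v_cache = np.vstack([v_cache, xc @ wv])
    outputs.append(causal_attention(xc @ wq, k_cache, v_cache, offset=start))
chunked = np.vstack(outputs)

print(np.allclose(full, chunked))
```

The two results should agree to within floating-point tolerance. The chunked path works only because each chunk sees the keys and values cached by the chunks before it.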

In llama.cpp, n_batch and n_ubatch are set in the context parameters. The engine automatically splits large prompts into chunks of these sizes. Lowering n_ubatch reduces peak memory for activations; raising it improves GPU utilization during prefill.

Larger ubatch sizes give the GPU bigger matrix multiplies, improving compute utilization during prefill. But larger ubatch sizes also require more memory for intermediate activations. The sweet spot depends on your GPU's memory and compute capacity. During decode, batch size matters less because each sequence contributes only one new token per step, so a single stream has nothing to chunk.

Check Yourself
Q1 (reasoning)

You double n_ubatch from 256 to 512. Prefill now processes the same 1024-token prompt in 2 chunks instead of 4. What changes?

Q2 (math)

If n_ubatch is set to 256 and the prompt has 1024 tokens, how many ubatch evaluations occur during prefill?