M6/Inference Mechanics
L27

Prefill Processes the Prompt

14 min
What happens during prefill?

When you send a prompt to the model, all prompt tokens are known up front. The model does not need to generate them one at a time — it already has the full sequence. This lets the model process the entire prompt as one logical phase (which may be split into chunks, as you will learn in L30).

This phase is called prefill. Every prompt token flows through every layer: attention computes relationships between all prompt positions, and the FFN transforms each position. Because all tokens are available simultaneously, the Q, K, V, and FFN projections are large matrix multiplies — the [n_tokens, d_model] × [d_model, d_out] pattern you learned in L08, but with n_tokens potentially in the thousands.

Prefill is compute-bound. The GPU has big matrices to chew through, and the arithmetic units stay busy. Prefill produces key and value vectors (the KV cache, explained in L29) that will be reused during token-by-token generation (decode, explained in L28).

Prompt: "The cat sat on" (4 tokens). Prefill processes all 4 at once:

input: [The, cat, sat, on] — 4 tokens, all at once
attention — Q·Kᵀ is [4, 4], all pairs computed
FFN — 4 rows multiplied together as one batch
output: hidden states for all 4 positions + KV cache filled

One big forward pass. The larger the prompt, the larger the matrix multiplies.

Input to each layer during prefill: [n_prompt, d_model]
Attention Q·Kᵀ per head: [n_prompt, n_prompt]
FFN input: [n_prompt, d_model]
All dimensions scale with n_prompt — this is why prefill uses large matrix multiplies.

During prefill, every weight matrix is multiplied by a matrix with n_prompt rows — not a single vector. Consider the Q projection: W_Q has shape [d_model, d_head]. With n_prompt = 1,000, the multiply is [1000, d_model] × [d_model, d_head]. The weight matrix is loaded once from memory but used 1,000 times (once per token row). This high reuse ratio means the hardware's compute units stay busy — the bottleneck is arithmetic, not memory bandwidth.

The attention score matrix makes this even more compute-heavy: Q · KT produces an [n_prompt, n_prompt] matrix. For a 4,000-token prompt, that is 16 million entries per head, each requiring a d_head-dimensional dot product. This quadratic cost is why long prompts have noticeably slow prefill.

From the user's perspective, prefill determines the time-to-first-token (TTFT) — the delay between sending the prompt and seeing the first output token. During this time, the model is processing the entire prompt through all layers, filling the KV cache, and computing the logits for the first generated position.

TTFT grows with prompt length: a 500-token prompt might take 50ms, while a 50,000-token prompt might take 5 seconds or more. This is why long system prompts, lengthy documents, or chat histories with extensive context can feel sluggish — the user is waiting for prefill to complete before any output streams back.

Some serving systems split long prefills into chunks (using the ubatch mechanism from L30) to reduce memory peaks, but the total compute work is the same regardless of chunking.

The key insight is that prefill computes attention for all prompt positions at once:

Q = X · W_Q   // [n_prompt, d_head]
K = X · W_K   // [n_prompt, d_head]
scores = Q · Kᵀ // [n_prompt, n_prompt]
n_prompt rows in Q × n_prompt columns in Kᵀ = large GEMM

In llama.cpp, prefill and decode share the same computation graph but with different input sizes. During prefill, the batch contains prompt tokens, producing large matrix multiplies that saturate GPU compute. Long prompts may require multiple llama_decode() calls and are internally split into ubatches depending on batch size settings.

Prefill is usually compute-bound: the GPU spends most of its time on arithmetic, not waiting for memory. This means prefill benefits from faster compute (higher FLOPS) more than from faster memory bandwidth. A 2,000-token prompt creates matrix multiplies large enough to keep GPU cores busy — a 200-token prompt may not.

Check Yourself
reasoningQ1

During prefill for a 2,000-token prompt, the attention score matrix for one head has shape [2000, 2000]. During decode of the next token, the same head computes a score vector of shape [1, 2001]. Which phase is more likely to be compute-bound, and why?

conceptualQ2

What is the primary bottleneck during prefill for a long prompt?