L27

Prefill Processes the Prompt

14 min

Question

What happens during prefill?

Intuition

When you send a prompt to the model, all prompt tokens are known up front. The model does not need to generate them one at a time — it already has the full sequence. This lets the model process the entire prompt as one logical phase (which may be split into chunks, as you will learn in L30).

This phase is called prefill. Every prompt token flows through every layer: attention computes relationships between all prompt positions, and the FFN transforms each position. Because all tokens are available simultaneously, the Q, K, V, and FFN projections are large matrix multiplies — the [n_tokens, d_model] × [d_model, d_out] pattern you learned in L08, but with n_tokens potentially in the thousands.

Prefill is compute-bound. The GPU has big matrices to chew through, and the arithmetic units stay busy. Prefill produces key and value vectors (the KV cache, explained in L29) that will be reused during token-by-token generation (decode, explained in L28).

Toy Example

Prompt: "The cat sat on" (4 tokens). Prefill processes all 4 at once:

input: [The, cat, sat, on] — 4 tokens, all at once

attention — Q·Kᵀ is [4, 4], all pairs computed

FFN — 4 rows multiplied together as one batch

output: hidden states for all 4 positions + KV cache filled

One big forward pass. The larger the prompt, the larger the matrix multiplies.

Shapes

Input to each layer during prefill: [n_prompt, d_model]

Attention Q·Kᵀ per head: [n_prompt, n_prompt]

FFN input: [n_prompt, d_model]

All dimensions scale with n_prompt — this is why prefill uses large matrix multiplies.

Why Prefill Is Compute-Bound

During prefill, every weight matrix is multiplied by a matrix with n_prompt rows — not a single vector. Consider the Q projection: W_Q has shape [d_model, d_head]. With n_prompt = 1,000, the multiply is [1000, d_model] × [d_model, d_head]. The weight matrix is loaded once from memory but used 1,000 times (once per token row). This high reuse ratio means the hardware's compute units stay busy — the bottleneck is arithmetic, not memory bandwidth.

The attention score matrix makes this even more compute-heavy: Q · K^T produces an [n_prompt, n_prompt] matrix. For a 4,000-token prompt, that is 16 million entries per head, each requiring a d_head-dimensional dot product. This quadratic cost is why long prompts have noticeably slow prefill.

Time-to-First-Token (TTFT)

From the user's perspective, prefill determines the time-to-first-token (TTFT) — the delay between sending the prompt and seeing the first output token. During this time, the model is processing the entire prompt through all layers, filling the KV cache, and computing the logits for the first generated position.

TTFT grows with prompt length: a 500-token prompt might take 50ms, while a 50,000-token prompt might take 5 seconds or more. This is why long system prompts, lengthy documents, or chat histories with extensive context can feel sluggish — the user is waiting for prefill to complete before any output streams back.

Some serving systems split long prefills into chunks (using the ubatch mechanism from L30) to reduce memory peaks, but the total compute work is the same regardless of chunking.

Math

The key insight is that prefill computes attention for all prompt positions at once:

Q = X · W_Q // [n_prompt, d_head]

K = X · W_K // [n_prompt, d_head]

scores = Q · Kᵀ // [n_prompt, n_prompt]

n_prompt rows in Q × n_prompt columns in Kᵀ = large GEMM

Implementation Hook

In llama.cpp, prefill and decode share the same computation graph but with different input sizes. During prefill, the batch contains prompt tokens, producing large matrix multiplies that saturate GPU compute. Long prompts may require multiple llama_decode() calls and are internally split into ubatches depending on batch size settings.

src/llama-context.cpp — llama_decode() (L3432)

Performance Hook

Prefill is usually compute-bound: the GPU spends most of its time on arithmetic, not waiting for memory. This means prefill benefits from faster compute (higher FLOPS) more than from faster memory bandwidth. A 2,000-token prompt creates matrix multiplies large enough to keep GPU cores busy — a 200-token prompt may not.

Check Yourself

reasoningQ1

During prefill for a 2,000-token prompt, the attention score matrix for one head has shape [2000, 2000]. During decode of the next token, the same head computes a score vector of shape [1, 2001]. Which phase is more likely to be compute-bound, and why?

Prefill — computing the [2000, 2000] score matrix requires millions of dot products, keeping compute units saturatedDecode — the [1, 2001] score vector still reads all cached keys, which is more work overallBoth are equally compute-bound because they use the same weight matrices

conceptualQ2

What is the primary bottleneck during prefill for a long prompt?

Memory bandwidth — reading the KV cacheCompute — arithmetic on large matricesNetwork latency — waiting for tokens to arrive