Case Study: Interpreting a Prefill vs Decode Profile

What does a real profile tell you?

A profile shows you where the runtime actually goes — but reading one correctly is a skill. The most common mistake is jumping to a local conclusion from the single hottest function. Instead, you need to look at the distribution of time across operator types and ask what that distribution tells you about the system's bottleneck.

Prefill and decode have fundamentally different profiles:

  • Prefill: processes many tokens at once. Large matrix multiplies dominate, so the workload is compute-bound. The hotspot is typically in GEMM (general matrix multiply) kernels.
  • Decode: processes one token at a time. Each operation loads large weight matrices for a tiny computation, so the workload is memory-bandwidth-bound. The hotspot is weight loading, not arithmetic.

A profile that shows 80% of time in matrix multiply kernels during prefill is normal: that is the GPU doing useful compute. The same 80% during decode means something different. The arithmetic inside the kernel is not the limit; most of that time is spent waiting for weights to arrive from memory.

The key habit: look at the distribution first, hypothesize the bottleneck class (compute vs memory vs overhead), then drill into individual functions. Never start with the hottest function and assume that is "the problem."

A profile is a table of functions (or operations) and the time spent in each. Here is a systematic approach to interpreting one:

  1. Classify the mode. Is this a prefill profile or a decode profile? Check the batch size or token count. If you do not know, the profile distribution itself tells you: if GEMM dominates (>60%), it is likely prefill. If GEMV + memory ops dominate, it is likely decode.
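  2. Sum the categories. Group operations into compute (GEMM/GEMV), memory (KV cache reads/writes, weight loads), normalization (RMSNorm, softmax), and overhead (kernel launch, sync, memory allocation). Weight loads usually do not appear as their own line; that time shows up inside the GEMV kernels, which is why a GEMV-heavy decode profile points to memory rather than compute. The category totals reveal the bottleneck class.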
  3. Check the overhead fraction. In a healthy profile, overhead should be <15% for prefill and <25% for decode. High overhead (>30%) suggests problems: too many small kernel launches, excessive synchronization, or memory allocation during inference.
  4. Compare against expected counts. Derive the expected kernel count from your model's block structure: count the matmuls per layer (Q/K/V projections, output projection, FFN matrices) and multiply by the number of layers. If the profile shows significantly more calls than expected, investigate — something may be computed redundantly. If certain layers take much longer than others, check for heterogeneous config (MoE, different attention type).
  5. Only then look at individual hotspots. Now that you understand the big picture, drill into the top 3-5 functions. Ask: is this function slow because of its inherent cost, or because it is being called more often than expected?
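Steps 1 through 4 can be sketched in a few lines of Python. This is a minimal illustration, not a real profiler integration: the operator names, the `CATEGORY_OF` mapping, the 50% cutoff used for "dominate" in decode, the "elevated" label for overhead between the healthy limit and 30%, and the default of three FFN matrices per layer are all assumptions made for the example. The sample numbers mirror a healthy prefill distribution.

```python
from collections import defaultdict

# Hypothetical operator names mapped to the categories from step 2.
# Real profilers emit different names; adapt the keys to your tool's output.
CATEGORY_OF = {
    "gemm": "compute",
    "gemv": "compute",
    "kv_cache": "memory",
    "softmax": "normalization",
    "rmsnorm": "normalization",
    "overhead": "overhead",  # launch, sync, allocation
}


def guess_mode(profile):
    """Step 1: guess prefill vs decode from the distribution (percent per op)."""
    if profile.get("gemm", 0.0) > 60.0:
        return "prefill"
    # Assumption: "dominate" is taken to mean more than half of total time.
    if profile.get("gemv", 0.0) + profile.get("kv_cache", 0.0) > 50.0:
        return "decode"
    return "unclear"


def sum_categories(profile):
    """Step 2: total the percentages per category."""
    totals = defaultdict(float)
    for op, pct in profile.items():
        totals[CATEGORY_OF.get(op, "uncategorized")] += pct
    return dict(totals)


def overhead_status(mode, overhead_pct):
    """Step 3: compare overhead against the thresholds quoted in the list."""
    if overhead_pct > 30.0:
        return "pathological"
    healthy_limit = 25.0 if mode == "decode" else 15.0
    return "healthy" if overhead_pct < healthy_limit else "elevated"


def expected_matmuls(n_layers, ffn_matrices=3):
    """Step 4: Q, K, V, and output projections plus FFN matrices, per layer.

    ffn_matrices=3 assumes a gated FFN; use 2 for a plain two-matrix FFN.
    """
    return n_layers * (4 + ffn_matrices)


prefill = {"gemm": 72, "softmax": 8, "rmsnorm": 4, "elementwise": 6, "overhead": 10}
mode = guess_mode(prefill)
totals = sum_categories(prefill)
print(mode, totals, overhead_status(mode, totals["overhead"]))
print("expected matmul calls for 32 layers:", expected_matmuls(32))
```

Only after these checks pass does it make sense to move to step 5 and inspect individual functions. If the measured call count is far above `expected_matmuls`, that discrepancy is usually a better lead than the hottest kernel.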

After reading many profiles, certain patterns become recognizable:

  • Healthy prefill: 65-80% GEMM, 5-10% softmax, <15% overhead. The compute units are busy doing useful work. Optimization targets: GEMM kernel efficiency, tiling strategy.
  • Healthy decode: 50-65% GEMV, 15-20% KV cache, <25% overhead. Memory bandwidth is saturated. Optimization targets: quantization (reduce bytes loaded), KV cache layout.
  • Overhead-dominated: >30% in sync/launch/alloc. This is pathological — the hardware is mostly idle. Causes: too-small batch sizes, un-fused operations creating many tiny kernels, memory allocation during the hot loop instead of pre-allocation.
  • Softmax-heavy prefill: >15% in softmax. This usually means very long sequences — the softmax cost scales with n_tokens² while projection GEMMs are linear in n_tokens (for fixed model width). At extreme sequence lengths, attention score computation and softmax can rival the projection GEMMs.
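The scaling behind the softmax-heavy pattern can be checked with a rough FLOP count. The sketch below assumes standard multi-head attention with square d_model × d_model projections, counts 2 FLOPs per multiply-add, and ignores any savings from causal masking; the function names and the d_model value are chosen for illustration only.

```python
def projection_flops(n_tokens, d_model):
    """Q, K, V, and output projections: four [n_tokens, d_model] x [d_model, d_model] GEMMs."""
    return 4 * 2 * n_tokens * d_model * d_model


def attention_score_flops(n_tokens, d_model):
    """Q·K^T and attn·V summed over all heads: each costs 2 * n_tokens^2 * d_model."""
    return 2 * 2 * n_tokens * n_tokens * d_model


def softmax_elements(n_tokens, n_heads):
    """Number of score entries the softmax has to process: one n_tokens x n_tokens matrix per head."""
    return n_heads * n_tokens * n_tokens


D_MODEL = 4096  # assumed model width for the example
for n_tokens in (512, 4096, 32768):
    ratio = attention_score_flops(n_tokens, D_MODEL) / projection_flops(n_tokens, D_MODEL)
    print(f"n_tokens={n_tokens:6d}  attention/projection FLOP ratio = {ratio:.4f}")
```

Under these assumptions the ratio simplifies to n_tokens / (2 · d_model), so the quadratic terms catch up with the four projections once the sequence is about twice the model width. FFN matmuls are left out here, which pushes the real crossover to longer sequences.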

A profile tells you where the time goes. It does not directly tell you what to do. The translation depends on the bottleneck class:

Compute-bound (prefill):
→ Better SIMD utilization, kernel tiling, fused operations
Memory-bound (decode):
→ Quantization, smaller data types, weight layout optimization, batching
Overhead-bound:
→ Operation fusion, pre-allocation, reducing kernel launch count

Applying the wrong category of optimization wastes effort. Optimizing SIMD instructions when you are memory-bound will have negligible effect — the ALUs are already waiting for data.

Two synthetic profiles from the same model, same hardware:

Prefill (512 tokens)
  GEMM (Q·Kᵀ, attn·V, projections)     72%
  Softmax                                8%
  RMSNorm                                4%
  Elementwise (RoPE, residual)           6%
  Other (launch overhead, sync)         10%

Decode (1 token)
  GEMV (projections, FFN)               58%
  KV cache read                         18%
  Softmax                                3%
  Elementwise                            5%
  Other (launch overhead, sync)         16%

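Prefill: dominated by compute (GEMM at 72%). Decode: dominated by memory access (GEMV + KV cache reads, 76% combined) with a higher overhead fraction than prefill (16% vs 10%), though still within the healthy range for decode.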

Prefill GEMM: [n_tokens, d_model] × [d_model, d_out]   (large n_tokens, compute-heavy)
Decode GEMV: [1, d_model] × [d_model, d_out]   (n_tokens=1, memory-heavy)
Same weight matrices, vastly different arithmetic intensity. This is why the bottleneck shifts.

Arithmetic intensity determines the bottleneck regime:

arithmetic_intensity = FLOPs / bytes_loaded
Prefill: high intensity (many tokens reuse the same weights) → compute-bound
Decode: low intensity (1 token, full weight load) → memory-bound
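To put numbers on the formula, here is a small sketch for a single projection matmul. It assumes 2-byte elements (for example fp16), assumes each tensor crosses the memory bus exactly once, and compares the result against a machine balance point (peak FLOP/s divided by memory bandwidth). The hardware figures are hypothetical round numbers, not measurements of any real device.

```python
def arithmetic_intensity(n_tokens, d_model, d_out, bytes_per_element=2):
    """FLOPs per byte for [n_tokens, d_model] x [d_model, d_out]."""
    flops = 2 * n_tokens * d_model * d_out  # one multiply and one add per term
    elements_moved = (
        d_model * d_out        # weights, read once
        + n_tokens * d_model   # input activations
        + n_tokens * d_out     # output activations
    )
    return flops / (elements_moved * bytes_per_element)


def regime(intensity, peak_flops, bandwidth_bytes_per_s):
    """Compare intensity with the machine balance point (FLOPs the chip can do per byte loaded)."""
    balance = peak_flops / bandwidth_bytes_per_s
    return "compute-bound" if intensity > balance else "memory-bound"


# Hypothetical hardware: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
BANDWIDTH = 1e12

for n_tokens in (512, 1):
    ai = arithmetic_intensity(n_tokens, d_model=4096, d_out=4096)
    print(f"n_tokens={n_tokens:4d}  intensity={ai:7.1f} FLOPs/byte  "
          f"{regime(ai, PEAK_FLOPS, BANDWIDTH)}")
```

With one token the intensity sits near 1 FLOP per byte, because every weight is loaded for a single multiply-add. With 512 tokens the same weights are reused 512 times, which lifts the intensity by more than two orders of magnitude.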

The distinction between prefill and decode is not explicit in a single file — it emerges from the batch size passed to the graph builder. When the batch contains many tokens (prefill), the matrix multiplies are large GEMMs. When it contains one token (decode), they become GEMVs. The same graph code runs both cases; only the shapes change.

If your decode profile shows high "other/overhead" percentage (kernel launch, synchronization), that often means the individual operations are so fast (tiny matrices) that launch overhead becomes a significant fraction. This is a common signal that operation fusion or batching could help — but it requires looking at the distribution, not just the hottest kernel.
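A toy model shows why fusion helps in this situation. It assumes every kernel pays a fixed launch cost, that launches are not hidden behind running kernels, and that fusion leaves the total useful work unchanged. The kernel counts and timings are hypothetical.

```python
def overhead_fraction(n_kernels, total_work_us, launch_us):
    """Share of wall time spent on launches when each kernel pays a fixed launch cost."""
    launch_total = n_kernels * launch_us
    return launch_total / (launch_total + total_work_us)


# Hypothetical decode step: 2000 us of useful work, 2 us per kernel launch.
TOTAL_WORK_US = 2000.0
LAUNCH_US = 2.0

unfused = overhead_fraction(400, TOTAL_WORK_US, LAUNCH_US)  # many tiny kernels
fused = overhead_fraction(100, TOTAL_WORK_US, LAUNCH_US)    # same work, 4x fewer launches

print(f"unfused overhead: {unfused:.1%}")
print(f"fused overhead:   {fused:.1%}")
```

In this model no individual kernel gets faster, yet the overhead share drops because the fixed cost is paid fewer times. That is the effect a distribution-level reading of the profile surfaces and a hottest-kernel reading misses.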

Check Yourself
Q1 (conceptual)

A decode profile shows 58% in GEMV and 18% in KV cache reads. What bottleneck class does this suggest?

Q2 (conceptual)

Why is the same model compute-bound during prefill but memory-bound during decode?

Q3 (conceptual)

A prefill profile shows: GEMM 45%, softmax 5%, RMSNorm 3%, elementwise 4%, other/overhead 43%. What does this suggest?