L32

Where Time Goes in an LLM

Where does time go during inference?

An LLM inference request runs through dozens of layers, each containing several operators: attention projections (Q, K, V, O), attention score computation, FFN up/gate/down projections, normalization, and finally an output-head projection. Not all operators cost the same amount of time.

The matrix multiplications in the projections dominate. Attention projections, FFN projections, and the output head together account for the vast majority of compute. Normalization and elementwise operations (residual adds, activation functions) are comparatively cheap.

But the mix changes between prefill and decode (the two phases you learned about in L27–L28). During prefill, the model processes many tokens at once, so all projections run as large matrix-matrix multiplies (GEMM), and attention score computation over the full prompt can also be significant. During decode, only one token is processed per step, so projections become matrix-vector multiplies (GEMV) and the attention score computation is much smaller per step. Each weight is still read from memory once per step but is now used for only one token, so the cost of loading model weights from memory matters more in decode.

Rough operator time breakdown for a single layer (illustrative):

Prefill (512 tokens):
  QKV projections: 30%
  Attention scores + softmax: 15%
  Output projection: 10%
  FFN (up + gate + down): 40%
  Norm + residual: 5%

Decode (1 token):
  QKV projections: 25%
  Attention (KV cache read): 20%
  Output projection: 10%
  FFN (up + gate + down): 40%
  Norm + residual: 5%

Exact percentages vary by model and hardware, but the pattern is consistent: the attention and FFN projections dominate both phases.

A projection is a matrix multiply: [n_tokens, d_in] × [d_in, d_out]. It takes n_tokens × d_in × d_out multiply-add operations, and counting each multiply-add as 2 FLOPs gives 2 × n_tokens × d_in × d_out FLOPs. For a single FFN up-projection with d_model = 4096 and d_ffn = 16384:

1 token: 2 × 4096 × 16384 = 134M FLOPs  (2 for multiply-add)
512 tokens: 512 × 2 × 4096 × 16384 = 68.7B FLOPs

One layer has multiple projections: Q, K, V, output (attention side) plus up, gate, down (FFN side). That is 7 large weight-matrix multiplies per layer, so a 32-layer model does 224 of them per forward pass, before counting the output head. The normalization, residual adds, and activation functions are elementwise: they touch the same activations but do far fewer operations per element.

This is why performance optimization for LLMs is largely about optimizing matrix multiplies. If you can make ggml_mul_mat 20% faster, the entire model speeds up by nearly as much (roughly 17–18% if matmuls are about 90% of runtime).
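
How much of a matmul speedup carries over to the whole model depends on the matmul share of runtime, which is Amdahl's law. The sketch below assumes "20% faster" means 1.2x throughput on the matmul portion only; the shares in the loop are illustrative, not measured.

```python
def overall_speedup(matmul_fraction: float, matmul_speedup: float) -> float:
    """Amdahl's law: only the matmul share of the runtime shrinks."""
    new_time = (1.0 - matmul_fraction) + matmul_fraction / matmul_speedup
    return 1.0 / new_time


# Illustrative matmul shares of total runtime (assumptions, not measurements).
for fraction in (0.80, 0.90, 0.95):
    gain = overall_speedup(fraction, 1.20) - 1.0
    print(f"matmul share {fraction:.0%}: whole model ~{gain:.1%} faster")
```

The closer the matmul share is to 100%, the closer the end-to-end gain gets to the full 20%.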

When you profile an LLM inference run, you will see a list of operators with their execution times. The right way to read it:

  1. Look at the distribution first. What fraction of time is in matrix multiplies vs everything else? If matmuls are 90%, optimizing softmax (3%) is a waste of effort.
  2. Classify the bottleneck. Is the time spent doing arithmetic (compute-bound) or waiting for data (memory-bound)? The answer depends on whether this is prefill or decode and on the shapes involved.
  3. Then drill into individual operators. Only after you know the bottleneck class does it make sense to look at which specific matmul is slowest.
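
Step 1 can be sketched as a few lines that turn per-operator timings into shares of total time. The timings below are hypothetical, chosen to match the 90% / 3% example above; the operator labels are placeholders and the dictionary is not the output format of any particular profiler.

```python
# Hypothetical per-operator timings in milliseconds. Labels and numbers are
# made up for illustration and do not come from a real trace.
profile_ms = {
    "MUL_MAT": 90.0,
    "SOFT_MAX": 3.0,
    "RMS_NORM": 2.5,
    "ADD": 2.5,
    "ROPE": 2.0,
}


def time_shares(profile: dict[str, float]) -> dict[str, float]:
    """Fraction of total time spent in each operator."""
    total = sum(profile.values())
    return {op: ms / total for op, ms in profile.items()}


def max_gain_if_removed(share: float) -> float:
    """Upper bound on the overall speedup if one operator became free."""
    return 1.0 / (1.0 - share) - 1.0


shares = time_shares(profile_ms)
for op, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{op:<10} {share:6.1%}")

softmax_gain = max_gain_if_removed(shares["SOFT_MAX"])
print(f"Best case from eliminating SOFT_MAX: {softmax_gain:.1%}")
```

Even making softmax free would buy only about 3% here, which is why the distribution comes before any per-operator tuning.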

The most common beginner mistake: looking at the single hottest function and assuming "that is the problem." Often the hottest function is doing exactly what it should — it is just the largest operation. The real question is whether it is running at the hardware's capability for that operation type.

Each projection: [n_tokens, d_in] × [d_in, d_out]
Prefill: n_tokens is large → big GEMM
Decode: n_tokens = 1 → GEMV
The same weight matrix is used in both phases, but the workload shape changes completely.
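
One way to quantify the shape change is arithmetic intensity: FLOPs performed per byte moved. The sketch below assumes 2-byte (fp16) values and an idealized model in which each input, weight, and output value crosses the memory bus exactly once; real kernels and quantized weights will differ. The function name is hypothetical.

```python
def arithmetic_intensity(n_tokens: int, d_in: int, d_out: int,
                         bytes_per_value: int = 2) -> float:
    """FLOPs per byte for [n_tokens, d_in] x [d_in, d_out] (idealized)."""
    flops = 2 * n_tokens * d_in * d_out
    values_moved = (
        n_tokens * d_in      # input activations
        + d_in * d_out       # weight matrix
        + n_tokens * d_out   # output activations
    )
    return flops / (bytes_per_value * values_moved)


gemv = arithmetic_intensity(1, 4096, 4096)    # decode-like shape
gemm = arithmetic_intensity(512, 4096, 4096)  # prefill-like shape

print(f"decode  (n_tokens=1):   {gemv:.2f} FLOPs/byte")
print(f"prefill (n_tokens=512): {gemm:.1f} FLOPs/byte")
```

Under these assumptions the decode shape does about one FLOP per byte loaded, while the prefill shape does several hundred, so the same weight matrix can be memory-bound in one phase and compute-bound in the other.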

Total projection FLOPs per layer (approximate, for one token in decode, excluding attention-score computation):

QKV projections: 3 × 2 × d_model × d_model
Output projection: 2 × d_model × d_model
FFN (SwiGLU): 3 × 2 × d_model × d_ffn
For a 7B model: d_model ~ 4096, d_ffn ~ 11008 → FFN alone is ~271M FLOPs per token per layer.
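
The per-layer formulas above can be checked with a short script. This sketch follows the formulas exactly as written (K and V projections as wide as Q, SwiGLU FFN with three matrices); the function name is made up for illustration.

```python
def layer_projection_flops(d_model: int, d_ffn: int) -> dict[str, int]:
    """Per-token projection FLOPs for one layer, multiply-add counted as 2."""
    return {
        "qkv": 3 * 2 * d_model * d_model,  # Q, K, V projections
        "out": 2 * d_model * d_model,      # attention output projection
        "ffn": 3 * 2 * d_model * d_ffn,    # up, gate, down (SwiGLU)
    }


flops = layer_projection_flops(d_model=4096, d_ffn=11008)
total = sum(flops.values())

for name, value in flops.items():
    print(f"{name}: {value / 1e6:.0f}M FLOPs")
print(f"total: {total / 1e6:.0f}M FLOPs per token per layer")
```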

In llama.cpp, each layer's operators are built in src/llama-graph.cpp via functions like build_attn() and build_ffn(). Profiling with GGML_PERF (in builds that support it) or with external tools shows that ggml_mul_mat calls dominate the trace in both prefill and decode.

Knowing where time goes tells you where to look first. If prefill is slow, the large GEMM projections and attention-score computation are the suspects. If decode is slow, model-weight loading (memory bandwidth) for each projection is usually the bottleneck since only one token is processed at a time.
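
The decode case can be turned into a rough ceiling: if every weight must be read once per generated token, tokens per second cannot exceed memory bandwidth divided by the size of the weights. The sketch below is a back-of-the-envelope estimate only. The 7B parameter count, the bytes-per-parameter values, and the 100 GB/s bandwidth are assumptions chosen for illustration, and the estimate ignores KV cache reads, activations, and compute time.

```python
def decode_tokens_per_second_bound(n_params: float,
                                   bytes_per_param: float,
                                   bandwidth_bytes_per_s: float) -> float:
    """Upper bound on decode speed if weight loading were the only cost."""
    weight_bytes = n_params * bytes_per_param
    return bandwidth_bytes_per_s / weight_bytes


N_PARAMS = 7e9     # assumed parameter count
BANDWIDTH = 100e9  # assumed memory bandwidth: 100 GB/s (hypothetical machine)

fp16_bound = decode_tokens_per_second_bound(N_PARAMS, 2.0, BANDWIDTH)
q4_bound = decode_tokens_per_second_bound(N_PARAMS, 0.5, BANDWIDTH)

print(f"fp16 weights:  at most ~{fp16_bound:.1f} tokens/s")
print(f"4-bit weights: at most ~{q4_bound:.1f} tokens/s")
```

If measured decode speed is close to this ceiling, the projections are already running near what the memory system allows, and faster arithmetic will not help much.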

Check Yourself
Q1 (reasoning)

During prefill with a long prompt, which operators are likely to dominate wall-clock time?

Q2 (reasoning)

Why does the operator mix differ between prefill and decode?