Long Context Changes Which Costs Grow

Why does longer context change costs?

Not all parts of the model scale the same way with sequence length. When you double the context, some costs double and some more than double. Understanding which is which lets you predict where the bottleneck will be.

FFN cost scales linearly with token count. Each token is processed independently, so twice as many tokens means twice the FFN work. Simple.

Attention cost during prefill scales quadratically with sequence length. Each token attends to itself and all tokens before it, so the total number of query-key pairs is 1 + 2 + ... + n = n(n+1)/2, which grows as n². During decode, attention cost per step grows linearly with n_past (the cached sequence length), because the single new token attends to all previous tokens.

KV cache memory scales linearly with sequence length. Twice the context means twice the cache memory.

This means a 32,000-token prompt is not just "16,000 tokens twice." The attention computation during prefill is roughly 4x that of a 16,000-token prompt (quadratic), while the FFN work is only 2x (linear). The bottleneck shifts depending on the context length.

Relative costs at 1k, 4k, and 16k context, with each row normalized to its own cost at 1k:

                 1k tokens   4k tokens   16k tokens
FFN              1x          4x          16x
Attn (prefill)   1x          16x         256x
KV cache         1x          4x          16x

In absolute terms, FFN dominates at 1k and attention prefill dominates at 16k: between those two lengths attention grows 256x while FFN grows only 16x. The bottleneck shifts with context length.
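The multipliers above follow directly from the scaling rules. A minimal sketch, assuming only those rules (linear for FFN and KV cache, quadratic for prefill attention); the function name `relative_costs` is made up for illustration:

```python
def relative_costs(n_tokens, baseline=1024):
    """Cost multipliers relative to a baseline context length."""
    ratio = n_tokens / baseline
    return {
        "ffn": ratio,                # linear in n
        "attn_prefill": ratio ** 2,  # quadratic in n
        "kv_cache": ratio,           # linear in n
    }

for n in (1024, 4096, 16384):
    c = relative_costs(n)
    print(f"{n:>6} tokens  FFN {c['ffn']:.0f}x  "
          f"attn {c['attn_prefill']:.0f}x  KV {c['kv_cache']:.0f}x")
```

Calling `relative_costs(32000, baseline=16000)` gives the 16k-versus-32k comparison: 2x for FFN and KV cache, 4x for prefill attention.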

FFN work per layer: O(n × d_model × d_ff)
Attention work per layer (prefill): O(n² × d_model)   (summed across all heads)
KV cache memory: O(n × n_layers × n_kv_heads × d_head)
n² for attention vs n for everything else. This divergence drives long-context behavior.
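To see the divergence in numbers, the big-O expressions can be turned into rough FLOP counts. This is a sketch under stated assumptions, not a profile of any real model: it counts 2 FLOPs per multiply-add, ignores the Q/K/V/output projections and any causal-mask savings, and uses hypothetical dimensions (`d_model=4096`, `d_ff=11008`, three FFN matrices as in a gated FFN). The exact crossover point depends on all of these choices.

```python
def ffn_flops(n, d_model, d_ff, n_matrices=3):
    # Each [n, d_model] x [d_model, d_ff] matmul costs about 2*n*d_model*d_ff FLOPs.
    return 2 * n * d_model * d_ff * n_matrices

def attn_score_flops(n, d_model):
    # Q.K^T and (scores).V each cost about 2*n*n*d_model FLOPs summed over heads.
    return 4 * n * n * d_model

# Hypothetical dimensions, for illustration only.
D_MODEL, D_FF = 4096, 11008

for n in (1024, 4096, 16384, 131072):
    ratio = attn_score_flops(n, D_MODEL) / ffn_flops(n, D_MODEL, D_FF)
    print(f"n={n:>6}: attention / FFN FLOP ratio = {ratio:.2f}")
```

Because attention is O(n²) and FFN is O(n), the ratio itself grows linearly with n: doubling the context doubles attention's cost relative to FFN.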

The scaling difference comes from the attention score matrix:

Attention scores: Q · Kᵀ = [n, d_head] · [d_head, n] = [n, n]
n² entries. Double n → 4x entries.
FFN: X · W = [n, d_model] · [d_model, d_ff] = [n, d_ff]
n × d_ff entries. Double n → 2x entries.
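The same shape bookkeeping can be checked mechanically. The helper below is a hypothetical illustration in plain Python (no tensor library); it applies the matrix-multiply shape rule and counts output entries. The default dimensions are assumptions chosen only to make the numbers concrete.

```python
def matmul_shape(a, b):
    """Output shape of A @ B for 2-D shapes a=(rows, inner), b=(inner, cols)."""
    if a[1] != b[0]:
        raise ValueError(f"inner dimensions differ: {a} vs {b}")
    return (a[0], b[1])

def entry_counts(n, d_head=128, d_model=4096, d_ff=11008):
    scores = matmul_shape((n, d_head), (d_head, n))     # Q . K^T -> [n, n]
    ffn = matmul_shape((n, d_model), (d_model, d_ff))   # X . W   -> [n, d_ff]
    return scores[0] * scores[1], ffn[0] * ffn[1]

for n in (1024, 2048):
    s, f = entry_counts(n)
    print(f"n={n}: score entries={s:,}  FFN output entries={f:,}")
```

Going from n=1024 to n=2048 quadruples the score entries and doubles the FFN output entries.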

In llama.cpp, the context size is set with -c. Flash attention (-fa) reduces the memory footprint of attention by not materializing the full [n, n] score matrix, but the compute still scales quadratically. Chunked prefill (controlled by the ubatch size, -ub) helps manage peak memory when processing very long prompts, because the prompt is fed through the model in fixed-size pieces rather than all at once.

Context scaling is not just a theoretical concern — it directly determines what you can deploy:

  • KV cache is the memory gate. A 32-layer model with 32 KV heads, d_head=128, and FP16 stores ~512 KB per token in KV cache. At 128K context, that is ~64 GB of cache alone — potentially exceeding GPU memory. This is why architectures like GQA and SWA exist: GQA reduces the per-token cache size (fewer KV heads), while SWA bounds the total cache by evicting entries beyond the window.
  • Attention compute is the prefill gate. At 128K context, the attention score matrices are 128K × 128K = 16 billion entries per head per layer. This makes prefill slow — potentially many seconds for a long document. Flash attention reduces memory but not this compute cost.
  • The bottleneck shifts with context length. At short contexts, FFN projections dominate the per-layer cost. As context grows, attention's O(n²) cost overtakes FFN and becomes the bottleneck during prefill. During decode, the workload is memory-bound at any context length: weight reads dominate at short contexts, and KV cache reads take a growing share as the context lengthens. Knowing which regime you are in determines which optimizations matter.
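The KV cache figures in the first bullet can be reproduced with a few lines of arithmetic. A minimal sketch: the function name `kv_cache_bytes` is hypothetical, and the 8-KV-head GQA variant is an assumed configuration added only for comparison.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_kv_heads=32, d_head=128,
                   bytes_per_elem=2):
    # Factor of 2: one K vector and one V vector per head, per layer, per token.
    return 2 * n_tokens * n_layers * n_kv_heads * d_head * bytes_per_elem

KIB, GIB = 1024, 1024 ** 3

per_token = kv_cache_bytes(1)                          # FP16, 32 KV heads
full_ctx = kv_cache_bytes(128 * 1024)                  # 128K tokens
gqa_ctx = kv_cache_bytes(128 * 1024, n_kv_heads=8)     # assumed GQA config

print(f"per token:        {per_token / KIB:.0f} KiB")
print(f"128K context:     {full_ctx / GIB:.0f} GiB")
print(f"128K, 8 KV heads: {gqa_ctx / GIB:.0f} GiB")
```

Cutting the KV heads from 32 to 8 cuts the cache by the same factor of 4, which is the per-token saving GQA provides.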

As a rough guide (the exact thresholds depend on model dimensions and hardware): at short contexts (under ~2K), FFN projections dominate per-layer cost during prefill, and because prefill batches many tokens into each GEMM, the work is compute-bound on GPUs. As context grows past ~8K, attention's quadratic cost overtakes FFN and becomes the main compute bottleneck during prefill. During decode, the workload is memory-bandwidth-bound at any context length. The optimization strategy depends on which regime dominates your workload.
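For the decode regime, a back-of-the-envelope bound makes the memory-bandwidth limit concrete. This sketch assumes single-sequence decoding in which every weight and every cached K/V entry is read once per generated token; the weight size, per-token cache size, and bandwidth below are hypothetical values, not measurements.

```python
def decode_step_seconds(weight_bytes, kv_bytes, bandwidth_bytes_per_s):
    """Lower bound on one decode step if it is purely memory-bandwidth-bound."""
    return (weight_bytes + kv_bytes) / bandwidth_bytes_per_s

# Hypothetical numbers, for illustration only.
WEIGHT_BYTES = 14e9               # e.g. ~7B parameters stored in FP16
KV_BYTES_PER_TOKEN = 512 * 1024   # the 32-layer, 32-KV-head FP16 example
BANDWIDTH = 1e12                  # assumed 1 TB/s of GPU memory bandwidth

for n_past in (1024, 16 * 1024, 128 * 1024):
    t = decode_step_seconds(WEIGHT_BYTES, n_past * KV_BYTES_PER_TOKEN, BANDWIDTH)
    print(f"n_past={n_past:>6}: at least {t * 1e3:.1f} ms per token")
```

Under these assumptions the weight reads set the floor at short contexts, while the KV cache term grows linearly with n_past and eventually exceeds it.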

Check Yourself
Q1 (conceptual)

Why is a 32k-token prompt not just "16k tokens twice"?

Q2 (shape)

If you double the sequence length, how does KV cache memory change (in terms of used cache entries)?