Long Context Changes Which Costs Grow
Not all parts of the model scale the same way with sequence length. When you double the context, some costs double and some more than double. Understanding which is which lets you predict where the bottleneck will be.
FFN cost scales linearly with token count. Each token is processed independently, so twice as many tokens means twice the FFN work. Simple.
Attention cost during prefill scales quadratically with sequence length. Each token attends to itself and every token before it, so the number of query-key pairs is about n²/2 and the total attention work grows as n². During decode, attention cost per step grows linearly with n_past (the cached sequence length), because the new token attends to all previous tokens.
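As a rough illustration of the two regimes, the sketch below counts query-key pairs under causal attention. The function names are hypothetical, and the count ignores heads, layers, and the per-pair cost, which only multiply the result by constants.

```python
def prefill_pairs(n: int) -> int:
    """Query-key pairs scored when prefilling n tokens with a causal mask.

    Token i (1-based) attends to i keys (itself and all earlier tokens),
    so the total is 1 + 2 + ... + n = n * (n + 1) / 2.
    """
    return n * (n + 1) // 2


def decode_step_pairs(n_past: int) -> int:
    """Query-key pairs scored for one decode step with n_past cached tokens.

    The single new token attends to every cached token plus itself.
    """
    return n_past + 1


if __name__ == "__main__":
    for n in (1_000, 2_000, 4_000):
        print(f"n={n:>5}  prefill pairs={prefill_pairs(n):>10,}  "
              f"next decode step={decode_step_pairs(n):>6,}")
```

Doubling n roughly quadruples the prefill count, while the cost of the next decode step only doubles.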
KV cache memory scales linearly with sequence length. Twice the context means twice the cache memory.
This means a 32,000-token prompt is not just "16,000 tokens twice." During prefill, its attention computation is roughly 4x that of a 16,000-token prompt (quadratic), while its FFN work is only 2x (linear), so attention takes a larger share of the total as the prompt grows.
Comparing costs at 1k, 4k, and 16k context:
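One way to make this comparison concrete is a back-of-the-envelope FLOP count for a single layer. The sketch below assumes a hypothetical model with d_model=2048 and a plain two-matrix FFN with d_ff=8192. These dimensions, the function names, and the choice to count only the score (QK^T) and weighted-sum (AV) matmuls over the full n × n matrix are illustrative assumptions, not measurements.

```python
D_MODEL = 2048   # hypothetical hidden size
D_FF = 8192      # hypothetical FFN inner size (4 * D_MODEL)


def ffn_flops(n: int) -> int:
    """FLOPs for one FFN layer over n tokens: up and down projections.

    Each projection multiplies an n-row activation matrix by a
    D_MODEL x D_FF (or D_FF x D_MODEL) weight matrix, costing
    2 * n * D_MODEL * D_FF FLOPs (one multiply and one add per term).
    """
    return 2 * (2 * n * D_MODEL * D_FF)


def attn_score_flops(n: int) -> int:
    """FLOPs for QK^T plus the attention-weighted sum of V, all heads.

    Summed over heads, each of the two matmuls costs 2 * n * n * D_MODEL
    FLOPs when the full n x n matrix is counted.
    """
    return 2 * (2 * n * n * D_MODEL)


if __name__ == "__main__":
    for n in (1024, 4096, 16384):
        ffn, attn = ffn_flops(n), attn_score_flops(n)
        print(f"n={n:>6}  FFN={ffn / 1e9:>8.1f} GFLOP  "
              f"attention={attn / 1e9:>8.1f} GFLOP  "
              f"attention/FFN={attn / ffn:.3f}")
```

Under these assumptions the attention-to-FFN ratio simplifies to n / d_ff: 0.125 at 1k, 0.5 at 4k, and 2.0 at 16k, with the crossover at n = 8192. Other model dimensions move the crossover but not the shape of the trend.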
At 1k, FFN dominates. At 16k, attention prefill dominates. The bottleneck shifts with context length.
The scaling difference comes from the attention score matrix:
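Written out for a single head and a single FFN projection, the shapes make the difference visible. The symbols d_head, d_model, d_ff, and W_1 are standard notation assumed here for illustration.

```latex
\begin{aligned}
Q, K &\in \mathbb{R}^{n \times d_{\text{head}}}, &
S &= \frac{Q K^{\top}}{\sqrt{d_{\text{head}}}} \in \mathbb{R}^{n \times n}, &
\text{FLOPs}(S) &\approx 2\, n^{2}\, d_{\text{head}}, \\
X &\in \mathbb{R}^{n \times d_{\text{model}}}, &
H &= X W_{1} \in \mathbb{R}^{n \times d_{\text{ff}}}, &
\text{FLOPs}(H) &\approx 2\, n\, d_{\text{model}}\, d_{\text{ff}}.
\end{aligned}
```

Doubling n quadruples the size and cost of S but only doubles the size and cost of H.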
In llama.cpp, the context size is set with -c. Flash attention (-fa) reduces the memory footprint of attention by not materializing the full [n, n] score matrix, but the compute still scales quadratically. Chunked prefill, controlled by the micro-batch size (-ub), helps manage peak memory when processing very long prompts by feeding the prompt through in pieces.
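To see why chunking helps with peak memory, consider a simplified model in which, without flash attention, processing a chunk of u tokens against n cached positions materializes a [u, n] score matrix per head. The sketch below compares that with the full [n, n] matrix. The function name, the 32-head count, the 512-token chunk, and the FP32 score type are illustrative assumptions, not a description of llama.cpp internals.

```python
def score_buffer_bytes(n_queries: int, n_keys: int,
                       n_heads: int = 32, bytes_per_score: int = 4) -> int:
    """Bytes needed to hold one layer's attention scores for all heads."""
    return n_queries * n_keys * n_heads * bytes_per_score


if __name__ == "__main__":
    n_ctx = 32_768     # prompt length in tokens
    chunk = 512        # hypothetical chunk (micro-batch) size

    full = score_buffer_bytes(n_ctx, n_ctx)
    # Worst case for chunked prefill: the last chunk sees the whole context.
    chunked = score_buffer_bytes(chunk, n_ctx)

    gib = 1024 ** 3
    print(f"full [n, n] scores:     {full / gib:.1f} GiB")
    print(f"chunked [u, n] scores:  {chunked / gib:.1f} GiB")
    print(f"reduction factor:       {full // chunked}x")
```

The total number of scores computed across all chunks is unchanged; only the amount held in memory at once shrinks.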
Context scaling is not just a theoretical concern — it directly determines what you can deploy:
- KV cache is the memory gate. A 32-layer model with 32 KV heads, d_head=128, and FP16 stores 2 (K and V) × 32 layers × 32 heads × 128 × 2 bytes = 512 KB per token in KV cache. At 128K context, that is ~64 GB of cache alone, which can exceed the memory of a single GPU. This is why architectures like GQA (grouped-query attention) and SWA (sliding-window attention) exist: GQA reduces the per-token cache size (fewer KV heads), while SWA bounds the total cache by evicting entries beyond the window.
- Attention compute is the prefill gate. At 128K context, the attention score matrix has 128K × 128K ≈ 16 billion entries per head per layer. This makes prefill slow, potentially many seconds for a long document. Flash attention avoids storing this matrix but still has to compute it, so it reduces memory but not compute cost.
- The bottleneck shifts with context length. At short contexts, FFN projections dominate the per-layer cost. As context grows, attention's O(n²) cost overtakes FFN and becomes the bottleneck during prefill. During decode, the work is memory-bandwidth-bound regardless of context length: weight reads dominate at short contexts, and KV cache reads take a growing share as the context lengthens. Knowing which regime you are in determines which optimizations matter.
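The KV cache figures in the first bullet can be checked with a few lines of arithmetic. The function name is hypothetical, and the 8-KV-head variant is an assumed GQA configuration added only for comparison.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             d_head: int, bytes_per_elem: int) -> int:
    """Bytes of KV cache added per token: one K vector and one V vector
    per KV head per layer."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_elem


if __name__ == "__main__":
    KIB, GIB = 1024, 1024 ** 3
    n_ctx = 128 * 1024  # 128K tokens

    mha = kv_cache_bytes_per_token(32, 32, 128, 2)   # FP16, 32 KV heads
    gqa = kv_cache_bytes_per_token(32, 8, 128, 2)    # assumed GQA: 8 KV heads

    for name, per_tok in (("32 KV heads", mha), ("8 KV heads", gqa)):
        print(f"{name}: {per_tok // KIB} KiB/token, "
              f"{per_tok * n_ctx / GIB:.0f} GiB at 128K context")
```

With 32 KV heads this gives 512 KiB per token and 64 GiB at 128K context; cutting to 8 KV heads divides both figures by four.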
At short contexts (under ~2K), FFN projections dominate per-layer cost during prefill, and once enough tokens are batched the GEMMs are compute-bound on GPUs. As context grows past ~8K, attention's quadratic cost overtakes FFN and becomes the main compute bottleneck during prefill; the exact crossover depends on the model's dimensions. During decode, the workload is memory-bandwidth-bound at any context length. The optimization strategy depends on which regime dominates your workload.
Why is a 32k-token prompt not just "16k tokens twice"?
If you double the sequence length, how does KV cache memory change (in terms of used cache entries)?