L31

Long Context Changes Which Costs Grow

15 min

Question

Why does longer context change costs?

Intuition

Not all parts of the model scale the same way with sequence length. When you double the context, some costs double and some more than double. Understanding which is which lets you predict where the bottleneck will be.

FFN cost scales linearly with token count. Each token is processed independently, so twice as many tokens means twice the FFN work. Simple.

Attention cost during prefill scales quadratically with sequence length. Each token attends to all tokens before it, so the total attention work grows as n². During decode, attention cost per step grows linearly with n_past (the cached sequence length), because the new token attends to all previous tokens.

KV cache memory scales linearly with sequence length. Twice the context means twice the cache memory.

This means a 32,000-token prompt is not just "16,000 tokens twice." The attention computation during prefill is roughly 4x that of a 16,000-token prompt (quadratic), while the FFN work is only 2x (linear). The bottleneck shifts depending on the context length.
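To make the prefill and decode claims concrete, the sketch below counts query-key pairs under causal attention. The function names are hypothetical, and a pair count is only a stand-in for real cost: it ignores the number of heads, d_head, and all constant factors.

```python
def prefill_pairs(n: int) -> int:
    """Query-key pairs scored when n prompt tokens are processed in one pass.

    Under causal attention, token i attends to itself and the i - 1 tokens
    before it, so the total is 1 + 2 + ... + n = n * (n + 1) / 2.
    """
    return n * (n + 1) // 2


def decode_step_pairs(n_past: int) -> int:
    """Pairs scored by one decode step: the new token against the cache, plus itself."""
    return n_past + 1


if __name__ == "__main__":
    prefill_ratio = prefill_pairs(32_000) / prefill_pairs(16_000)
    decode_ratio = decode_step_pairs(32_000) / decode_step_pairs(16_000)
    print(f"prefill, 32k vs 16k prompt: {prefill_ratio:.2f}x")
    print(f"one decode step, 32k vs 16k cached: {decode_ratio:.2f}x")
```

Doubling the prompt roughly quadruples the prefill pair count, while a single decode step only doubles, which matches the quadratic and linear behavior described above.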

Toy Example

Comparing costs at 1k, 4k, and 16k context:

Component         1k tokens   4k tokens   16k tokens
FFN               1x          4x          16x
Attn (prefill)    1x          16x         256x
KV cache          1x          4x          16x

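Each factor is relative to that component's own cost at 1k, so the table shows growth rates, not absolute costs. At 1k, FFN is the larger cost. By 16k, attention's 256x growth has made attention prefill the dominant cost. The bottleneck shifts with context length.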
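The growth factors in the table follow directly from each component's scaling exponent. A minimal sketch, assuming a 1k baseline and hypothetical names (growth, COMPONENTS) chosen for illustration:

```python
def growth(n: int, base: int, exponent: int) -> float:
    """Cost at n tokens relative to cost at `base` tokens for an O(n**exponent) term."""
    return (n / base) ** exponent


BASE = 1_000
# Scaling exponent in n for each component, taken from the text above.
COMPONENTS = {"FFN": 1, "Attn (prefill)": 2, "KV cache": 1}

if __name__ == "__main__":
    for name, exponent in COMPONENTS.items():
        factors = [f"{growth(n, BASE, exponent):g}x" for n in (1_000, 4_000, 16_000)]
        print(f"{name:<16}", *factors)
```

These are ratios against each component's own baseline, so they say nothing about which component is larger in absolute terms at 1k.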

Shapes

FFN work per layer: O(n × d_model × d_ff)

Attention work per layer (prefill): O(n² × d_model) (summed across all heads)

KV cache memory: O(n × n_layers × n_kv_heads × d_head)

n² for attention vs n for everything else. This divergence drives long-context behavior.
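To see roughly where the two curves cross, the sketch below turns the big-O shapes into FLOP estimates. It rests on simplifying assumptions that are not from the lesson: 2 FLOPs per multiply-add, an FFN with two weight matrices (up and down projection), attention counted as Q·Kᵀ plus the scores·V product with no discount for causal masking, and the Q/K/V/output projections ignored. The dimensions are illustrative values, not a statement about any particular model.

```python
def ffn_flops(n: int, d_model: int, d_ff: int) -> int:
    """Two matmuls per layer: [n, d_model] x [d_model, d_ff], then [n, d_ff] x [d_ff, d_model]."""
    return 2 * (2 * n * d_model * d_ff)


def attn_flops(n: int, d_model: int) -> int:
    """Q.K^T and scores.V, each about 2 * n * n * d_model FLOPs summed across heads."""
    return 2 * (2 * n * n * d_model)


# Assumed dimensions, for illustration only.
D_MODEL = 4096
D_FF = 11008

if __name__ == "__main__":
    for n in (1_024, 4_096, 16_384, 32_768):
        ratio = attn_flops(n, D_MODEL) / ffn_flops(n, D_MODEL, D_FF)
        print(f"n={n:>6}: attention / FFN = {ratio:.2f}")
```

Under these assumptions the ratio simplifies to n / d_ff, so attention overtakes FFN once n exceeds d_ff. A gated FFN with three matrices, or counting only the causal half of the score matrix, would move the crossover to a longer context.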

Math

The scaling difference comes from the attention score matrix:

Attention scores: Q · Kᵀ = [n, d_head] · [d_head, n] = [n, n]

n² entries. Double n → 4x entries.

FFN: X · W = [n, d_model] · [d_model, d_ff] = [n, d_ff]

n × d_ff entries. Double n → 2x entries.

Implementation Hook

In llama.cpp, the context size is set with -c. Flash attention (-fa) reduces the memory footprint of attention by not materializing the full [n, n] score matrix, but the compute still scales quadratically. Chunked prefill (controlled by ubatch size) helps manage peak memory when processing very long prompts.

common/arg.cpp — context size and flash attention flags

What This Means for Deployment

Context scaling is not just a theoretical concern — it directly determines what you can deploy:

KV cache is the memory gate. A 32-layer model with 32 KV heads, d_head=128, and FP16 stores ~512 KB per token in KV cache. At 128K context, that is ~64 GB of cache alone — potentially exceeding GPU memory. This is why architectures like GQA and SWA exist: GQA reduces the per-token cache size (fewer KV heads), while SWA bounds the total cache by evicting entries beyond the window.
Attention compute is the prefill gate. At 128K context, the attention score matrices are 128K × 128K ≈ 17 billion entries per head per layer. This makes prefill slow, potentially many seconds for a long document. Flash attention reduces memory but not this compute cost.
The bottleneck shifts with context length. At short contexts, FFN projections dominate the per-layer cost. As context grows, attention's O(n²) cost overtakes FFN and becomes the bottleneck during prefill. During decode, the workload is memory-bandwidth-bound at any context length: weight reads dominate at short contexts, and KV cache reads take a growing share as the cache lengthens. Knowing which regime you are in determines which optimizations matter.
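The KV cache figures above can be checked with a few lines of arithmetic. This is a minimal sketch with hypothetical function names. It assumes one K and one V vector per KV head per layer, 2 bytes per FP16 element, and 128K = 131,072 tokens. The 8-KV-head configuration is an assumed GQA example, not a figure from the lesson.

```python
KIB = 1024
GIB = 1024 ** 3


def kv_bytes_per_token(n_layers: int, n_kv_heads: int, d_head: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token adds: K and V (factor 2) for every layer and KV head."""
    return 2 * n_layers * n_kv_heads * d_head * bytes_per_elem


def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int, d_head: int,
                   bytes_per_elem: int = 2) -> int:
    """Total cache size, which grows linearly with the number of cached tokens."""
    return n_tokens * kv_bytes_per_token(n_layers, n_kv_heads, d_head, bytes_per_elem)


if __name__ == "__main__":
    context = 128 * 1024
    for label, n_kv_heads in (("32 KV heads", 32), ("8 KV heads (assumed GQA)", 8)):
        per_token = kv_bytes_per_token(32, n_kv_heads, 128)
        total = kv_cache_bytes(context, 32, n_kv_heads, 128)
        print(f"{label}: {per_token // KIB} KiB per token, {total / GIB:.0f} GiB at 128K")
```

Cutting the KV head count by 4x cuts both the per-token size and the total cache by 4x, which is the lever GQA pulls.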

Performance Hook

At short contexts (under ~2K), FFN projections dominate per-layer cost during prefill, and with enough tokens the GEMMs are still compute-bound on GPUs. As context grows past roughly 8K (the exact crossover depends on the model's dimensions), attention's quadratic cost overtakes FFN and becomes the main compute bottleneck during prefill. During decode, the workload is memory-bandwidth-bound at any context length. The optimization strategy depends on which regime dominates your workload.

Check Yourself

Q1 (conceptual)

Why is a 32k-token prompt not just "16k tokens twice"?

(a) Because the model uses different weights for longer prompts
(b) Because attention scales quadratically: doubling context roughly quadruples attention work during prefill, while FFN work only doubles
(c) Because the tokenizer produces different tokens for longer texts

Q2 (shape)

If you double the sequence length, how does KV cache memory change (in terms of used cache entries)?

(a) It quadruples (4x)
(b) It doubles (2x)
(c) It stays the same: cache size is fixed at allocation