Sliding-Window Attention Limits What a Layer Can See
Standard causal attention lets every token attend to all previous tokens. This means the attention matrix grows with the square of the sequence length. For very long contexts, this cost becomes prohibitive.
Sliding-window attention (SWA) restricts each token to attending only to the most recent w tokens, where w is the window size. Tokens beyond the window are masked out, as if they do not exist for that layer.
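The masking rule can be made concrete with a small sketch. It assumes the convention used in this section, where the window of w tokens includes the query token itself; the function names are illustrative and not taken from any library.

```python
def sliding_window_mask(n: int, w: int) -> list[list[bool]]:
    """Return an n x n mask where mask[i][j] is True if query i may attend to key j.

    Query i sees keys j with i - w < j <= i: itself plus the w - 1 tokens before it.
    """
    if w < 1:
        raise ValueError("window size must be at least 1")
    return [[(i - w < j <= i) for j in range(n)] for i in range(n)]


def visible_positions(mask: list[list[bool]], i: int) -> list[int]:
    """List the key positions that query i is allowed to attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


mask = sliding_window_mask(n=8, w=3)
print(visible_positions(mask, 6))  # [4, 5, 6]
print(visible_positions(mask, 1))  # [0, 1]  (window not yet full)
```

In a real attention kernel the False entries would typically be filled with negative infinity before the softmax, or skipped entirely so that the dot products are never computed.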
A single SWA layer cannot see beyond its window. But when you stack multiple SWA layers, the effective receptive field grows: each additional layer extends the indirect reach by another w-1 positions through the residual stream.
Modern architectures often interleave local SWA layers with occasional full-attention layers. The local layers handle nearby context cheaply, while the full-attention layers maintain long-range connections.
A single SWA layer with window w can only access w positions (including itself). But information in a transformer propagates through the residual stream. After layer 1, token i's hidden state contains information from positions within its window. Layer 2 reads those hidden states, extending the indirect reach by another w-1 positions. After L layers of SWA, a token can indirectly access information from up to L×(w-1) positions back, which is a receptive field of 1 + L×(w-1) positions when the token itself is counted.
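As a rough check on that growth, the sketch below compares the closed form with a brute-force simulation that tracks which input positions can influence the last token. It measures reach as positions back from the token, L×(w-1); counting the token itself gives a receptive field of 1 + L×(w-1) positions. The helper names are hypothetical, and the simulation tracks only whether information can flow, not how well it survives.

```python
def reach_after_layers(num_layers: int, w: int) -> int:
    """Closed form: how many positions back a token can indirectly reach."""
    return num_layers * (w - 1)


def simulated_reach(num_layers: int, w: int, n: int) -> int:
    """Brute-force check on a sequence of n tokens, measured at the last token."""
    # sources[i] = input positions whose information can be present in token i's state
    sources = [{i} for i in range(n)]
    for _ in range(num_layers):
        new_sources = []
        for i in range(n):
            merged = set()
            # Token i reads the previous layer's states inside its window.
            for j in range(max(0, i - w + 1), i + 1):
                merged |= sources[j]
            new_sources.append(merged)
        sources = new_sources
    last = n - 1
    return last - min(sources[last])


print(reach_after_layers(4, 5), simulated_reach(4, 5, n=64))  # 16 16
```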
Indirect reach is not the same as direct access. Information from a distant token has to survive several hops through intermediate hidden states, and each hop compresses it. This is why pure SWA architectures (all layers using sliding windows) struggle with tasks that require precise long-range dependencies, such as coreference across a long document or recalling a fact stated 50K tokens ago. The interleaved pattern (some SWA, some full) addresses this: the full-attention layers provide direct long-range access, while SWA layers handle the cheaper local processing.
How you mix SWA and full-attention layers matters. Common patterns:
- Alternating (1:1): Every other layer is SWA. This gives frequent long-range access, but the saving is capped at roughly half of the attention cost, because half the layers still pay the full quadratic price. Used when long-range quality is critical.
- Sparse full (1:3 or 1:5): One full-attention layer for every 3-5 SWA layers. This saves significantly more compute and memory while maintaining some direct long-range paths. This is the pattern used in Gemma 4.
- Front-loaded: All full-attention layers in the first few positions, SWA for the rest. Early layers build a global context that later SWA layers refine locally. Less common but can be effective.
The choice depends on the model's target context length, the acceptable quality-cost tradeoff, and the training data distribution. There is no universally optimal pattern — it is determined empirically during model development.
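The three patterns above can be written down as simple layer schedules. This is only an illustrative sketch: the function names and the "swa"/"full" labels are hypothetical, and placing the full-attention layer at the end of each group is an arbitrary choice rather than any particular model's configuration.

```python
def layer_pattern(n_layers: int, swa_per_full: int) -> list[str]:
    """Repeat `swa_per_full` SWA layers followed by one full-attention layer."""
    period = swa_per_full + 1
    return ["full" if (i + 1) % period == 0 else "swa" for i in range(n_layers)]


def front_loaded(n_layers: int, n_full: int) -> list[str]:
    """Put all full-attention layers first, then SWA layers for the rest."""
    return ["full"] * n_full + ["swa"] * (n_layers - n_full)


print(layer_pattern(6, swa_per_full=1))   # alternating: swa, full, swa, full, ...
print(layer_pattern(12, swa_per_full=5))  # sparse full: layers 5 and 11 are full
print(front_loaded(6, n_full=2))          # full, full, swa, swa, swa, swa
```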
SWA provides two distinct savings, and understanding both is important:
1. Attention compute: O(n × w) instead of O(n²)
Full causal attention computes scores for every pair of positions up to position i: roughly n²/2 total scores. SWA computes scores for at most w positions per query: roughly n × w total scores. When w << n (e.g., w = 4,096 and n = 100,000), this is a massive reduction. For a 100K-token sequence, full attention computes ~5 billion score entries; SWA with w = 4K computes ~400 million — a 12.5× reduction.
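The figures above can be reproduced with exact counts instead of the n²/2 and n × w approximations. The sketch below counts one score per (query, visible key) pair for a single head in a single layer; the function names are illustrative.

```python
def full_causal_scores(n: int) -> int:
    """Token i (0-based) scores against i + 1 positions, so the total is n(n+1)/2."""
    return n * (n + 1) // 2


def swa_scores(n: int, w: int) -> int:
    """Each token scores against min(i + 1, w) positions."""
    if n <= w:
        return full_causal_scores(n)
    # The first w tokens have a partially filled window; the rest score exactly w keys.
    return full_causal_scores(w) + (n - w) * w


n, w = 100_000, 4_096
full, local = full_causal_scores(n), swa_scores(n, w)
print(f"full: {full:,}  swa: {local:,}  ratio: {full / local:.1f}x")
# full: 5,000,050,000  swa: 401,213,440  ratio: 12.5x
```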
2. KV cache: bounded by w instead of growing with n
For full attention, the KV cache must store entries for all past tokens — it grows linearly with context. For an SWA layer, entries older than w positions back are never read again and can be evicted. The cache is bounded: at most w entries per head per layer, regardless of how long the sequence becomes. This is why SWA layers are particularly valuable for long-context serving — they cap memory growth.
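The eviction behaviour can be sketched with a fixed-size queue. This is a toy model for one head in one layer, assuming keys and values are opaque objects; the class name is hypothetical, and production implementations generally use preallocated tensor ring buffers rather than a Python deque.

```python
from collections import deque


class SlidingWindowKVCache:
    """Toy KV cache for one SWA head: keeps only the most recent `window` entries."""

    def __init__(self, window: int):
        # A deque with maxlen drops the oldest entry automatically on overflow.
        self.entries = deque(maxlen=window)

    def append(self, position: int, key, value) -> None:
        self.entries.append((position, key, value))

    def positions(self) -> list[int]:
        return [pos for pos, _, _ in self.entries]

    def __len__(self) -> int:
        return len(self.entries)


cache = SlidingWindowKVCache(window=3)
for pos in range(8):
    cache.append(pos, key=f"k{pos}", value=f"v{pos}")

print(len(cache), cache.positions())  # 3 [5, 6, 7]
```

However many tokens are appended, the cache never holds more than `window` entries, which is the bound described above.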
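Sequence of 8 tokens (positions 0-7), window size w = 3. Token at position 6 can attend to positions 4, 5, and 6 (itself); position 7 is in the future and is masked by causality.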
Positions 0-3 are invisible to this layer. The cost of attention at this position drops from 7 dot products to 3.
In llama.cpp, models that use SWA set a per-layer attention window size. The KV cache management respects this window, evicting entries that fall outside it. Some models interleave SWA layers with full-attention layers.
SWA layers save both compute and memory. Compute: fewer dot products per token. Memory: the KV cache for an SWA layer holds at most w entries regardless of sequence length. For a 128K context with w = 4,096, an SWA layer uses 32× less KV cache than a full-attention layer.
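To put the 32× figure in bytes, here is a back-of-the-envelope calculation for one layer. The layer shape (8 KV heads, head dimension 128, 2 bytes per value) is a hypothetical choice for illustration only; the ratio depends solely on context length and window size.

```python
def kv_cache_bytes(cached_tokens: int, n_kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Bytes for one layer's cache: keys plus values for every cached token."""
    return 2 * cached_tokens * n_kv_heads * head_dim * bytes_per_value


context = 128 * 1024  # 131,072 tokens
window = 4096
# Hypothetical layer shape, chosen only for illustration.
n_kv_heads, head_dim = 8, 128

full_layer = kv_cache_bytes(context, n_kv_heads, head_dim)
swa_layer = kv_cache_bytes(min(context, window), n_kv_heads, head_dim)

print(full_layer // 2**20, "MiB vs", swa_layer // 2**20, "MiB")  # 512 MiB vs 16 MiB
print(full_layer // swa_layer)  # 32
```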
A model alternates SWA layers (w=4096) with full-attention layers. In a 20,000-token sequence, can the token at position 18,000 directly attend to position 1,000 within an SWA layer?
Why does SWA reduce the KV cache size for a layer?