L22

Sliding-Window Attention Limits What a Layer Can See

14 min

Question

What does a sliding window limit?

Intuition

Standard causal attention lets every token attend to all previous tokens. This means the attention matrix grows with the square of the sequence length. For very long contexts, this cost becomes prohibitive.

Sliding-window attention (SWA) restricts each token to attending only to the most recent w tokens (itself included), where w is the window size. Tokens beyond the window are masked out, as if they do not exist for that layer.

A single SWA layer cannot see beyond its window. But when you stack multiple SWA layers, the effective receptive field grows: each additional layer extends the indirect reach by another w-1 positions through the residual stream.

Modern architectures often interleave local SWA layers with occasional full-attention layers. The local layers handle nearby context cheaply, while the full-attention layers maintain long-range connections.

Receptive Field Stacking

A single SWA layer with window w can only access w positions (including itself). But information in a transformer propagates through the residual stream. After layer 1, token i's hidden state contains information from positions within its window. Layer 2 reads those hidden states, extending the indirect reach by another w-1 positions. After L layers of SWA, a token's receptive field covers up to 1 + L×(w-1) positions: the token itself plus L×(w-1) positions back.

Example: 6 SWA layers, w = 4096

Effective receptive field: 1 + 6 × 4095 = 24,571 tokens

But this is indirect access — information must pass through multiple transformations. The signal from a distant token is weaker and more diffused than direct attention would provide. This is the fundamental tradeoff of SWA: cheaper but with degraded long-range fidelity.
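The receptive-field arithmetic above can be checked with a few lines of Python. This is a minimal sketch; the function names `swa_receptive_field` and `earliest_reachable` are illustrative and do not come from any library.

```python
def swa_receptive_field(num_layers: int, window: int) -> int:
    """Positions (the token itself included) that can influence a token
    after `num_layers` stacked SWA layers with window size `window`."""
    if num_layers < 0 or window < 1:
        raise ValueError("num_layers must be >= 0 and window must be >= 1")
    # Each layer extends the reach by (window - 1); the +1 is the token itself.
    return 1 + num_layers * (window - 1)


def earliest_reachable(position: int, num_layers: int, window: int) -> int:
    """Earliest position whose information can reach `position`, clamped at 0."""
    return max(0, position - num_layers * (window - 1))


print(swa_receptive_field(6, 4096))   # 24571
print(earliest_reachable(6, 2, 3))    # 2
```

The formula only bounds which positions can contribute at all. It says nothing about how much of the distant signal survives the intermediate layers.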

This is why pure SWA architectures (all layers using sliding windows) struggle with tasks that require precise long-range dependencies — coreference across a long document, or recalling a fact stated 50K tokens ago. The interleaved pattern (some SWA, some full) addresses this: the full-attention layers provide direct long-range access, while SWA layers handle the cheaper local processing.

Interleaving Patterns

How you mix SWA and full-attention layers matters. Common patterns:

Alternating (1:1): Every other layer is SWA. This gives frequent long-range access, but because half the layers still pay the full quadratic cost, it at best roughly halves the total attention cost. Used when long-range quality is critical.
Sparse full (1:3 or 1:5): One full-attention layer for every 3-5 SWA layers. This saves significantly more compute and memory while maintaining some direct long-range paths. This is the pattern used in Gemma 4.
Front-loaded: All full-attention layers in the first few positions, SWA for the rest. Early layers build a global context that later SWA layers refine locally. Less common but can be effective.

The choice depends on the model's target context length, the acceptable quality-cost tradeoff, and the training data distribution. There is no universally optimal pattern — it is determined empirically during model development.
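The three patterns above can be written down as per-layer type lists. The sketch below is illustrative only: the helper names are hypothetical, and placing the full-attention layer at the end of each repeating block is an assumption, since real models differ in where they put it.

```python
from typing import List


def repeating_pattern(num_layers: int, swa_per_full: int) -> List[str]:
    """Repeat a block of `swa_per_full` SWA layers followed by one full layer.

    Putting the full-attention layer last in each block is an assumption
    made for this sketch, not a rule taken from any particular model.
    """
    if swa_per_full < 0:
        raise ValueError("swa_per_full must be >= 0")
    block = ["swa"] * swa_per_full + ["full"]
    return [block[i % len(block)] for i in range(num_layers)]


def front_loaded_pattern(num_layers: int, num_full: int) -> List[str]:
    """All full-attention layers first, SWA for the remaining layers."""
    if not 0 <= num_full <= num_layers:
        raise ValueError("num_full must be between 0 and num_layers")
    return ["full"] * num_full + ["swa"] * (num_layers - num_full)


print(repeating_pattern(6, 1))     # alternating 1:1
print(repeating_pattern(12, 5))    # one full layer per five SWA layers
print(front_loaded_pattern(6, 2))  # two full layers, then four SWA layers
```

Counting the `"full"` entries in each list gives a quick estimate of how many layers still carry a cache that grows with context length.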

Why SWA Saves Memory and Compute

SWA provides two distinct savings, and understanding both is important:

1. Attention compute: O(n × w) instead of O(n²)

Full causal attention computes scores for every pair of positions up to position i: roughly n²/2 total scores. SWA computes scores for at most w positions per query: roughly n × w total scores. When w << n (e.g., w = 4,096 and n = 100,000), this is a massive reduction. For a 100K-token sequence, full attention computes ~5 billion score entries; SWA with w = 4K computes ~400 million — a 12.5× reduction.
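The score counts quoted above are approximations. A small sketch can count the pairs exactly, including the first w queries, which see fewer than w keys. The function names are illustrative.

```python
def full_causal_scores(n: int) -> int:
    """Exact number of (query, key) pairs under a causal mask: n(n+1)/2."""
    return n * (n + 1) // 2


def swa_scores(n: int, w: int) -> int:
    """Exact number of pairs when each query sees at most `w` keys."""
    if n <= w:
        # The window never truncates anything, so SWA equals full causal.
        return full_causal_scores(n)
    # The first w queries see 1, 2, ..., w keys; the remaining n - w see exactly w.
    return full_causal_scores(w) + (n - w) * w


n, w = 100_000, 4096
full = full_causal_scores(n)
local = swa_scores(n, w)
print(full, local, round(full / local, 2))
```

For n = 100,000 and w = 4,096 the exact counts come out close to the rounded figures in the text: about 5 billion versus about 400 million.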

2. KV cache: bounded by w instead of growing with n

For full attention, the KV cache must store entries for all past tokens — it grows linearly with context. For an SWA layer, entries older than w positions back are never read again and can be evicted. The cache is bounded: at most w entries per head per layer, regardless of how long the sequence becomes. This is why SWA layers are particularly valuable for long-context serving — they cap memory growth.
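The eviction rule can be illustrated with a toy rolling cache. This is a sketch of the idea only, not how llama.cpp or any other runtime lays out its cache; the `RollingKVCache` class is hypothetical and stores placeholder strings instead of tensors.

```python
from collections import deque


class RollingKVCache:
    """Toy cache for one head of one SWA layer: keeps the last `window` entries."""

    def __init__(self, window: int):
        if window < 1:
            raise ValueError("window must be >= 1")
        # A deque with maxlen drops its oldest item automatically once full.
        self.keys = deque(maxlen=window)
        self.values = deque(maxlen=window)

    def append(self, key, value) -> None:
        self.keys.append(key)
        self.values.append(value)

    def __len__(self) -> int:
        return len(self.keys)


cache = RollingKVCache(window=3)
for pos in range(8):
    cache.append(f"k{pos}", f"v{pos}")

print(len(cache), list(cache.keys))  # 3 ['k5', 'k6', 'k7']
```

However many tokens are appended, the cache length never exceeds the window, which is the bounded-memory property described above.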

Toy Example

Sequence of 8 tokens, window size w = 3. Token at position 6 can attend to:

full causal: positions [0, 1, 2, 3, 4, 5, 6] (7 tokens)

SWA (w=3): positions [4, 5, 6] (3 tokens)

Positions 0-3 are invisible to this layer. The cost of attention at this position drops from 7 dot products to 3.
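The toy example can be reproduced with a short helper. The name `visible_positions` is illustrative; passing `window=None` stands for full causal attention.

```python
from typing import List, Optional


def visible_positions(i: int, window: Optional[int] = None) -> List[int]:
    """Positions that the query at index `i` may attend to.

    window=None means full causal attention; otherwise only the most
    recent `window` positions (the query itself included) are visible.
    """
    start = 0 if window is None else max(0, i - window + 1)
    return list(range(start, i + 1))


print(visible_positions(6))            # [0, 1, 2, 3, 4, 5, 6]
print(visible_positions(6, window=3))  # [4, 5, 6]
print(visible_positions(1, window=3))  # [0, 1]
```

The last call shows the edge case at the start of a sequence: early tokens see fewer than w positions because nothing exists before position 0.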

Shapes

Full attention scores: [n_tokens, n_tokens] (quadratic in sequence length)

SWA attention scores: [n_tokens, w] (linear in sequence length)

KV cache per SWA layer also shrinks: only the last w entries are needed.

Math

score(i, j) = Qᵢ · Kⱼ only if i - w < j ≤ i

All other positions get -∞ before softmax (masked out).
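The masking rule can be sketched in plain Python: build a boolean mask from the condition i - w < j ≤ i, replace disallowed scores with -∞, and apply a row-wise softmax. The function names are illustrative, and the all-zero score matrix merely stands in for real Q·K products.

```python
import math
from typing import List


def swa_mask(n: int, w: int) -> List[List[bool]]:
    """mask[i][j] is True when query i may attend to key j (i - w < j <= i)."""
    return [[(i - w < j <= i) for j in range(n)] for i in range(n)]


def masked_softmax(scores: List[List[float]], mask: List[List[bool]]) -> List[List[float]]:
    """Row-wise softmax after setting disallowed scores to -inf."""
    out = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else -math.inf for s, ok in zip(row, allowed)]
        m = max(masked)  # finite, because j == i is always allowed when w >= 1
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


n, w = 8, 3
scores = [[0.0] * n for _ in range(n)]  # placeholder scores
weights = masked_softmax(scores, swa_mask(n, w))
print([round(x, 3) for x in weights[6]])
```

With uniform scores, row 6 puts equal weight on positions 4, 5, and 6 and exactly zero everywhere else, matching the toy example.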

Implementation Hook

In llama.cpp, models that use SWA set a per-layer attention window size. The KV cache management respects this window, evicting entries that fall outside it. Some models interleave SWA layers with full-attention layers.

src/llama-graph.cpp — build_attn()

Performance Hook

SWA layers save both compute and memory. Compute: fewer dot products per token. Memory: the KV cache for an SWA layer holds at most w entries regardless of sequence length. For a 128K (131,072-token) context with w = 4096, an SWA layer uses 32× less KV cache than a full-attention layer.
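To put the 32× figure in bytes, the sketch below estimates per-layer KV cache size. The layer shape (8 KV heads, head dimension 128, 2-byte elements) is a hypothetical example chosen for illustration, not the configuration of any specific model, and 128K is taken to mean 131,072 tokens.

```python
from typing import Optional


def kv_cache_bytes(cached_tokens: int, n_kv_heads: int, head_dim: int,
                   bytes_per_elem: int = 2) -> int:
    """Bytes held by one layer's KV cache; the leading 2 covers keys and values."""
    return 2 * cached_tokens * n_kv_heads * head_dim * bytes_per_elem


def layer_cached_tokens(context_len: int, window: Optional[int] = None) -> int:
    """Tokens a layer must keep: everything for full attention, at most `window` for SWA."""
    return context_len if window is None else min(context_len, window)


# Hypothetical layer shape, assumed only for this example.
shape = dict(n_kv_heads=8, head_dim=128, bytes_per_elem=2)
context = 128 * 1024

full = kv_cache_bytes(layer_cached_tokens(context), **shape)
swa = kv_cache_bytes(layer_cached_tokens(context, window=4096), **shape)
print(full // 2**20, swa // 2**20, full // swa)  # 512 16 32 (MiB, MiB, ratio)
```

The ratio depends only on context length and window size. The head count, head dimension, and element size cancel out.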

Check Yourself

Q1 (reasoning)

A model alternates SWA layers (w=4096) with full-attention layers. In a 20,000-token sequence, can the token at position 18,000 directly attend to position 1,000 within an SWA layer?

No: position 1,000 is 17,000 tokens back, far outside the window of 4,096
Yes: the alternating full-attention layers propagate that information into the hidden state
Yes: the window wraps around for very long sequences

Q2 (conceptual)

Why does SWA reduce the KV cache size for a layer?

It uses fewer KV heads
It only needs to store the last w key/value entries, not the full sequence
It compresses the K and V vectors to a smaller dimension