M7/Performance
L34

Compute-Bound and Memory-Bound

18 min
How do you tell if a workload is compute-bound or memory-bound?

Every operation needs two things: it needs to move data (load weights, read inputs, write outputs) and it needs to do arithmetic (multiply, add). Hardware has a maximum rate for each: memory bandwidth (bytes/second) and compute throughput (FLOPs/second).

If the operation finishes its arithmetic before the data has arrived, it is memory-bound — the compute units sit idle waiting for memory. If data arrives faster than the hardware can process it, the operation is compute-bound — the memory system sits idle waiting for the ALUs.

The dividing line is the hardware's operational intensity threshold: the ratio of peak FLOPs/second to peak bytes/second. If your operation's arithmetic intensity (FLOPs per byte moved) is below this threshold, it is memory-bound. If above, it is compute-bound.

A key insight: an operation can have very few total FLOPs and still be slow if it is memory-bound. A single decode step through a 7B-parameter model does relatively little arithmetic per weight, but must load gigabytes of weights through the memory bus.

The roofline model is a visual tool for understanding performance limits. It plots achievable performance (FLOP/s) against arithmetic intensity (FLOP/byte). The plot has two regimes:

Performance
(FLOP/s)
|
| _______________ ← peak FLOP/s (flat ceiling)
| /
| / ← memory bandwidth slope
| /
| /
| /
+------|------------------→ Arithmetic Intensity (FLOP/byte)
^
machine balance point

On the left slope, performance is limited by memory bandwidth: the hardware can compute faster than data arrives. Every byte you save (through quantization, smaller types, better layout) directly improves speed. On the right flat region, performance is limited by compute: data arrives faster than the hardware can process it. Better algorithms, SIMD utilization, and kernel optimization help here.

The machine balance point (also called the "ridge point") is where the slope meets the ceiling. Its x-coordinate is peak_FLOP/s ÷ peak_bandwidth — the hardware's characteristic ratio. Operations to the left of this point are memory-bound; operations to the right are compute-bound.

The same weight matrix, the same kernel — but with different batch sizes, the operation can be either memory-bound or compute-bound. This is the single most important performance concept in inference:

Weight matrix W: [4096, 4096] in FP16 = 4096 × 4096 × 2 bytes = 32 MiB
Each token: 2 × 4096 × 4096 ≈ 33.6M FLOPs (2 for multiply-add)
Batch = 1 token: 33.6M FLOPs / 32 MiB ≈ 1 FLOP/byte → memory-bound
Batch = 32 tokens: 32 × 33.6M / 32 MiB ≈ 32 FLOP/byte
Batch = 256 tokens: 256 × 33.6M / 32 MiB ≈ 256 FLOP/byte → compute-bound
The weight matrix is loaded once regardless of batch size. More tokens = more FLOPs per byte loaded.

This is why prefill (hundreds of tokens) is compute-bound and decode (one token) is memory-bound, even though they execute the same operations on the same weights. The batch size determines which resource is the bottleneck.

It also explains why batching multiple users' decode steps together (continuous batching) improves throughput: combining 32 single-token decodes into one batch of 32 shifts the arithmetic intensity from ~1 to ~32 FLOP/byte, potentially moving the operation from deep in the memory-bound region toward the ridge point.

Machine balance varies enormously across hardware. This determines which operations are memory-bound on your specific system:

Machine balance = Peak FLOP/s ÷ Peak Bandwidth
Apple M2 Ultra:   ~27 TFLOP/s FP16 / 800 GB/s = ~34 FLOP/byte
NVIDIA A100 80GB: ~312 TFLOP/s FP16 / 2039 GB/s = ~153 FLOP/byte
AMD EPYC (DDR5):  ~2.4 TFLOP/s FP32 / 460 GB/s = ~5 FLOP/byte

On the A100, an operation needs 153+ FLOP/byte to be compute-bound — single-token decode at ~1 FLOP/byte is 150× below the ridge point. On a CPU with lower balance (~5), the same operation is still memory-bound but less extremely so.

Hardware: 10 TFLOP/s compute, 100 GB/s memory bandwidth.
Threshold (machine balance) = Peak FLOP/s ÷ Peak bytes/s = 10T ÷ 100G = 100 FLOP/byte.
If an operation does more than 100 FLOPs per byte loaded, compute is the bottleneck. If fewer, memory is.

Operation A: Large GEMM
FLOPs: 2 billion  |  Bytes moved: 8 MB
Intensity: 2B / 8M = 250 FLOP/byte
250 > 100 → compute-bound
Time ≈ 2B / 10T = 0.2 ms (limited by ALUs)
Operation B: GEMV (decode projection)
FLOPs: 33 million  |  Bytes moved: 32 MB
Intensity: 33M / 32M = ~1 FLOP/byte
1 < 100 → memory-bound
Time ≈ 32M / 100G = 0.32 ms (limited by bandwidth)

Operation B does 60× fewer FLOPs than A but takes longer, because it is memory-bound.

The roofline model divides workloads into two regimes:
Memory-bound: intensity < peak_FLOPS / peak_bandwidth
Compute-bound: intensity > peak_FLOPS / peak_bandwidth
GEMV is almost always memory-bound. Large GEMM is almost always compute-bound. Attention score computation falls in between depending on sequence length.
arithmetic_intensity = total_FLOPs / total_bytes_moved
machine_balance = peak_FLOPs_per_sec / peak_bytes_per_sec
if arithmetic_intensity < machine_balance: memory-bound
if arithmetic_intensity > machine_balance: compute-bound
Achievable performance = min(peak_FLOPS, arithmetic_intensity × peak_bandwidth)

In llama.cpp, whether a ggml_mul_mat call ends up compute-bound or memory-bound depends on the input batch size. The same kernel, the same weights — but with n_tokens = 1 (decode) the operation is memory-bound, and with n_tokens = 512 (prefill) it is compute-bound. Backend code may choose different execution strategies based on these shapes.

Correctly classifying your bottleneck changes your optimization strategy entirely. For compute-bound work, you want better SIMD utilization, tiling, and instruction scheduling. For memory-bound work, you want to reduce bytes moved: quantization, smaller data types, and better memory access patterns. Applying the wrong optimization category wastes effort.

Check Yourself
reasoningQ1

An operation does 50 million FLOPs and moves 200 MB of data. On hardware with a machine balance of 100 FLOP/byte, is it compute-bound or memory-bound?

reasoningQ2

You quantize weights from FP16 to INT8, halving the bytes moved. For a memory-bound decode step, what speedup do you expect? For a compute-bound prefill step?