L34

Compute-Bound and Memory-Bound

18 min

Question

How do you tell if a workload is compute-bound or memory-bound?

Intuition

Every operation needs two things: it needs to move data (load weights, read inputs, write outputs) and it needs to do arithmetic (multiply, add). Hardware has a maximum rate for each: memory bandwidth (bytes/second) and compute throughput (FLOPs/second).

If the operation finishes its arithmetic before the data has arrived, it is memory-bound — the compute units sit idle waiting for memory. If data arrives faster than the hardware can process it, the operation is compute-bound — the memory system sits idle waiting for the ALUs.

The dividing line is the hardware's operational intensity threshold: the ratio of peak FLOPs/second to peak bytes/second. If your operation's arithmetic intensity (FLOPs per byte moved) is below this threshold, it is memory-bound. If above, it is compute-bound.

A key insight: an operation can have very few total FLOPs and still be slow if it is memory-bound. A single decode step through a 7B-parameter model does relatively little arithmetic per weight, but must load gigabytes of weights through the memory bus.

The Roofline Model

The roofline model is a visual tool for understanding performance limits. It plots achievable performance (FLOP/s) against arithmetic intensity (FLOP/byte). The plot has two regimes:

Performance

(FLOP/s)

| _______________ ← peak FLOP/s (flat ceiling)

| /

| / ← memory bandwidth slope

| /

+------|------------------→ Arithmetic Intensity (FLOP/byte)

machine balance point

On the left slope, performance is limited by memory bandwidth: the hardware can compute faster than data arrives. Every byte you save (through quantization, smaller types, better layout) directly improves speed. On the right flat region, performance is limited by compute: data arrives faster than the hardware can process it. Better algorithms, SIMD utilization, and kernel optimization help here.

The machine balance point (also called the "ridge point") is where the slope meets the ceiling. Its x-coordinate is peak_FLOP/s ÷ peak_bandwidth — the hardware's characteristic ratio. Operations to the left of this point are memory-bound; operations to the right are compute-bound.

How Batch Size Shifts the Regime

The same weight matrix, the same kernel — but with different batch sizes, the operation can be either memory-bound or compute-bound. This is the single most important performance concept in inference:

Weight matrix W: [4096, 4096] in FP16 = 4096 × 4096 × 2 bytes = 32 MiB

Each token: 2 × 4096 × 4096 ≈ 33.6M FLOPs (2 for multiply-add)

Batch = 1 token: 33.6M FLOPs / 32 MiB ≈ 1 FLOP/byte → memory-bound

Batch = 32 tokens: 32 × 33.6M / 32 MiB ≈ 32 FLOP/byte

Batch = 256 tokens: 256 × 33.6M / 32 MiB ≈ 256 FLOP/byte → compute-bound

The weight matrix is loaded once regardless of batch size. More tokens = more FLOPs per byte loaded.

This is why prefill (hundreds of tokens) is compute-bound and decode (one token) is memory-bound, even though they execute the same operations on the same weights. The batch size determines which resource is the bottleneck.

It also explains why batching multiple users' decode steps together (continuous batching) improves throughput: combining 32 single-token decodes into one batch of 32 shifts the arithmetic intensity from ~1 to ~32 FLOP/byte, potentially moving the operation from deep in the memory-bound region toward the ridge point.

Real Hardware Numbers

Machine balance varies enormously across hardware. This determines which operations are memory-bound on your specific system:

Machine balance = Peak FLOP/s ÷ Peak Bandwidth

Apple M2 Ultra: ~27 TFLOP/s FP16 / 800 GB/s = ~34 FLOP/byte

NVIDIA A100 80GB: ~312 TFLOP/s FP16 / 2039 GB/s = ~153 FLOP/byte

AMD EPYC (DDR5): ~2.4 TFLOP/s FP32 / 460 GB/s = ~5 FLOP/byte

On the A100, an operation needs 153+ FLOP/byte to be compute-bound — single-token decode at ~1 FLOP/byte is 150× below the ridge point. On a CPU with lower balance (~5), the same operation is still memory-bound but less extremely so.

Toy Example

Hardware: 10 TFLOP/s compute, 100 GB/s memory bandwidth.
Threshold (machine balance) = Peak FLOP/s ÷ Peak bytes/s = 10T ÷ 100G = 100 FLOP/byte.
If an operation does more than 100 FLOPs per byte loaded, compute is the bottleneck. If fewer, memory is.

Operation A: Large GEMM

FLOPs: 2 billion | Bytes moved: 8 MB

Intensity: 2B / 8M = 250 FLOP/byte

250 > 100 → compute-bound

Time ≈ 2B / 10T = 0.2 ms (limited by ALUs)

Operation B: GEMV (decode projection)

FLOPs: 33 million | Bytes moved: 32 MB

Intensity: 33M / 32M = ~1 FLOP/byte

1 < 100 → memory-bound

Time ≈ 32M / 100G = 0.32 ms (limited by bandwidth)

Operation B does 60× fewer FLOPs than A but takes longer, because it is memory-bound.

Shapes

The roofline model divides workloads into two regimes:

Memory-bound: intensity < peak_FLOPS / peak_bandwidth

Compute-bound: intensity > peak_FLOPS / peak_bandwidth

GEMV is almost always memory-bound. Large GEMM is almost always compute-bound. Attention score computation falls in between depending on sequence length.

Math

arithmetic_intensity = total_FLOPs / total_bytes_moved

machine_balance = peak_FLOPs_per_sec / peak_bytes_per_sec

if arithmetic_intensity < machine_balance: memory-bound

if arithmetic_intensity > machine_balance: compute-bound

Achievable performance = min(peak_FLOPS, arithmetic_intensity × peak_bandwidth)

Implementation Hook

In llama.cpp, whether a ggml_mul_mat call ends up compute-bound or memory-bound depends on the input batch size. The same kernel, the same weights — but with n_tokens = 1 (decode) the operation is memory-bound, and with n_tokens = 512 (prefill) it is compute-bound. Backend code may choose different execution strategies based on these shapes.

ggml/src/ggml-cpu/ggml-cpu.c — ggml_compute_forward_mul_mat() (L1235)

Performance Hook

Correctly classifying your bottleneck changes your optimization strategy entirely. For compute-bound work, you want better SIMD utilization, tiling, and instruction scheduling. For memory-bound work, you want to reduce bytes moved: quantization, smaller data types, and better memory access patterns. Applying the wrong optimization category wastes effort.

Check Yourself

reasoningQ1

An operation does 50 million FLOPs and moves 200 MB of data. On hardware with a machine balance of 100 FLOP/byte, is it compute-bound or memory-bound?

Compute-bound, because 50 million FLOPs is a lot of workMemory-bound, because its intensity (0.25 FLOP/byte) is far below the machine balance (100 FLOP/byte)Neither — the two are always balanced

reasoningQ2

You quantize weights from FP16 to INT8, halving the bytes moved. For a memory-bound decode step, what speedup do you expect? For a compute-bound prefill step?

Both get ~2× speedup because both move half the dataDecode gets ~2× speedup (memory-bound, so halving bytes nearly halves time); prefill gets little speedup (compute-bound, so bytes are not the bottleneck)Neither benefits because quantization only affects quality, not speed