GEMM and GEMV Feel Different on Hardware
GEMM (General Matrix-Matrix Multiply) multiplies two matrices. GEMV (General Matrix-Vector Multiply) multiplies a matrix by a single vector. Mathematically, GEMV is just GEMM where one matrix has only one column (or row). But on real hardware, the difference is enormous.
In GEMM, you load a block of the weight matrix once and reuse it across many input rows. The ratio of compute to memory access (the arithmetic intensity) is high, so the hardware stays busy doing arithmetic. In GEMV, you still load the entire weight matrix, but each weight is used for only one vector. The ratio of compute to memory access is low, so the hardware spends most of its time waiting for data.
This maps directly to inference phases. Prefill processes many tokens at once, so projections are GEMM. Decode processes one token at a time, so projections are GEMV. The same weight matrix, the same hardware, but very different utilization.
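Same weight matrix W [4096, 4096], different input shapes:
- Prefill (GEMM): X [512, 4096] × W [4096, 4096] -> Y [512, 4096]
- Decode (GEMV): x [1, 4096] × W [4096, 4096] -> y [1, 4096]
The weight matrix is the same size both times, so both cases read the same 4096 × 4096 (about 16.8 million) weights from memory. GEMM does 512× more arithmetic for the same weight load.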
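The 512× figure can be checked with a few lines of arithmetic. The sketch below assumes FP16 weights (2 bytes each), 512 prefill tokens (the count implied by the 512× ratio), and counts only weight traffic, ignoring activation reads and writes. The function names are made up for illustration.

```python
# Arithmetic intensity (FLOPs per byte of weight traffic) for one projection.
# Assumptions: FP16 weights (2 bytes each), m = 512 prefill tokens, and only
# weight bytes are counted (activation traffic is ignored for simplicity).

def matmul_flops(m, k, n):
    """One multiply and one add per (row, inner, column) triple."""
    return 2 * m * k * n

def weight_bytes(k, n, bytes_per_weight=2):
    """Bytes needed to read the [k, n] weight matrix once."""
    return k * n * bytes_per_weight

K = N = 4096
for name, m in [("prefill (GEMM)", 512), ("decode (GEMV)", 1)]:
    flops = matmul_flops(m, K, N)
    intensity = flops / weight_bytes(K, N)
    print(f"{name}: m={m}, FLOPs={flops:,}, FLOPs per weight byte={intensity:.0f}")
```

Under these assumptions the FLOPs per weight byte works out to exactly m, so prefill gets 512 operations out of every byte of weights it loads and decode gets 1.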
This GEMM/GEMV distinction has direct consequences for optimization strategy:
- Prefill (GEMM): Compute-bound. Better tiling, vectorization, and SIMD utilization directly reduce wall-clock time. Hardware accelerators (GPU tensor cores, AMX instructions) help substantially. Memory bandwidth matters less because weights are heavily reused.
- Decode (GEMV): Memory-bound. Better compute kernels help very little because the hardware is idle waiting for data. The effective optimization is to reduce bytes moved: quantize weights from FP16 to INT8 or INT4, use fewer KV heads (GQA), or batch multiple decode requests together (increasing m from 1 to batch_size, turning GEMV back into a small GEMM).
This is why a single optimization can have wildly different impact on prefill vs decode. An engineer who understands the GEMM/GEMV distinction can predict before benchmarking whether a given optimization will help prefill, decode, both, or neither.
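One way to make that kind of prediction is a rough roofline-style estimate: a kernel takes at least as long as the slower of its compute time and its memory time. The sketch below is a simplification, not a benchmark. The peak numbers are hypothetical placeholders, only weight traffic is counted, and dequantization cost is ignored.

```python
# Rough roofline-style estimate for an [m, k] x [k, n] projection.
# PEAK_FLOPS and PEAK_BYTES_PER_S are hypothetical values, not measurements
# of any real device. Only weight bytes are counted.

PEAK_FLOPS = 100e12        # assumed 100 TFLOP/s of arithmetic throughput
PEAK_BYTES_PER_S = 1e12    # assumed 1 TB/s of memory bandwidth

def estimate(m, k=4096, n=4096, bytes_per_weight=2.0):
    """Return (seconds, bound), where bound is 'compute' or 'memory'."""
    compute_s = 2 * m * k * n / PEAK_FLOPS
    memory_s = k * n * bytes_per_weight / PEAK_BYTES_PER_S
    bound = "compute" if compute_s > memory_s else "memory"
    return max(compute_s, memory_s), bound

for label, m, bpw in [
    ("prefill, FP16", 512, 2.0),
    ("prefill, INT4", 512, 0.5),
    ("decode, FP16", 1, 2.0),
    ("decode, INT4", 1, 0.5),
    ("decode, FP16, batch 32", 32, 2.0),
]:
    seconds, bound = estimate(m, bytes_per_weight=bpw)
    print(f"{label}: {seconds * 1e6:.2f} us, {bound}-bound")
```

With these assumed peaks, quantizing to INT4 leaves the prefill estimate unchanged but cuts the decode estimate to a quarter, and a batch of 32 decode requests costs the same estimated time as a single request because the weight load still dominates.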
In llama.cpp, the same ggml_mul_mat operation is used for both GEMM and GEMV. The backend dispatch code selects different kernel implementations depending on the shapes — a GEMV-optimized path avoids the overhead of tiling strategies designed for large matrices.
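The idea of one operation with shape-dependent kernels can be illustrated with a toy dispatcher. This is not ggml's code or API: mul_mat, gemv_kernel, gemm_kernel_tiled, and TILE are hypothetical names, and the pure-Python loops only show the structure, not realistic performance.

```python
# Illustrative shape-based dispatch. This is NOT ggml's implementation:
# every name here is hypothetical and chosen only for the example.

TILE = 4  # hypothetical tile size, kept tiny so the example runs quickly

def gemv_kernel(W, x):
    """y[j] = sum_k x[k] * W[k][j]; streams W once with no tiling bookkeeping."""
    n = len(W[0])
    y = [0.0] * n
    for k, xk in enumerate(x):
        row = W[k]
        for j in range(n):
            y[j] += xk * row[j]
    return y

def gemm_kernel_tiled(W, X):
    """Y = X @ W, computed tile by tile so each W tile is reused across rows of X."""
    m, k_dim, n = len(X), len(W), len(W[0])
    Y = [[0.0] * n for _ in range(m)]
    for k0 in range(0, k_dim, TILE):
        for j0 in range(0, n, TILE):
            # The same W tile serves all m input rows before moving on.
            for i in range(m):
                for k in range(k0, min(k0 + TILE, k_dim)):
                    xik = X[i][k]
                    for j in range(j0, min(j0 + TILE, n)):
                        Y[i][j] += xik * W[k][j]
    return Y

def mul_mat(W, X):
    """Single entry point; picks a kernel from the number of input rows."""
    if len(X) == 1:
        return [gemv_kernel(W, X[0])]
    return gemm_kernel_tiled(W, X)

W = [[1.0, 2.0], [3.0, 4.0]]
print(mul_mat(W, [[1.0, 1.0]]))              # one row: GEMV path
print(mul_mat(W, [[1.0, 0.0], [0.0, 1.0]]))  # two rows: tiled GEMM path
```

Both paths compute the same product. The difference is that the tiled path pays loop and bookkeeping overhead to get weight reuse, which only pays off when there are many input rows to reuse each tile across.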
GEMM improvements (better tiling, SIMD utilization) speed up prefill substantially because the hardware can be kept busy with arithmetic. GEMV is usually bottlenecked on memory bandwidth, so the same GEMM optimizations help less during decode. To speed up decode, you need to reduce bytes moved (e.g., quantization) rather than increase compute throughput.
Why does prefill benefit more from GEMM kernel improvements than decode does?
What is the key difference between GEMM and GEMV in terms of hardware utilization?