L33

GEMM and GEMV Feel Different on Hardware

15 min

Question

Why do GEMM and GEMV perform differently?

Intuition

GEMM (General Matrix-Matrix Multiply) multiplies two matrices. GEMV (General Matrix-Vector Multiply) multiplies a matrix by a single vector. Mathematically, GEMV is just GEMM where one matrix has only one column (or row). But on real hardware, the difference is enormous.

In GEMM, you load a block of the weight matrix once and reuse it across many input rows. The ratio of compute to memory access is high — hardware stays busy doing arithmetic. In GEMV, you still load the entire weight matrix, but you only use it for one vector. The ratio of compute to memory access is low — hardware spends most of its time waiting for data.

This maps directly to inference phases. Prefill processes many tokens at once, so projections are GEMM. Decode processes one token at a time, so projections are GEMV. The same weight matrix, the same hardware, but very different utilization.

Toy Example

Same weight matrix W [4096, 4096], different input shapes:

GEMM (prefill, 512 tokens):

Input: [512, 4096] × W [4096, 4096]

FLOPs: 512 × 4096 × 4096 × 2 ≈ 17.2 billion

Bytes loaded (W, FP16 at 2 bytes per element): 4096 × 4096 × 2 ≈ 33.6 MB

Ratio: ~512 FLOPs per byte of W loaded

GEMV (decode, 1 token):

Input: [1, 4096] × W [4096, 4096]

FLOPs: 1 × 4096 × 4096 × 2 ≈ 33.6 million

Bytes loaded (W, FP16 at 2 bytes per element): 4096 × 4096 × 2 ≈ 33.6 MB

Ratio: ~1 FLOP per byte of W loaded

The weight matrix is the same size both times, so both cases load the same 33.6 MB of W. GEMM does 512× more arithmetic for that same weight traffic.
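The arithmetic in this example can be reproduced in a few lines. This is a minimal sketch, assuming FP16 weights (2 bytes per element) and counting only the bytes of W, as the example does. The function name flops_per_weight_byte is made up for illustration.

```python
def flops_per_weight_byte(m: int, k: int, n: int, bytes_per_weight: int = 2) -> float:
    """FLOPs per byte of W for [m, k] x [k, n], counting only weight traffic."""
    flops = 2 * m * k * n                    # one multiply and one add per (m, k, n) triple
    weight_bytes = k * n * bytes_per_weight  # W is loaded once (assumption: FP16)
    return flops / weight_bytes


prefill = flops_per_weight_byte(m=512, k=4096, n=4096)
decode = flops_per_weight_byte(m=1, k=4096, n=4096)
print(f"prefill: {prefill:.0f} FLOPs/byte, decode: {decode:.0f} FLOPs/byte")
# prints: prefill: 512 FLOPs/byte, decode: 1 FLOPs/byte
```

With only weight traffic counted, k and n cancel, so the ratio depends on m alone (for a fixed element size).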

Shapes

GEMM: [m, k] × [k, n] — m > 1, n > 1

GEMV: [1, k] × [k, n] — one input row

The "arithmetic intensity" (FLOPs per byte of memory traffic) grows roughly in proportion to m as long as m is small compared with k and n. GEMV has m = 1, the lowest possible.

Different Problems Need Different Optimizations

This GEMM/GEMV distinction has direct consequences for optimization strategy:

Prefill (GEMM): Compute-bound. Better tiling, vectorization, and SIMD utilization directly reduce wall-clock time. Hardware accelerators (GPU tensor cores, AMX instructions) help substantially. Memory bandwidth matters less because weights are heavily reused.

Decode (GEMV): Memory-bound. Better compute kernels help very little because the hardware is idle waiting for data. The effective optimization is to reduce bytes moved: quantize weights from FP16 to INT8 or INT4, use fewer KV heads (GQA), or batch multiple decode requests together (increasing m from 1 to batch_size, turning GEMV back into a small GEMM).

This is why a single optimization can have wildly different impact on prefill vs decode. An engineer who understands the GEMM/GEMV distinction can predict before benchmarking whether a given optimization will help prefill, decode, both, or neither.
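To make the decode-side levers concrete, the sketch below shows how weight quantization and batching each change the FLOPs done per byte of W loaded. It is an illustration under simplifying assumptions: the 4096 × 4096 shape is borrowed from the toy example, quantized weights are assumed to take exactly bits/8 bytes each (real quantization formats also store scales and other metadata), and the helper names are hypothetical.

```python
K, N = 4096, 4096  # weight shape borrowed from the toy example


def weight_bytes(bits_per_weight: int) -> int:
    """Bytes needed to stream W once, ignoring quantization metadata (assumption)."""
    return K * N * bits_per_weight // 8


def decode_flops_per_weight_byte(batch_size: int, bits_per_weight: int) -> float:
    """FLOPs per byte of W when batch_size decode requests share one pass over W."""
    flops = 2 * batch_size * K * N
    return flops / weight_bytes(bits_per_weight)


for bits in (16, 8, 4):
    for batch_size in (1, 8):
        ratio = decode_flops_per_weight_byte(batch_size, bits)
        mb = weight_bytes(bits) / 1e6
        print(f"{bits:>2}-bit weights, batch {batch_size}: "
              f"{mb:5.1f} MB loaded, {ratio:5.1f} FLOPs per weight byte")
```

Quantization shrinks the denominator (fewer bytes per pass over W), while batching grows the numerator (more useful work per pass). Neither changes how fast the arithmetic units themselves run.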

Math

Arithmetic intensity = FLOPs / bytes_moved

GEMM: ≈ 2 × m × k × n / (sizeof(element) × (m×k + k×n + m×n))

GEMV (m = 1): ≈ 2 × k × n / (sizeof(element) × (k + k×n + n)) ≈ 2 / sizeof(element), because the k×n weight term dominates the bytes moved

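For FP16: GEMV intensity ≈ 1 FLOP/byte. Hardware whose compute-to-bandwidth ratio is 100+ FLOPs per byte needs roughly that much arithmetic per byte loaded to stay busy, so GEMV leaves most of its compute capacity unused.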
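The formula can be evaluated directly to see how intensity grows with m. The sketch below is a simplified model, assuming each of the input, weight, and output matrices crosses the memory bus exactly once and that all three use the same element size. The function name is illustrative.

```python
def arithmetic_intensity(m: int, k: int, n: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte for [m, k] x [k, n], with input, W, and output each moved once."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


for m in (1, 8, 64, 512):
    intensity = arithmetic_intensity(m, k=4096, n=4096)
    print(f"m = {m:>3}: {intensity:6.1f} FLOPs/byte")
```

For m = 1 the result is just under 1 FLOP/byte, matching the FP16 GEMV estimate. For m = 512 it is a little below 512, because the activations and outputs are no longer negligible next to W.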

Implementation Hook

In llama.cpp, the same ggml_mul_mat operation is used for both GEMM and GEMV. The backend dispatch code selects different kernel implementations depending on the shapes — a GEMV-optimized path avoids the overhead of tiling strategies designed for large matrices.

ggml/src/ggml-cpu/ggml-cpu.c — ggml_compute_forward_mul_mat() (L1235)
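The idea of one entry point with shape-dependent kernels can be sketched in plain Python. This is a toy illustration, not llama.cpp's actual dispatch logic: the names mul_mat, gemv, and gemm_tiled, the tile size, and the list-of-lists layout are all assumptions made for the example.

```python
def gemv(x, w):
    """x: length-k list, w: k rows of n columns. Streams through W exactly once."""
    n = len(w[0])
    out = [0.0] * n
    for x_i, w_row in zip(x, w):
        for j in range(n):
            out[j] += x_i * w_row[j]
    return out


def gemm_tiled(a, w, tile_m=4):
    """Handle tile_m input rows per pass so each row of W is reused tile_m times."""
    n = len(w[0])
    out = [[0.0] * n for _ in a]
    for m0 in range(0, len(a), tile_m):
        rows = range(m0, min(m0 + tile_m, len(a)))
        for i, w_row in enumerate(w):   # each row of W is read once per tile...
            for r in rows:              # ...and reused for every input row in the tile
                a_ri = a[r][i]
                for j in range(n):
                    out[r][j] += a_ri * w_row[j]
    return out


def mul_mat(a, w):
    """Hypothetical shape-based dispatch: one operation, two kernels."""
    if len(a) == 1:                     # single input row: skip the tiling machinery
        return [gemv(a[0], w)]
    return gemm_tiled(a, w)


w = [[1.0, 2.0], [3.0, 4.0]]
print(mul_mat([[1.0, 1.0]], w))              # takes the GEMV path
print(mul_mat([[1.0, 0.0], [0.0, 1.0]], w))  # takes the tiled GEMM path
```

Both paths compute the same product. The tiled path only pays off when there are several input rows to amortize each load of W across.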

Performance Hook

GEMM improvements (better tiling, SIMD utilization) speed up prefill substantially because the hardware can be kept busy with arithmetic. GEMV is usually bottlenecked on memory bandwidth, so the same GEMM optimizations help less during decode. To speed up decode, you need to reduce bytes moved (e.g., quantization) rather than increase compute throughput.
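A rough time estimate makes this asymmetry visible. The sketch below assumes a simple model in which a projection takes as long as the slower of its compute time and its weight-loading time. The peak throughput and bandwidth figures are hypothetical placeholders, not measurements of any real device.

```python
ASSUMED_PEAK_FLOPS = 1e12    # hypothetical: 1 TFLOP/s of usable compute
ASSUMED_BANDWIDTH = 100e9    # hypothetical: 100 GB/s of memory bandwidth


def estimated_seconds(m, k, n, peak_flops, bandwidth, bytes_per_weight=2.0):
    """Time is bounded below by the slower of arithmetic and weight loading."""
    compute_s = 2 * m * k * n / peak_flops
    memory_s = k * n * bytes_per_weight / bandwidth
    return max(compute_s, memory_s)


for label, m in (("prefill (m=512)", 512), ("decode (m=1)", 1)):
    base = estimated_seconds(m, 4096, 4096, ASSUMED_PEAK_FLOPS, ASSUMED_BANDWIDTH)
    faster_compute = estimated_seconds(m, 4096, 4096, 2 * ASSUMED_PEAK_FLOPS, ASSUMED_BANDWIDTH)
    int8_weights = estimated_seconds(m, 4096, 4096, ASSUMED_PEAK_FLOPS, ASSUMED_BANDWIDTH,
                                     bytes_per_weight=1.0)
    print(f"{label}: base {base * 1e6:8.1f} us, "
          f"2x compute {faster_compute * 1e6:8.1f} us, "
          f"INT8 weights {int8_weights * 1e6:8.1f} us")
```

Under these assumed numbers, doubling compute throughput halves the prefill estimate and leaves the decode estimate unchanged, while halving the bytes per weight does the opposite.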

Check Yourself

Q1 (reasoning)

Why does prefill benefit more from GEMM kernel improvements than decode does?

Prefill uses different weight matrices that are easier to optimize

Prefill runs GEMM (many tokens, high arithmetic intensity) so compute improvements matter; decode runs GEMV (one token, memory-bound) where compute throughput is not the bottleneck

Decode does not use matrix multiplication at all

Q2 (conceptual)

What is the key difference between GEMM and GEMV in terms of hardware utilization?

GEMM reuses loaded weight data across many input rows, keeping ALUs busy; GEMV loads the same weights but only uses them once

GEMV uses special hardware units that GEMM cannot access

GEMM and GEMV have the same hardware utilization if the weight matrix is the same size