L43

Case Study: A Speed Win That Failed Equivalence

20 min

Question

What happens when a speed win is invalid?

Intuition

You optimize a kernel, get a 15% speedup, merge it — and a week later someone discovers the model outputs are subtly wrong. Perplexity increased, certain prompts produce garbled text, or a downstream benchmark regressed. The speed was real. The answers were wrong.

This happens more often than you might expect, and there are two common causes:

Layout changes: Repacking weights into a different memory layout for faster access can introduce transposition errors, alignment bugs, or incorrect stride calculations that silently corrupt results.
Selector changes: Choosing a different kernel variant (e.g., a quantized path or a fused operator) may introduce subtle numerical differences or expose outright bugs for certain input shapes.

The fix is a discipline: every optimization must be accompanied by a correctness check. The standard approach is to measure perplexity on a reference dataset before and after the change. If perplexity increases beyond a small tolerance, the optimization is rejected regardless of how fast it is.

Isolating the regression means bisecting: which commit introduced the divergence? Once found, you compare the before-and-after numerics at each operator boundary to find where the values first diverge. Often the divergence is tiny per operation but compounds across layers.
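One way to use a perplexity measurement as a bisect oracle is sketched below. `git bisect run` treats exit code 0 as "good", 125 as "skip this commit", and any other code from 1 to 127 as "bad". The function name, the baseline, and the tolerance are hypothetical, and the step that builds the commit and obtains the measured perplexity is left out because it depends on the project.

```python
from typing import Optional

GOOD, BAD, SKIP = 0, 1, 125  # exit codes understood by `git bisect run`


def bisect_exit_code(measured_ppl: Optional[float],
                     baseline_ppl: float,
                     tolerance: float) -> int:
    """Map a perplexity measurement to a `git bisect run` exit code.

    measured_ppl is None when the commit could not be built or evaluated.
    """
    if measured_ppl is None:
        return SKIP  # cannot judge this commit; let bisect try a neighbour
    if measured_ppl - baseline_ppl > tolerance:
        return BAD   # perplexity regressed beyond the tolerance
    return GOOD
```

A wrapper script would measure perplexity for the checked-out commit and finish with `sys.exit(bisect_exit_code(...))`.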

How Errors Compound Across Layers

A small per-operation error is not necessarily harmless. Consider a model with 32 layers, where an optimization introduces a relative error of 0.1% per matrix multiply. Each layer has ~7 matrix multiplies. After one layer, the error is roughly 0.7%. After 32 layers:

Per matmul error: 0.001 (relative)

Matmuls per layer: ~7

Layers: 32

Loose worst-case intuition (scalar model): (1.001)^(7 × 32) ≈ 1.25
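The arithmetic above can be reproduced in a few lines. This is only the scalar worst-case model from the text (every matmul scales the value by the same factor and nothing cancels), not a simulation of a real network. The function name is made up for illustration.

```python
def compounded_error(per_op_rel_error: float, ops_per_layer: int, layers: int) -> float:
    """Accumulated relative error if every op scales the value by (1 + e)."""
    return (1.0 + per_op_rel_error) ** (ops_per_layer * layers) - 1.0


after_one_layer = compounded_error(0.001, 7, 1)
after_all_layers = compounded_error(0.001, 7, 32)
print(f"1 layer:   {after_one_layer:.2%}")
print(f"32 layers: {after_all_layers:.2%}")
```

Under this model the error is roughly 0.7% after one layer and roughly 25% after all 32.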

Real tensors do not compound this cleanly: residual connections dilute errors, norms re-scale them, and positive and negative deviations partially cancel. But the key point stands: small per-operator errors can grow nonlinearly across layers, so a small error at one operator does not guarantee a small error at the output.

The residual connections actually help here: they add the sub-layer output to the input, so a small error in the sub-layer is diluted by the much larger residual signal. The attention mechanism, however, can amplify errors. Softmax itself is smooth, but when two Q·K scores are nearly tied, a small perturbation can move most of the attention weight to a different position, producing a disproportionately large change in the output.

This is why "works fine on layer 1" is not sufficient validation. To accept a change you must check the final outputs (logits or perplexity). Intermediate tensors are useful for locating a bug, not for proving its absence.
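A small sketch of the attention effect, using made-up scores. Positions 0 and 1 are nearly tied, so a perturbation of about 1% of the score magnitude changes which position receives most of the weight. The numbers are assumptions chosen to make the effect visible, not measurements from a model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_position(scores):
    """Index of the position that receives the largest attention weight."""
    weights = softmax(scores)
    return max(range(len(weights)), key=weights.__getitem__)


baseline = [40.0, 40.4, 10.0]    # positions 0 and 1 are nearly tied
perturbed = [40.5, 40.2, 10.0]   # each score moved by about 1% or less

print(top_position(baseline), [round(w, 3) for w in softmax(baseline)])
print(top_position(perturbed), [round(w, 3) for w in softmax(perturbed)])
```

The baseline favors position 1 and the perturbed scores favor position 0, so the weighted sum of values that attention returns is drawn mostly from a different token.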

The Debugging Workflow

When you discover a regression, here is the systematic approach to finding the root cause:

Reproduce reliably. Find a specific prompt + model + config that shows the divergence. Record the baseline and optimized perplexity numbers. Make sure the difference is consistent across runs (rule out non-determinism from threading or initialization).
Bisect commits. Use git bisect with the perplexity tool as the test oracle. This narrows the problem to a single commit — often a specific file or function change.
Dump intermediate tensors. Run both baseline and optimized on the same input, dumping the output of each operation. Compare tensor values layer by layer, operation by operation. The first point where values diverge beyond FP tolerance is the bug location.
Check the edge cases. Once you know which operation diverges, test with different input shapes. Many repack and kernel bugs only manifest when dimensions are not cleanly divisible by the SIMD width (e.g., d_model = 4097 instead of 4096).
Fix and re-validate. After fixing the bug, re-run the full perplexity check — not just the specific prompt that exposed the bug. The fix should not introduce new regressions.
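The tensor-comparison step can be sketched as follows, assuming both runs have already dumped their per-operation outputs as flat lists of floats in execution order. The dump format, operation names, and tolerances are hypothetical.

```python
import math


def first_divergence(baseline, optimized, rel_tol=1e-5, abs_tol=1e-8):
    """Return (op_name, index, a, b) for the first mismatching value, or None.

    baseline / optimized: lists of (op_name, values) in execution order,
    where values is a flat list of floats dumped after that operation.
    """
    for (name_a, vals_a), (name_b, vals_b) in zip(baseline, optimized):
        if name_a != name_b or len(vals_a) != len(vals_b):
            return (name_a, None, None, None)  # the two graphs differ structurally
        for i, (a, b) in enumerate(zip(vals_a, vals_b)):
            if not math.isclose(a, b, rel_tol=rel_tol, abs_tol=abs_tol):
                return (name_a, i, a, b)
    return None


# Hypothetical dumps: the optimized run goes wrong inside "layer0.ffn_up".
base = [("layer0.attn_out", [0.12, -0.50]),
        ("layer0.ffn_up", [1.00, 2.00]),
        ("layer0.ffn_down", [0.30, 0.40])]
opt = [("layer0.attn_out", [0.12, -0.50]),
       ("layer0.ffn_up", [1.00, 2.25]),
       ("layer0.ffn_down", [0.31, 0.47])]

print(first_divergence(base, opt))  # ('layer0.ffn_up', 1, 2.0, 2.25)
```

Later operations also differ, but only the first mismatch is reported because everything downstream of it is contaminated.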

Why CI Validation Matters

Manual validation catches regressions after the fact. Automated CI validation prevents them from merging. A robust inference project runs perplexity checks on every PR that touches:

Kernel code (GEMM, GEMV, fused operators)
Weight loading or repacking logic
Quantization or dequantization routines
Graph builder changes that alter operation order or fusion

The perplexity check takes minutes but prevents days of debugging regressions that only surface in production. Many projects gate merging on perplexity staying within a project-specific tolerance on a standard test set.
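A CI job could decide whether a PR needs the perplexity gate by matching its changed files against a list of sensitive paths. The sketch below assumes the list of changed files is already available; the glob patterns are placeholders, not paths from any particular repository.

```python
from fnmatch import fnmatch

# Hypothetical glob patterns; a real project would list its own paths.
GATED_PATTERNS = [
    "src/kernels/*",
    "src/quant/*",
    "src/repack*",
    "src/graph_builder*",
]


def needs_perplexity_check(changed_files):
    """True if any changed file matches a pattern that gates on perplexity."""
    return any(fnmatch(path, pattern)
               for path in changed_files
               for pattern in GATED_PATTERNS)
```

For example, `needs_perplexity_check(["docs/README.md", "src/kernels/gemv.c"])` returns `True`, while a docs-only change returns `False`.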

Toy Example

A weight repack optimization that introduces a silent bug:

Before (correct):

W stored as [d_out, d_in] row-major
GEMV reads rows sequentially → correct output

After (optimized repack):

W repacked into [d_out/4, d_in, 4] blocked layout
GEMV uses vectorized loads → 15% faster
Bug: stride calculation off by 1 for d_in not divisible by 4

Effect:

Most layers: correct (d_in divisible by 4)
Two layers: wrong values → perplexity +0.3
Undetectable by spot-checking a single layer

Shapes

Original layout: [d_out, d_in] contiguous

Repacked layout: [d_out/block, d_in, block] (blocked for vectorized access)

The mathematical result must be identical. Any difference is a bug.
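A minimal equivalence test for this kind of repack, written in plain Python rather than as a real SIMD kernel. It assumes the ragged last block is zero-padded, which is one possible design and not necessarily what any given backend does. Both versions accumulate in the same order, so exact equality is expected here; a kernel that reorders the summation would need a small floating-point tolerance instead.

```python
import random

BLOCK = 4  # rows per block, matching the [d_out/4, d_in, 4] layout in the text


def gemv_reference(w, x):
    """y = W @ x with W stored as d_out rows of length d_in."""
    y = []
    for row in w:
        acc = 0.0
        for wij, xj in zip(row, x):
            acc += wij * xj
        y.append(acc)
    return y


def repack(w):
    """Flatten W into [n_blocks, d_in, BLOCK] order, zero-padding the last block."""
    d_out, d_in = len(w), len(w[0])
    n_blocks = (d_out + BLOCK - 1) // BLOCK
    packed = [0.0] * (n_blocks * d_in * BLOCK)
    for r in range(d_out):
        b, lane = divmod(r, BLOCK)
        for c in range(d_in):
            packed[(b * d_in + c) * BLOCK + lane] = w[r][c]
    return packed


def gemv_blocked(packed, x, d_out):
    """Same product from the repacked buffer, one block of BLOCK rows at a time."""
    d_in = len(x)
    n_blocks = (d_out + BLOCK - 1) // BLOCK
    y = [0.0] * d_out
    for b in range(n_blocks):
        acc = [0.0] * BLOCK
        for c in range(d_in):
            base = (b * d_in + c) * BLOCK
            for lane in range(BLOCK):  # stands in for one vector multiply-add
                acc[lane] += packed[base + lane] * x[c]
        for lane in range(BLOCK):
            r = b * BLOCK + lane
            if r < d_out:              # drop the padded rows
                y[r] = acc[lane]
    return y


def check_equivalence(d_out, d_in, seed=0):
    rng = random.Random(seed)
    w = [[rng.uniform(-1, 1) for _ in range(d_in)] for _ in range(d_out)]
    x = [rng.uniform(-1, 1) for _ in range(d_in)]
    return gemv_reference(w, x) == gemv_blocked(repack(w), x, d_out)


# Include shapes that are not multiples of BLOCK in either dimension.
shapes = [(8, 8), (8, 7), (7, 8), (5, 3), (1, 1), (9, 13)]
results = {shape: check_equivalence(*shape) for shape in shapes}
print(results)
```

Testing only shapes such as (8, 8) would miss an indexing bug that appears at ragged edges, which is why the shape list deliberately includes awkward sizes.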

Math

Perplexity as a correctness metric:

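PPL = exp(−(1/N) ∑_{i=1}^{N} log P(token_i | context_i))

Here N is the number of evaluated tokens and context_i is the tokens that precede token_i.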

Lower PPL = better predictions. A lossless optimization (repack, threading) should produce the same PPL, up to floating-point rounding if the change alters the order of summation.

Lossy changes (quantization): acceptable tolerance is project-specific, typically < 0.1 PPL increase.
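The formula and the tolerance check can be sketched directly. The log-probabilities below are made up, and the 0.1 tolerance is just the illustrative figure from the text, not a recommendation.

```python
import math


def perplexity(token_log_probs):
    """PPL = exp(-(1/N) * sum of natural-log token probabilities)."""
    n = len(token_log_probs)
    if n == 0:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / n)


def passes_gate(baseline_ppl, candidate_ppl, tolerance):
    """Accept the change only if PPL did not rise by more than `tolerance`."""
    return candidate_ppl - baseline_ppl <= tolerance


# Made-up natural-log probabilities for a four-token sequence.
baseline_lp = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.25)]
candidate_lp = [math.log(0.5), math.log(0.25), math.log(0.4), math.log(0.25)]

base_ppl = perplexity(baseline_lp)
cand_ppl = perplexity(candidate_lp)
print(round(base_ppl, 3), round(cand_ppl, 3),
      passes_gate(base_ppl, cand_ppl, tolerance=0.1))
```

Lowering a single token's probability from 0.5 to 0.4 raises PPL from about 2.83 to about 2.99, which fails a 0.1 gate.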

Implementation Hook

The perplexity tool runs the model on a reference dataset and reports the PPL score. If an optimization changes the PPL beyond tolerance, it is flagged. The repack code in the CPU backend handles weight layout transformations — this is exactly the kind of code where subtle stride or alignment bugs cause regressions.

tools/perplexity/perplexity.cpp — main() (L2007)

ggml/src/ggml-cpu/repack.cpp

Performance Hook

Speed without correctness is meaningless. The fastest code that produces wrong answers is not an optimization; it is a bug. A perplexity run costs minutes per change, which is small next to the cost of a regression discovered downstream.

Check Yourself

Q1 (conceptual)

A kernel optimization gives a 20% speedup. What must you verify before accepting it?

That the kernel runs on all GPU architectures
That model output quality (e.g., perplexity) is unchanged within tolerance
That the kernel uses less memory than before

Q2 (conceptual)

Why are weight layout changes a common source of silent regressions?

They change the model architecture
They can introduce stride, alignment, or transposition errors that produce wrong values without crashing
They always reduce numerical precision

Q3 (conceptual)

How do you isolate which part of an optimization introduced a numerical regression?

Run the model at half precision and compare
Bisect commits, then compare per-operator numerics between the before and after versions to find where values first diverge
Profile the model and find the slowest operator