Case Study: A Speed Win That Failed Equivalence
You optimize a kernel, get a 15% speedup, merge it — and a week later someone discovers the model outputs are subtly wrong. Perplexity increased, certain prompts produce garbled text, or a downstream benchmark regressed. The speed was real. The answers were wrong.
This happens more often than you might expect, and there are two common causes:
- Layout changes. Repacking weights into a different memory layout for faster access can introduce transposition errors, alignment bugs, or incorrect stride calculations that silently corrupt results.
- Selector changes. A different kernel variant (e.g., a quantized path or a fused operator) may have subtle numerical differences or outright bugs for certain input shapes.
The fix is a discipline: every optimization must be accompanied by a correctness check. The standard approach is to measure perplexity on a reference dataset before and after the change. If perplexity increases beyond a small tolerance, the optimization is rejected regardless of how fast it is.
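The before-and-after check can be sketched in a few lines. This is a minimal illustration, not any project's actual tooling: it assumes you already have per-token natural-log probabilities from both builds, and the names `perplexity`, `accept_optimization`, and the `rel_tol` value are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean log-probability), using natural-log token log-probs."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def accept_optimization(baseline_ppl, optimized_ppl, rel_tol=0.005):
    """Return True when the optimized PPL is within rel_tol of the baseline.

    rel_tol is a hypothetical project-specific tolerance, not a recommended value.
    """
    return optimized_ppl <= baseline_ppl * (1.0 + rel_tol)

# Made-up log-probs for the same four tokens under each build.
baseline = perplexity([-1.20, -0.85, -2.10, -0.40])
optimized = perplexity([-1.21, -0.85, -2.35, -0.41])
print(f"baseline={baseline:.4f} optimized={optimized:.4f}")
print("accept" if accept_optimization(baseline, optimized) else "reject")
```

The gate is one-sided because the rule above rejects only increases in perplexity. The speedup never enters the decision.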
Isolating the regression means bisecting: which commit introduced the divergence? Once found, you compare the before-and-after numerics at each operator boundary to find where the values first diverge. Often the divergence is tiny per operation but compounds across layers.
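One way to automate the bisection is `git bisect run`, which treats a script's exit code 0 as "good", 1 as "bad", and 125 as "skip this commit". The sketch below shows only the decision logic. The function name and tolerance are hypothetical, and how the perplexity number is obtained is left out because it depends on the project's own tool.

```python
import math

# Exit codes that `git bisect run` interprets: 0 = good, 1 = bad, 125 = skip.
GOOD, BAD, SKIP = 0, 1, 125

def bisect_exit_code(measured_ppl, baseline_ppl, rel_tol=0.005):
    """Map a perplexity measurement to a `git bisect run` exit code.

    measured_ppl is None when the commit could not be built or evaluated.
    rel_tol is a hypothetical tolerance; use the project's own value.
    """
    if measured_ppl is None or math.isnan(measured_ppl):
        return SKIP
    if measured_ppl <= baseline_ppl * (1.0 + rel_tol):
        return GOOD
    return BAD

print(bisect_exit_code(5.91, 5.90))  # within tolerance
print(bisect_exit_code(6.20, 5.90))  # regression
print(bisect_exit_code(None, 5.90))  # commit could not be evaluated
```

A wrapper script would build the commit, run the perplexity tool, parse the reported PPL, and pass the result of `bisect_exit_code` to `sys.exit`.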
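A small per-operation error is not necessarily harmless. Consider a model with 32 layers, where an optimization introduces a relative error of 0.1% per matrix multiply. Each layer has ~7 matrix multiplies. After one layer, the error is roughly 0.7%. After 32 layers, if the errors compound multiplicatively in the worst case, the total is (1.001)^(7 × 32) = (1.001)^224 ≈ 1.25, a relative error of roughly 25%.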
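The worst-case compounding in this example can be checked with a few lines of Python. The sketch assumes every error has the same sign and multiplies through all 7 × 32 matrix multiplies. Real errors partly cancel, so treat the result as an upper-bound illustration rather than a prediction.

```python
# Worst-case sketch: every matmul error has the same sign and multiplies through.
per_matmul_error = 0.001   # 0.1% relative error per matrix multiply (from the text)
matmuls_per_layer = 7
layers = 32

per_layer = (1 + per_matmul_error) ** matmuls_per_layer - 1
total = (1 + per_matmul_error) ** (matmuls_per_layer * layers) - 1

print(f"after 1 layer:   {per_layer:.2%}")
print(f"after {layers} layers: {total:.2%}")
```

Under these assumptions the per-layer error comes out near 0.7% and the end-to-end error near 25%.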
Residual connections help here: they add the sub-layer output to the input, so a small error in the sub-layer is diluted by the much larger residual signal. The attention mechanism, however, can amplify errors. Softmax is exponential in the Q·K scores, so when two positions have nearly equal scores, a small perturbation can move the largest attention weight to a different position, and the output can change far more than the size of the perturbation suggests.
This is why "works fine on layer 1" is not sufficient validation. You must check the final outputs (logits or perplexity), not just intermediate tensors.
When you discover a regression, here is the systematic approach to finding the root cause:
- Reproduce reliably. Find a specific prompt + model + config that shows the divergence. Record the baseline and optimized perplexity numbers. Make sure the difference is consistent across runs (rule out non-determinism from threading or initialization).
- Bisect commits. Use git bisect with the perplexity tool as the test oracle. This narrows the problem to a single commit — often a specific file or function change.
- Dump intermediate tensors. Run both baseline and optimized on the same input, dumping the output of each operation. Compare tensor values layer by layer, operation by operation. The first point where values diverge beyond FP tolerance is the bug location.
- Check the edge cases. Once you know which operation diverges, test with different input shapes. Many repack and kernel bugs only manifest when dimensions are not cleanly divisible by the SIMD width (e.g., d_model = 4097 instead of 4096).
- Fix and re-validate. After fixing the bug, re-run the full perplexity check — not just the specific prompt that exposed the bug. The fix should not introduce new regressions.
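The tensor-dump comparison in the steps above can be sketched as a search for the first mismatch. This is an illustration under assumptions: the dumps are taken to be ordered lists of `(operation_name, values)` pairs, and the operation names, tolerances, and the `first_divergence` helper are all hypothetical.

```python
import math

def first_divergence(baseline, optimized, rel_tol=1e-3, abs_tol=1e-5):
    """Return (op_name, index, base_value, opt_value) for the first mismatch, or None.

    Both dumps are ordered lists of (op_name, values) in execution order.
    """
    if len(baseline) != len(optimized):
        raise ValueError("dumps contain a different number of operations")
    for (name_b, vals_b), (name_o, vals_o) in zip(baseline, optimized):
        if name_b != name_o or len(vals_b) != len(vals_o):
            raise ValueError(f"dumps are not aligned at {name_b!r} / {name_o!r}")
        for i, (b, o) in enumerate(zip(vals_b, vals_o)):
            # math.isclose is False for NaN, so NaNs are reported as divergence.
            if not math.isclose(b, o, rel_tol=rel_tol, abs_tol=abs_tol):
                return name_b, i, b, o
    return None

# Made-up dumps: the optimized build goes wrong in the first feed-forward output.
baseline = [
    ("layer0.attn_out", [0.5, -1.25, 2.0]),
    ("layer0.ffn_out", [1.0, 0.75, -0.5]),
    ("layer1.attn_out", [0.1, 0.2, 0.3]),
]
optimized = [
    ("layer0.attn_out", [0.5, -1.25, 2.0]),
    ("layer0.ffn_out", [1.0, 0.75, -0.62]),
    ("layer1.attn_out", [0.4, 0.1, 0.9]),
]
print(first_divergence(baseline, optimized))
```

Stopping at the first mismatch matters because everything downstream of the bug also differs. Later operations are symptoms, not causes.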
Manual validation catches regressions after the fact. Automated CI validation prevents them from merging. A robust inference project runs perplexity checks on every PR that touches:
- Kernel code (GEMM, GEMV, fused operators)
- Weight loading or repacking logic
- Quantization or dequantization routines
- Graph builder changes that alter operation order or fusion
The perplexity check takes minutes but prevents days of debugging regressions that only surface in production. Many projects gate merging on perplexity staying within a project-specific tolerance on a standard test set.
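A CI job can decide whether to run the perplexity check by matching the changed files against the sensitive areas listed above. The sketch below is only an illustration: the directory patterns are hypothetical placeholders, and a real project would substitute its own paths.

```python
from fnmatch import fnmatchcase

# Hypothetical path patterns for the four sensitive areas; adjust to the real repo layout.
SENSITIVE_PATTERNS = [
    "src/kernels/*",  # GEMM, GEMV, fused operators
    "src/repack/*",   # weight loading and repacking
    "src/quant/*",    # quantization and dequantization
    "src/graph/*",    # graph builder: operation order and fusion
]

def needs_perplexity_check(changed_files, patterns=SENSITIVE_PATTERNS):
    """Return True if any changed file falls under a sensitive pattern."""
    return any(
        fnmatchcase(path, pattern)
        for path in changed_files
        for pattern in patterns
    )

print(needs_perplexity_check(["src/kernels/gemv_avx2.c", "README.md"]))
print(needs_perplexity_check(["docs/build.md"]))
```

Running the check on every PR regardless of path is the safer default. Path filtering is a trade-off for projects where the check is too slow to run unconditionally.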
A weight repack optimization that introduces a silent bug:
- GEMV reads rows sequentially → correct output
- GEMV uses vectorized loads → 15% faster
- Bug: stride calculation off by 1 for d_in not divisible by 4
- Two layers: wrong values → perplexity +0.3
- Undetectable by spot-checking a single layer
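A toy reproduction of this class of bug shows why it hides so well. The sketch is a hypothetical illustration, not the actual repack code. It pads each row to a multiple of a 4-float lane width, and the buggy variant computes the row stride with a block count that is one short whenever d_in is not divisible by 4.

```python
LANES = 4  # hypothetical SIMD width in floats

def padded_width(d_in):
    """Round d_in up to the next multiple of LANES."""
    return (d_in + LANES - 1) // LANES * LANES

def repack(weights, d_out, d_in):
    """Copy row-major weights into rows zero-padded to a multiple of LANES."""
    stride = padded_width(d_in)
    packed = [0.0] * (d_out * stride)
    for r in range(d_out):
        for c in range(d_in):
            packed[r * stride + c] = weights[r * d_in + c]
    return packed

def gemv_reference(weights, x, d_out, d_in):
    return [sum(weights[r * d_in + c] * x[c] for c in range(d_in)) for r in range(d_out)]

def gemv_packed(packed, x, d_out, d_in, stride):
    return [sum(packed[r * stride + c] * x[c] for c in range(d_in)) for r in range(d_out)]

def matches_reference(d_out, d_in, buggy):
    """Compare the packed GEMV against the reference for one shape."""
    weights = [float(i + 1) for i in range(d_out * d_in)]
    x = [1.0] * d_in
    packed = repack(weights, d_out, d_in)
    # Buggy stride rounds down: one block short unless d_in % LANES == 0.
    stride = (d_in // LANES * LANES) if buggy else padded_width(d_in)
    return gemv_packed(packed, x, d_out, d_in, stride) == gemv_reference(weights, x, d_out, d_in)

for d_in in (8, 6):
    print(d_in, matches_reference(3, d_in, buggy=False), matches_reference(3, d_in, buggy=True))
```

With d_in = 8 the buggy and correct strides coincide, so a test suite that only uses aligned shapes passes. With d_in = 6 every row after the first reads from the wrong offset.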
Perplexity as a correctness metric:
The perplexity tool runs the model on a reference dataset and reports the PPL score. If an optimization changes the PPL beyond tolerance, it is flagged. The repack code in the CPU backend handles weight layout transformations — this is exactly the kind of code where subtle stride or alignment bugs cause regressions.
Speed without correctness is meaningless. The fastest code that produces wrong answers is not an optimization — it is a bug. In practice, the validation step (running perplexity) takes minutes but prevents days of debugging regressions downstream. Many projects run perplexity checks in CI to catch these issues automatically.
A kernel optimization gives a 20% speedup. What must you verify before accepting it?
Why are weight layout changes a common source of silent regressions?
How do you isolate which part of an optimization introduced a numerical regression?