A Speedup Is Not Valid Until You Validate Outputs
Making inference faster is only useful if the model still produces correct outputs. Many optimizations — quantization, kernel rewrites, layout changes, reduced precision — can subtly alter the numbers flowing through the model. A speed benchmark alone (tokens per second) cannot detect whether the outputs have degraded.
Logits comparison is the most direct check: run the same prompt through the original and optimized models and compare the raw logit vectors. For truly lossless changes (e.g., a memory layout repack with the same thread count and reduction order), logits should match within floating-point epsilon. Note that even "lossless" changes like different thread counts can alter the order of floating-point additions, producing small differences due to FP non-associativity. For lossy changes (quantization, reduced precision), the logits will differ — the question is whether the difference is acceptable. Logit comparison alone cannot answer that; you need perplexity or task-level evaluation.
Perplexity provides a higher-level check. Perplexity measures how surprised the model is by a test dataset — specifically, it is the exponentiated average negative log-likelihood of the correct next token. Lower perplexity means the model assigns higher probability to the right answers, which means better predictions. Typical large models score around 5–8 on wikitext-2. If your model scores 7.2 before an optimization and 7.3 after, that is likely within tolerance. If it jumps to 9.0+, something has broken. If perplexity increases meaningfully after an optimization, the model has gotten worse at predicting, even if it runs faster.
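The definition above can be written out in a few lines. This is a minimal sketch, assuming you already have the natural-log probability the model assigned to each correct next token; the function name `perplexity` and the input format are illustrative and do not come from any particular library.

```python
import math

def perplexity(token_log_probs):
    """Exponentiated average negative log-likelihood of the correct tokens.

    token_log_probs: natural-log probabilities assigned to each correct
    next token in the test text (hypothetical input format).
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# A model that gives every correct token probability 1/8 is as uncertain
# as a uniform choice among 8 options, so its perplexity is about 8.
uniform_guess = [math.log(1 / 8)] * 100
print(perplexity(uniform_guess))
```

Read this way, a perplexity of 7.2 means the model is, on average, about as uncertain as if it were choosing uniformly among roughly seven candidate tokens.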
A valid speedup shows: same or negligibly different logits, and no perplexity regression on a standard test set.
"Compare the logits" sounds simple, but there are several ways to measure divergence, each catching different kinds of problems:
1. Max absolute difference
For each token position, compute max|logit_baseline - logit_optimized| across the vocabulary. This catches the worst-case divergence. For a pure repack with an unchanged reduction order, it should be exactly zero. For changes that only reorder floating-point additions (such as a different thread count), it should stay within a small rounding tolerance (~1e-5 for FP32, ~1e-3 for FP16). If the max diff exceeds 0.01, something has changed beyond rounding.
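A minimal sketch of this metric, assuming logits are stored as one list of floats per token position. The layout and the name `max_abs_diff` are assumptions for illustration; a real pipeline would more likely operate on arrays dumped by the inference runtime.

```python
def max_abs_diff(baseline, optimized):
    """Largest |baseline - optimized| over the vocabulary, per position."""
    if len(baseline) != len(optimized):
        raise ValueError("position count mismatch")
    diffs = []
    for pos, (b_row, o_row) in enumerate(zip(baseline, optimized)):
        if len(b_row) != len(o_row):
            raise ValueError(f"vocab size mismatch at position {pos}")
        diffs.append(max(abs(b - o) for b, o in zip(b_row, o_row)))
    return diffs

# Toy data: 2 positions, vocabulary of 3 tokens.
baseline = [[2.0, -1.0, 0.5], [0.1, 3.0, -2.0]]
optimized = [[2.0, -1.0, 0.5], [0.1, 3.0, -1.5]]

per_position = max_abs_diff(baseline, optimized)
worst = max(per_position)
print(per_position, worst)  # position 1 carries the whole divergence
```

Keeping the per-position values, not only the overall maximum, helps localize a bug: a divergence that first appears at a particular position often points at a shape- or context-length-dependent code path.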
2. Cosine similarity of logit vectors
Treat each position's logit vector as a point in vocab_size-dimensional space. Cosine similarity near 1.0 means the vectors point in nearly the same direction — the overall distribution shape is preserved. This is useful for lossy optimizations like quantization where absolute values will differ. Note that high cosine similarity does not guarantee identical token rankings: two close logits can still swap order. But a cosine similarity below 0.999 at any position signals that the distribution has shifted enough to warrant investigation.
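The caveat about rankings is easy to demonstrate. The sketch below uses made-up logits for a four-token vocabulary; `cosine_similarity` and `argmax` are small helpers written for this example, not library functions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length logit vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def argmax(v):
    """Index of the largest element (the greedy-decoding choice)."""
    return max(range(len(v)), key=lambda i: v[i])

# Two nearly tied top logits swap places after the optimization.
baseline = [5.00, 4.99, -3.0, -3.0]
optimized = [4.99, 5.00, -3.0, -3.0]

sim = cosine_similarity(baseline, optimized)
print(sim)                                   # well above 0.999
print(argmax(baseline), argmax(optimized))   # different top token
```

Here the similarity stays above the 0.999 threshold while greedy decoding would pick a different token, which is why cosine similarity is best paired with a check on the top-ranked token or with perplexity.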
3. KL divergence of probability distributions
Convert logits to probabilities (softmax), then compute KL(baseline || optimized). This measures how much the sampling distribution has changed. KL divergence is sensitive to changes in the tail of the distribution — tokens with low probability. A KL divergence above 0.01 nats averaged across positions suggests meaningful behavioral change.
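One way to compute this, sketched with the standard library only. The softmax is done in log space with the usual max-subtraction trick so that large logits do not overflow; the helper names are assumptions for this example.

```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(baseline_logits, optimized_logits):
    """KL(baseline || optimized) in nats for a single position."""
    log_p = log_softmax(baseline_logits)
    log_q = log_softmax(optimized_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def mean_kl(baseline, optimized):
    """Average KL divergence over all token positions."""
    per_pos = [kl_divergence(b, o) for b, o in zip(baseline, optimized)]
    return sum(per_pos) / len(per_pos)

baseline = [[2.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
optimized = [[2.0, 1.0, 0.5], [0.0, 0.0, 0.0]]
print(mean_kl(baseline, optimized))  # positive: position 0 changed
```

Because softmax is invariant to adding a constant to every logit, a uniform offset in the optimized logits yields a KL of zero even though max absolute difference would flag it. The two metrics answer different questions.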
In practice, start with max absolute difference for lossless changes and perplexity for lossy ones. Use cosine similarity and KL divergence when you need to understand how the outputs differ, not just whether they differ.
A disciplined validation process follows a specific order, because each step catches different failure modes:
1. Logit comparison on short prompts. Run 5-10 short prompts through both the baseline and the optimized model and compare raw logits. If the max absolute diff is zero (or within FP epsilon), you have strong evidence that the optimization is numerically equivalent for the tested inputs. (Edge-case shapes may still differ; see step 4.)
2. Perplexity on a reference dataset. If logits differ (expected for quantization and precision changes), measure perplexity on wikitext-2 or a similar standard set. Compare against the baseline number. Record both numbers and the delta.
3. Generation quality spot-check. Run the model on a diverse set of prompts (code, reasoning, factual recall, creative) and manually inspect the outputs. Perplexity can miss mode collapse or systematic biases that only appear for certain prompt types.
4. Edge-case testing. Try inputs that stress the optimization: very long contexts, unusual token sequences, inputs that trigger all branches of a fused kernel. Many bugs only manifest for specific input shapes (e.g., dimensions not divisible by the SIMD width).
Steps 1-2 should be automated and run in CI. Steps 3-4 are manual but essential for high-risk changes.
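As a rough illustration of what automating steps 1 and 2 could look like, the sketch below turns the two measurements into a pass/fail record. The function name, the `expect_lossless` flag, and both default tolerances (1e-5 on logits, 2% relative on perplexity) are assumptions chosen for this example, not recommended values; pick thresholds that fit your model and precision.

```python
def validation_gate(worst_logit_diff, ppl_baseline, ppl_optimized,
                    expect_lossless=False,
                    logit_tol=1e-5, ppl_rel_tol=0.02):
    """Combine the logit check and the perplexity check into one verdict.

    A change declared lossless must also match logits within logit_tol.
    Every change must keep the relative perplexity increase within
    ppl_rel_tol. Both tolerances are illustrative defaults.
    """
    ppl_delta = ppl_optimized - ppl_baseline
    ppl_rel = ppl_delta / ppl_baseline
    logits_match = worst_logit_diff <= logit_tol
    ppl_ok = ppl_rel <= ppl_rel_tol
    passed = ppl_ok and (logits_match or not expect_lossless)
    return {
        "logits_match": logits_match,
        "ppl_baseline": ppl_baseline,
        "ppl_optimized": ppl_optimized,
        "ppl_delta": ppl_delta,
        "passed": passed,
    }

# A lossy change with a small perplexity delta passes under these thresholds.
print(validation_gate(0.3, 7.2, 7.3))
# The same speed-oriented change with a large regression does not.
print(validation_gate(0.3, 7.2, 9.1)["passed"])
```

Returning the raw numbers alongside the verdict keeps the record useful for review: the baseline, the optimized value, and the delta can be pasted directly into a change description.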
Understanding how optimizations typically fail helps you know what to look for:
- Stride/alignment bugs in repack: The weights are rearranged in memory but a stride calculation is off by one element. Most layers work fine (dimensions happen to be aligned), but a few layers silently read wrong values. Perplexity increases slightly — easy to miss without automated checks.
- Precision loss in fused kernels: A fused kernel combines multiple operations (e.g., GEMV + bias + activation) using intermediate FP16 where the original used FP32. Each step loses a small amount of precision. Each loss is negligible on its own, but compounded across 32+ layers the error grows.
- Quantization of sensitive layers: Not all layers tolerate quantization equally. Attention projection layers (especially Q and K) are often more sensitive than FFN layers. A uniform quantization scheme that works on average may degrade specific layers badly.
- Non-determinism from threading: Changing the thread count or thread dispatch order can change the order of floating-point additions. Because FP addition is not associative, (a+b)+c ≠ a+(b+c) in general, so the results differ. The differences are usually within tolerance, but logit comparison for such changes must use a tolerance rather than exact equality.
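The threading failure mode in the last bullet can be reproduced without any threads. The sketch below first shows the classic three-term case, then mimics a per-thread reduction by summing the same values in one chunk versus eight chunks. `naive_sum` uses an explicit loop on purpose, because Python's built-in `sum` may apply compensated summation to floats in recent versions; the chunking scheme is an assumption that stands in for a real thread pool.

```python
import math
import random

# The same three numbers, grouped two ways.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)              # False: the groupings round differently
print(math.isclose(left, right))  # True: equal within a relative tolerance

def naive_sum(values):
    """Plain left-to-right accumulation, like a simple reduction loop."""
    total = 0.0
    for v in values:
        total += v
    return total

def chunked_sum(values, n_chunks):
    """Sum per-chunk partials, then sum the partials (one chunk per 'thread')."""
    size = -(-len(values) // n_chunks)  # ceiling division
    partials = [naive_sum(values[i:i + size])
                for i in range(0, len(values), size)]
    return naive_sum(partials)

rng = random.Random(0)
values = [rng.uniform(-1.0, 1.0) for _ in range(100_000)]
one_thread = chunked_sum(values, 1)
eight_threads = chunked_sum(values, 8)
print(abs(one_thread - eight_threads))  # tiny; may or may not be exactly zero
```

This is the reason a logit check for a thread-count change should compare with a tolerance such as `math.isclose` rather than `==`.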
Comparing baseline vs optimized on a test set:
In llama.cpp, the perplexity tool evaluates a model on a test dataset (typically wikitext-2) and reports perplexity. This is the standard way to check whether a quantization or kernel change has regressed model quality. Running it before and after any optimization is essential practice.
Every performance optimization PR in a serious inference project should include both speed numbers and quality validation. "Faster" alone is not enough — you need to show that logits or perplexity have not regressed. Without this, a speed win might ship a broken model.
An optimization makes inference 40% faster but the logits diverge significantly from the baseline. Is this a successful optimization?
You optimized a kernel and decode speed improved by 25%. Perplexity on wikitext-2 went from 7.2 to 7.3. Your colleague's optimization improved speed by 35% but perplexity went from 7.2 to 9.1. Which optimization(s) should ship?