Quiz: Performance Reasoning — LLM From the Ground Up

Check Yourself

Q1 (reasoning)

An operation loads a 4096×4096 FP16 weight matrix (32 MiB, about 33.5 million bytes) and multiplies it by a single vector. It performs ~33.5M FLOPs. The hardware has a machine balance of 100 FLOP/byte. Is this operation compute-bound or memory-bound?

A. Compute-bound: 33.5M FLOPs is a lot of arithmetic
B. Memory-bound: intensity is about 33.5M FLOPs / 33.5M bytes = 1 FLOP/byte, far below the 100 FLOP/byte balance point
C. Neither: it depends on the batch size
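
The arithmetic behind this question can be checked with a short sketch. It assumes only weight traffic counts toward bytes moved (the input and output vectors are negligible here) and counts one multiply plus one add per weight. The function name `gemv_intensity` is made up for illustration.

```python
def gemv_intensity(rows, cols, bytes_per_weight=2):
    """Return (flops, bytes_moved, FLOP/byte) for one matrix-vector product."""
    flops = 2 * rows * cols                        # one multiply + one add per weight
    bytes_moved = rows * cols * bytes_per_weight   # weight traffic only (assumption)
    return flops, bytes_moved, flops / bytes_moved

MACHINE_BALANCE = 100.0  # FLOP/byte, taken from the question

flops, bytes_moved, intensity = gemv_intensity(4096, 4096)
bound = "memory-bound" if intensity < MACHINE_BALANCE else "compute-bound"
print(f"{flops / 1e6:.1f}M FLOPs, {bytes_moved / 2**20:.0f} MiB, "
      f"{intensity:.1f} FLOP/byte -> {bound}")
```

Comparing the intensity against the machine balance, rather than looking at the raw FLOP count, is what decides which resource limits the kernel.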

Q2 (reasoning)

After quantizing a model from FP16 to Q8_0, you measure that decode speed improved by 40% but perplexity increased from 5.8 to 5.9. Is this a valid optimization?

A. No: any perplexity increase means the optimization is invalid
B. Likely yes: the perplexity increase is small (~1.7%), suggesting quality is largely preserved, and the 40% speedup is substantial
C. Cannot tell: perplexity is not a valid metric for quantization

Q3 (reasoning)

Decode processes one token per step, so its weight projections are matrix-vector products (GEMV). An engineer adds a GEMM tiling optimization that improves large-matrix compute throughput by 30%. Which phase benefits more?

A. Decode, because it runs more steps overall
B. Prefill, because it runs large GEMMs that are compute-bound and can use the extra throughput; decode is memory-bound and will not benefit much
C. Both benefit equally since they use the same weight matrices
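
A simple roofline-style estimate shows why the two phases respond differently. The sketch below is a rough model, not a measurement: the bandwidth (1 TB/s) and peak throughput (100 TFLOP/s) are assumed values chosen to give a 100 FLOP/byte balance, the 512-token prefill size is arbitrary, and activation traffic is ignored.

```python
def roofline_time(flops, bytes_moved, peak_flops, bandwidth):
    """Lower-bound kernel time: the slower of compute and memory traffic."""
    return max(flops / peak_flops, bytes_moved / bandwidth)

D = 4096
WEIGHT_BYTES = D * D * 2   # FP16 weights of one 4096x4096 projection
BANDWIDTH = 1e12           # assumed: 1 TB/s
PEAK = 100e12              # assumed: 100 TFLOP/s, i.e. balance = 100 FLOP/byte

def speedup(tokens, gain=1.3):
    """Estimated speedup when compute throughput improves by `gain`."""
    flops = 2 * tokens * D * D  # weight traffic is the same for any token count
    before = roofline_time(flops, WEIGHT_BYTES, PEAK, BANDWIDTH)
    after = roofline_time(flops, WEIGHT_BYTES, PEAK * gain, BANDWIDTH)
    return before / after

prefill_speedup = speedup(tokens=512)  # many tokens share one weight load
decode_speedup = speedup(tokens=1)     # one token per weight load
print(f"prefill: {prefill_speedup:.2f}x, decode: {decode_speedup:.2f}x")
```

Under these assumptions the prefill GEMM is limited by compute, so it picks up the full gain, while the decode GEMV is limited by the time to stream the weights and does not change.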

Q4 (reasoning)

A model is quantized from FP16 to Q4_0. Decode speed improves by 50%, but perplexity on wikitext-2 increases from 5.8 to 7.2. What should you conclude?

A. The optimization is successful because it is faster
B. The speed gain is real, but model quality has degraded significantly, so this is not a safe drop-in replacement
C. Perplexity always increases with quantization, so this is expected and acceptable
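
Putting the perplexity numbers from the two quantization questions side by side makes the difference in scale concrete. This is a minimal sketch; `relative_increase` is an illustrative helper, and what counts as an acceptable increase depends on the application.

```python
def relative_increase(baseline, quantized):
    """Fractional change in perplexity relative to the FP16 baseline."""
    return (quantized - baseline) / baseline

q8 = relative_increase(5.8, 5.9)  # Q8_0 case
q4 = relative_increase(5.8, 7.2)  # Q4_0 case
print(f"Q8_0: +{q8:.1%}, Q4_0: +{q4:.1%}")
# prints "Q8_0: +1.7%, Q4_0: +24.1%"
```

A speedup figure on its own says nothing about quality, so the relative perplexity change is worth reporting alongside it.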

Q5 (shape)

A weight matrix has logical shape [4096, 4096]. After repacking for a SIMD kernel, what changes?

A. The logical shape changes to match the SIMD register width
B. The weight values are modified to be more SIMD-friendly
C. The physical memory layout changes so the kernel loads contiguous cache lines of useful data, but the logical shape and weight values stay the same