Quiz: Performance Reasoning
This quiz covers everything from M7: operator time breakdown, GEMM vs GEMV, compute-bound vs memory-bound analysis, quantization, memory layout and repacking, thread count and affinity, and validation of speedups.
You need 80% or better to proceed.
An operation loads a 4096×4096 FP16 weight matrix (32 MiB) and multiplies it by a single vector. It performs ~33.5M FLOPs (one multiply and one add per weight). The hardware has a machine balance of 100 FLOP/byte. Is this operation compute-bound or memory-bound?
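One way to reason about questions of this kind is to compare the operation's arithmetic intensity (FLOPs per byte moved) with the machine balance. The sketch below is illustrative only: the helper names are hypothetical, and it assumes the weight matrix dominates memory traffic, so the input and output vectors are ignored.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


def classify(intensity: float, machine_balance: float) -> str:
    """Below the machine balance the memory system is the limit; at or above it, the ALUs are."""
    return "compute-bound" if intensity >= machine_balance else "memory-bound"


rows = cols = 4096
bytes_per_weight = 2                            # FP16
weight_bytes = rows * cols * bytes_per_weight   # 33,554,432 bytes (32 MiB)
flops = 2 * rows * cols                         # one multiply + one add per weight

# Assumption: vector traffic is negligible next to the weight matrix.
intensity = arithmetic_intensity(flops, weight_bytes)
verdict = classify(intensity, machine_balance=100.0)

print(f"intensity = {intensity:.2f} FLOP/byte -> {verdict}")
```

The same two functions apply to any operator once its FLOP count and byte traffic are known.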
After quantizing a model from FP16 to Q8_0, you measure that decode speed improved by 40% but perplexity increased from 5.8 to 5.9. Is this a valid optimization?
Decode processes one token per step, so its weight projections are GEMV operations. An engineer adds a GEMM tiling optimization that improves large-matrix throughput by 30%. Which phase benefits more, prefill or decode?
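The difference between the two phases can be made concrete by computing arithmetic intensity as a function of how many tokens share one pass over the weights. This is a rough sketch under stated assumptions: a square FP16 projection of width 4096, a hypothetical prefill batch of 512 tokens, and memory traffic counted as weights plus input and output activations only.

```python
def projection_intensity(d: int, n_tokens: int, bytes_per_value: int = 2) -> float:
    """FLOP/byte for a [d, d] weight matrix applied to n_tokens activation vectors."""
    flops = 2 * d * d * n_tokens                         # multiply + add per weight per token
    weight_bytes = d * d * bytes_per_value               # weights are read once
    activation_bytes = 2 * d * n_tokens * bytes_per_value  # inputs read + outputs written
    return flops / (weight_bytes + activation_bytes)


d = 4096
decode_intensity = projection_intensity(d, n_tokens=1)     # GEMV: one token per step
prefill_intensity = projection_intensity(d, n_tokens=512)  # GEMM: hypothetical 512-token prompt

print(f"decode  (n=1):   {decode_intensity:.2f} FLOP/byte")
print(f"prefill (n=512): {prefill_intensity:.1f} FLOP/byte")
```

With one token, each weight is loaded for a single multiply-add; with many tokens, the same loaded weight is reused across the batch, which is the reuse that tiling is designed to exploit.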
A model is quantized from FP16 to Q4_0. Decode speed improves by 50%, but perplexity on wikitext-2 increases from 5.8 to 7.2. What should you conclude?
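When weighing a speedup against a quality change, it can help to put both on the same relative scale. The snippet below only computes the relative perplexity increase for the two quantization scenarios in this quiz; the function name is hypothetical, and what counts as an acceptable increase depends on the application and is deliberately not encoded here.

```python
def relative_change(before: float, after: float) -> float:
    """Fractional change from `before` to `after` (0.10 means +10%)."""
    return (after - before) / before


# Perplexity figures taken from the Q8_0 and Q4_0 questions.
q8_ppl_increase = relative_change(5.8, 5.9)
q4_ppl_increase = relative_change(5.8, 7.2)

print(f"Q8_0: perplexity +{q8_ppl_increase:.1%} for a 40% decode speedup")
print(f"Q4_0: perplexity +{q4_ppl_increase:.1%} for a 50% decode speedup")
```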
A weight matrix has logical shape [4096, 4096]. After repacking for a SIMD kernel, what changes?
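To illustrate what repacking does and does not touch, here is a toy sketch in plain Python. It is not the layout of any particular kernel: the block size of 2 rows and both function names are assumptions chosen so the result is easy to read. The matrix is stored as a flat row-major list, then reordered so that each group of `block_rows` rows is interleaved column by column, which is the kind of ordering that lets a SIMD kernel load adjacent values for several rows at once.

```python
def repack_row_blocks(flat, rows, cols, block_rows):
    """Reorder a row-major matrix so each block of rows is stored column-interleaved."""
    assert rows % block_rows == 0, "rows must be a multiple of block_rows"
    packed = []
    for r0 in range(0, rows, block_rows):          # one block of rows at a time
        for c in range(cols):                      # walk across the columns
            for r in range(r0, r0 + block_rows):   # emit that column for each row in the block
                packed.append(flat[r * cols + c])
    return packed


def packed_index(r, c, cols, block_rows):
    """Physical offset of logical element (r, c) in the repacked buffer."""
    block = r // block_rows
    return block * block_rows * cols + c * block_rows + (r % block_rows)


rows, cols, block_rows = 4, 4, 2        # toy sizes; a real matrix would be e.g. 4096 x 4096
flat = list(range(rows * cols))         # row-major storage, element (r, c) holds r * cols + c
packed = repack_row_blocks(flat, rows, cols, block_rows)

print(packed)  # [0, 4, 1, 5, 2, 6, 3, 7, 8, 12, 9, 13, 10, 14, 11, 15]
```

The buffer holds the same values and the same number of elements; only the mapping from a logical (row, column) pair to a physical offset is different, which is why a kernel must know which layout it is reading.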