Capstone: From Concept to Performance Diagnosis
This capstone covers the full arc of the course: from the structure of a transformer block, through attention variants and expert routing, to performance profiling and correctness validation.
There are six questions below. Each tests a different core competency. You need to pass all of them.
A dense transformer block has a specific internal structure. You should be able to name every phase in order and explain what each one does. The pattern is two sub-blocks (attention and FFN), each wrapped in normalization and a residual connection, which gives six phases in total.
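As a rough illustration of that pattern, the sketch below walks one block in six steps. It assumes a pre-norm ordering (normalize, sub-layer, residual add), which is only one common convention; the course's own naming of the phases may differ. The names `layer_norm`, `dense_block`, `attention`, and `ffn` are hypothetical, and the sub-layers are passed in as plain callables so the wiring stays visible.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    """Element-wise sum of two vectors, used for the residual connections."""
    return [u + v for u, v in zip(a, b)]

def dense_block(tokens, attention, ffn):
    """Run one block over a list of token vectors.

    attention: maps a list of vectors to a list of vectors (mixes across tokens).
    ffn: maps a single vector to a single vector (applied per token).
    Assumes pre-norm ordering; a post-norm block would normalize after each add.
    """
    normed = [layer_norm(t) for t in tokens]              # 1. normalize
    attn_out = attention(normed)                          # 2. attention
    hidden = [add(t, a) for t, a in zip(tokens, attn_out)]  # 3. residual add
    normed = [layer_norm(h) for h in hidden]              # 4. normalize
    ffn_out = [ffn(h) for h in normed]                    # 5. feed-forward
    return [add(h, f) for h, f in zip(hidden, ffn_out)]   # 6. residual add
```

If both sub-layers return zeros, the block returns its input unchanged, which is a quick way to see what the residual connections contribute.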
A Mixture-of-Experts layer replaces the dense FFN with a router and multiple expert FFNs. You should be able to explain how the router scores experts, how top-k selection works, and how the selected expert outputs are combined.
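The routing steps can be sketched for a single token as follows. This is a minimal illustration under stated assumptions: the router is a single linear layer with no bias, and the gate weights come from a softmax over only the selected logits. Real implementations vary (some apply softmax over all experts first, and most add load-balancing terms). `moe_forward`, `router_weights`, and `experts` are hypothetical names.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, router_weights, experts, k=2):
    """Route one token vector through the top-k of len(experts) expert functions.

    router_weights: one weight row per expert (assumed linear router, no bias).
    experts: list of callables, each mapping a vector to a vector.
    Returns (output vector, selected expert indices, gate weights).
    """
    # Router score for each expert: dot product of its weight row with the token.
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    # Top-k selection: indices of the k highest-scoring experts.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Assumed convention: renormalize over the selected experts only.
    gates = softmax([logits[i] for i in top])
    # Combine: gate-weighted sum of the selected experts' outputs.
    out = [0.0] * len(token)
    for gate, i in zip(gates, top):
        expert_out = experts[i](token)
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out, top, gates
```

Only the k selected experts are evaluated; the remaining experts do no work for this token.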
Transformers have no inherent notion of position. You should be able to explain how positional information enters the attention computation — whether through absolute embeddings, relative encodings, or rotary position embeddings (RoPE) — and why it matters for the model to know token order.
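For the RoPE case, a small sketch may help show where position enters. It assumes the interleaved-pair layout (dimensions 0 and 1 form a pair, 2 and 3 the next, and so on) and the commonly used base of 10000; other layouts exist. `rope` and `dot` are hypothetical helper names.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair (vec[i], vec[i+1]) by an angle proportional to position.

    Pair j uses frequency base ** (-2j / d); with i = 2j this is base ** (-i / d).
    Assumes the interleaved-pair layout and an even dimension.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    """Plain dot product, standing in for one query-key attention score."""
    return sum(u * v for u, v in zip(a, b))
```

Applying `rope` to a query at position m and a key at position n gives a score `dot(rope(q, m), rope(k, n))` that depends only on the offset m - n, so attention sees relative order even though nothing is added to the token embeddings.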
Inference has two distinct phases with different computational characteristics. You should be able to explain what happens during prefill versus decode, why they have different bottlenecks, and how batch size (number of tokens processed at once) determines whether the system is compute-bound or memory-bound.
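One way to make the compute-bound versus memory-bound distinction concrete is to estimate arithmetic intensity (FLOPs per byte moved) for a single weight matrix. The sketch below is a simplified roofline-style estimate, assuming 2-byte values, counting only one read of the weights plus the input and output activations, and ignoring caches and the KV cache. The function names and any hardware numbers passed in are hypothetical.

```python
def arithmetic_intensity(num_tokens, d_in, d_out, bytes_per_value=2):
    """FLOPs per byte for multiplying num_tokens activations by a d_in x d_out weight matrix."""
    flops = 2 * num_tokens * d_in * d_out  # one multiply and one add per weight per token
    weight_bytes = bytes_per_value * d_in * d_out
    activation_bytes = bytes_per_value * num_tokens * (d_in + d_out)
    return flops / (weight_bytes + activation_bytes)

def classify(num_tokens, d_in, d_out, peak_flops, mem_bandwidth, bytes_per_value=2):
    """Compare intensity with the hardware ridge point (peak FLOP/s divided by bytes/s)."""
    ridge = peak_flops / mem_bandwidth
    intensity = arithmetic_intensity(num_tokens, d_in, d_out, bytes_per_value)
    return "compute-bound" if intensity >= ridge else "memory-bound"
```

With a 4096 x 4096 matrix, one token (a decode step at batch size 1) gives an intensity just under 1 FLOP per byte, while 2048 tokens (a prefill) gives 1024 under these assumptions. The weights are read once either way, so more tokens per pass means more work per byte.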
Given a profile showing operator-level time distribution, you should be able to identify the bottleneck class and propose a plausible diagnosis. The key is to read the distribution, not just the hottest function.
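Reading the distribution rather than the hottest function can be sketched as grouping operators by bottleneck class and summing their shares. The mapping in `OP_CLASS` is an assumption made for illustration; real profiler output uses stack-specific operator names, and the right grouping depends on the hardware and kernels involved. `summarize_profile` is a hypothetical helper.

```python
# Assumed mapping from operator name to bottleneck class (illustrative only).
OP_CLASS = {
    "GEMM": "compute",           # large matrix-matrix multiplies
    "GEMV": "memory",            # matrix-vector multiplies stream weights once per token
    "KV cache reads": "memory",
    "softmax": "memory",         # element-wise and reduction ops are usually bandwidth-limited
    "overhead": "overhead",      # launch, scheduling, and framework time
}

def summarize_profile(profile):
    """Sum time shares per bottleneck class.

    profile: dict mapping operator name to percent of total time.
    Returns (dominant class, dict of class -> total percent).
    """
    totals = {}
    for op, pct in profile.items():
        cls = OP_CLASS.get(op, "unknown")
        totals[cls] = totals.get(cls, 0.0) + pct
    dominant = max(totals, key=totals.get)
    return dominant, totals
```

The per-class totals matter as much as the winner: a large secondary share, such as overhead, can point to a second problem that the single hottest operator hides.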
Speed without correctness is a bug, not an optimization. You should be able to explain why correctness validation is required after any optimization and how to detect a regression.
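A regression check of this kind might look like the sketch below: compute perplexity on the same evaluation text before and after the change, then compare. The 1% relative tolerance is a hypothetical threshold chosen for illustration, not a recommended value, and `perplexity` and `is_regression` are hypothetical names. The check flags deviations in either direction, on the assumption that any large shift means the numerics changed.

```python
import math

def perplexity(token_log_probs):
    """Perplexity from per-token natural-log probabilities: exp of the mean negative log-likelihood."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def is_regression(baseline_ppl, optimized_ppl, rel_tol=0.01):
    """Flag the optimized build if its perplexity differs from baseline by more than rel_tol.

    rel_tol is a hypothetical threshold; a suitable value depends on the model,
    the evaluation set, and the precision of the kernels being compared.
    """
    return abs(optimized_ppl - baseline_ppl) / baseline_ppl > rel_tol
```

A kernel can compile, run, and return well-formed tensors while still producing slightly wrong values, so a numeric comparison against a known-good baseline is what actually catches the regression.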
What are the six phases of a dense transformer block, in order?
In a MoE layer with 8 experts and top-2 routing, how is the final output for a token produced?
Why does attention need positional information, and how does RoPE provide it?
Why is prefill typically compute-bound while decode is typically memory-bound, even though both phases run the same model?
A decode profile shows: GEMV 55%, KV cache reads 20%, softmax 3%, overhead 22%. What is the most likely bottleneck?
Why is a perplexity check required after a kernel optimization, even if the code compiles and runs without errors?