Q-M5
Quiz: Architecture Variants
Module 5 Quiz
This quiz covers all of Module 5: GQA/MQA, sliding-window attention, shared-KV layers, MoE routing, MoE aggregation, and heterogeneous architectures. You need 80% or better to proceed.
Check Yourself
A model has 32 Q heads and 4 KV heads (GQA) with d_head=128. What is the K cache entry shape per token per layer?
In MoE, what does the router produce?
A model stacks 24 SWA layers, each with window size w = 512. Approximately how many tokens back can information theoretically reach?
Why do shared-KV layers reduce memory bandwidth during decode?
A MoE model has 64 experts with top-2 routing. During decode, how does this affect memory bandwidth compared to a dense model with the same per-token compute?