Quiz: Architecture Variants

12 min

This quiz covers all of Module 5: GQA/MQA, sliding-window attention, shared-KV layers, MoE routing, MoE aggregation, and heterogeneous architectures. You need 80% or better to proceed.

Check Yourself
Q1 (reasoning)

A model has 32 Q heads and 4 KV heads (GQA) with d_head=128. What is the K cache entry shape per token per layer?
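As a hint for setting this up, here is a minimal sketch of how the per-token, per-layer K cache entry can be sized under GQA. It assumes only the KV heads are cached and that query heads share them in groups. The function names and the example configuration are hypothetical and deliberately differ from the numbers in the question.

```python
def k_cache_entry_shape(n_kv_heads: int, d_head: int) -> tuple[int, int]:
    """Shape of the K cache entry for one token in one layer (assumed layout)."""
    # Under GQA/MQA only the KV heads are stored; the Q head count does not appear.
    return (n_kv_heads, d_head)


def k_cache_entry_bytes(n_kv_heads: int, d_head: int, bytes_per_elem: int = 2) -> int:
    """Bytes for that entry, assuming 2-byte elements (e.g. fp16/bf16) by default."""
    return n_kv_heads * d_head * bytes_per_elem


# Hypothetical config, not the one in the question: 16 Q heads, 2 KV heads, d_head=64.
example_shape = k_cache_entry_shape(n_kv_heads=2, d_head=64)
example_bytes = k_cache_entry_bytes(n_kv_heads=2, d_head=64)
```

Note that the number of Q heads is not an argument at all, which is the point the question is probing.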

Q2 (conceptual)

In MoE, what does the router produce?

Q3 (math)

A model uses SWA with window size w = 512 and has 24 SWA layers stacked. Approximately how many tokens back can information theoretically reach?
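As a hint, the usual back-of-the-envelope estimate treats each stacked SWA layer as extending the reach by roughly one window. The sketch below assumes that approximation (the exact count shifts by a token or so per layer depending on whether the window includes the current position). The function name and the example numbers are hypothetical and differ from those in the question.

```python
def swa_theoretical_reach(window: int, n_layers: int) -> int:
    """Approximate upper bound on how many tokens back information can propagate."""
    # Each layer lets a position see about `window` tokens back in the layer below,
    # so stacking n_layers layers compounds to roughly window * n_layers tokens.
    return window * n_layers


# Hypothetical config, not the one in the question: w = 1024, 8 SWA layers.
example_reach = swa_theoretical_reach(window=1024, n_layers=8)
```

This is a theoretical bound on connectivity only; it says nothing about how much information actually survives that many hops.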

Q4 (conceptual)

Why do shared-KV layers reduce memory bandwidth during decode?

Q5 (reasoning)

A MoE model has 64 experts with top-2 routing. During decode, how does this affect memory bandwidth compared to a dense model with the same per-token compute?