Quiz: Architecture Variants

12 min

Module 5 Quiz

This quiz covers all of Module 5: GQA/MQA, sliding-window attention, shared-KV layers, MoE routing, MoE aggregation, and heterogeneous architectures. You need 80% or better to proceed.

Check Yourself

Q1 (reasoning)

A model has 32 Q heads and 4 KV heads (GQA) with d_head=128. What is the K cache entry shape per token per layer?

A. [32, 128]: one K per Q head
B. [4, 128]: one K per KV head, each shared across 8 Q heads
C. [1, 128]: GQA shares a single K across all heads
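The shape arithmetic in this question can be checked with a minimal sketch. It assumes a plain GQA layout in which K is cached once per KV head and the Q heads are split evenly across KV heads. The function name `k_cache_entry_shape` is hypothetical and not taken from any library.

```python
def k_cache_entry_shape(n_q_heads: int, n_kv_heads: int, d_head: int):
    """Return (K cache shape per token per layer, Q heads sharing each KV head).

    Assumption: K is stored once per KV head, never once per Q head.
    """
    if n_q_heads % n_kv_heads != 0:
        raise ValueError("n_q_heads must be a multiple of n_kv_heads")
    group_size = n_q_heads // n_kv_heads  # Q heads that read the same K
    return (n_kv_heads, d_head), group_size


shape, group_size = k_cache_entry_shape(n_q_heads=32, n_kv_heads=4, d_head=128)
print(shape, group_size)
```

Setting `n_kv_heads=1` gives the MQA case, and `n_kv_heads=n_q_heads` gives standard multi-head attention.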

Q2 (conceptual)

In MoE, what does the router produce?

A. A new hidden state for each token
B. One score per expert for each token, used to select which experts run
C. A compressed version of the KV cache
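As a rough illustration of what a router computes, here is a minimal pure-Python sketch. It assumes a single linear router followed by a softmax and top-k selection; real models differ in details such as normalization order and load-balancing terms. The names `route_token` and `router_weights` are hypothetical.

```python
import math


def route_token(hidden, router_weights, top_k=2):
    """Score every expert for one token and pick the top_k experts.

    hidden:         list of d_model floats (one token's hidden state)
    router_weights: one row of d_model floats per expert (assumed linear router)
    """
    # One logit per expert: dot product of the token with that expert's row.
    logits = [sum(h * w for h, w in zip(hidden, row)) for row in router_weights]
    # Numerically stable softmax over experts.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Indices of the top_k highest-scoring experts.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    return probs, chosen


# Toy example: 4 experts, d_model = 2.
probs, chosen = route_token(
    hidden=[1.0, 0.0],
    router_weights=[[0.1, 0.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]],
)
print(len(probs), chosen)
```

The router output has one entry per expert, not one per hidden dimension; the token's hidden state itself is left unchanged by routing.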

Q3 (math)

A model uses SWA with window size w = 512 and has 24 SWA layers stacked. Approximately how many tokens back can information theoretically reach?

A. 512 tokens: each layer sees only its own window
B. About 12,000 tokens: each layer extends the reach by w-1 through the residual stream
C. Unlimited: stacking enough layers removes the window constraint entirely
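The reach calculation can be sketched in a few lines. This assumes a causal window of w tokens that includes the current position, so each layer can look back at most w - 1 positions, and it gives a theoretical upper bound rather than a measure of how much information actually survives. The function name `swa_max_reach` is hypothetical.

```python
def swa_max_reach(window: int, n_layers: int) -> int:
    """Upper bound on how many tokens back information can propagate.

    Assumption: every layer is a sliding-window layer whose window of
    `window` tokens includes the current token, so one layer reaches
    back window - 1 positions and the reaches add up across layers.
    """
    return n_layers * (window - 1)


print(swa_max_reach(window=512, n_layers=24))  # 12264
```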

Q4 (conceptual)

Why do shared-KV layers reduce memory bandwidth during decode?

A. They skip the FFN entirely
B. They avoid loading W_K and W_V weights and writing new KV cache entries
C. They use a smaller d_model

Q5 (reasoning)

A MoE model has 64 experts with top-2 routing. During decode, how does this affect memory bandwidth compared to a dense model with the same per-token compute?

A. Same bandwidth: only 2 experts run, so the same amount of weight data is loaded
B. Higher bandwidth: all 64 experts' weights must be in memory even though only 2 run per token, and different tokens may select different experts
C. Lower bandwidth: MoE models are smaller than dense models
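To get a feel for how many experts a decode batch touches, here is a back-of-the-envelope sketch. It assumes each token picks its top-k experts uniformly at random and independently of other tokens, which real routers do not do, so treat the result as an illustration only. The function name `expected_experts_touched` is hypothetical.

```python
def expected_experts_touched(n_experts: int, top_k: int, batch_tokens: int) -> float:
    """Expected number of distinct experts selected by at least one token.

    Assumption: uniform, independent routing. A given expert is skipped by
    one token with probability 1 - top_k / n_experts.
    """
    p_skipped_by_all = (1.0 - top_k / n_experts) ** batch_tokens
    return n_experts * (1.0 - p_skipped_by_all)


for batch in (1, 8, 32, 128):
    touched = expected_experts_touched(n_experts=64, top_k=2, batch_tokens=batch)
    print(batch, round(touched, 1))
```

With a single token only 2 experts' weights are read, but as the batch grows the set of experts whose weights must be loaded in one decode step approaches all 64 under this assumption.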
