Quiz: Dense Transformer Block

Time: 10 min

This quiz covers everything from M4: layer structure, residuals, normalization, Q/K/V projections, positional encoding, causal attention, multi-head attention, FFN, and the full decoder block.

You need 80% or better to proceed.

Before the multiple-choice questions, trace the tensor shapes through one attention sub-layer. The input sequence has 20 tokens, and the model uses d_model=1024, n_heads=16, d_head=64.

Attention sub-layer (20 tokens, d_model=1024, 16 heads)
Input hidden states: [__, __]
After RMSNorm: [__, __]
Q per head (d_head=64): [__, __]
K per head (d_head=64): [__, __]
Scores per head (Q·Kᵀ): [__, __]
After W_O projection (all heads → d_model): [__, __]
After residual add: [__, __]
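
Once you have filled in the blanks, a short shape-only trace like the one below can help you check your reasoning. It is a minimal sketch, not a real attention implementation: it assumes a pre-norm layout (RMSNorm, then attention, then W_O, then the residual add) and d_head = d_model / n_heads, which holds for the numbers in this exercise. The helper names `matmul_shape` and `trace_attention_shapes` are hypothetical, and the example call uses small made-up sizes rather than the quiz values.

```python
def matmul_shape(a, b):
    """Return the shape of a @ b for 2-D shapes a = (m, k) and b = (k, n)."""
    if a[1] != b[0]:
        raise ValueError(f"inner dimensions differ: {a} @ {b}")
    return (a[0], b[1])


def trace_attention_shapes(n_tokens, d_model, n_heads):
    """Trace shapes through one attention sub-layer (hypothetical helper)."""
    if d_model % n_heads != 0:
        raise ValueError("d_model must be divisible by n_heads")
    d_head = d_model // n_heads  # assumption: heads split d_model evenly

    x = (n_tokens, d_model)                       # input hidden states
    normed = x                                    # RMSNorm rescales each row; shape unchanged
    q = matmul_shape(normed, (d_model, d_head))   # one head's W_Q
    k = matmul_shape(normed, (d_model, d_head))   # one head's W_K
    v = matmul_shape(normed, (d_model, d_head))   # one head's W_V
    scores = matmul_shape(q, (k[1], k[0]))        # Q @ K^T (K transposed)
    head_out = matmul_shape(scores, v)            # softmax keeps the shape; weights @ V
    concat = (head_out[0], n_heads * head_out[1]) # heads joined along the feature axis
    projected = matmul_shape(concat, (n_heads * d_head, d_model))  # W_O
    if projected != x:
        raise ValueError("residual add needs matching shapes")
    return {
        "input": x,
        "after_rmsnorm": normed,
        "q_per_head": q,
        "k_per_head": k,
        "scores_per_head": scores,
        "after_w_o": projected,
        "after_residual": x,  # elementwise add with the input; shape unchanged
    }


# Deliberately small, made-up sizes so the quiz answers are not given away.
for name, shape in trace_attention_shapes(n_tokens=6, d_model=32, n_heads=4).items():
    print(f"{name:16s} {list(shape)}")
```

To check your own answers, call `trace_attention_shapes` with the values from the exercise and compare each printed shape with what you wrote.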
Check Yourself
Q1 (reasoning)

A 32-layer model processes a token. After all layers, the final hidden state still contains information from the original embedding. What mechanism makes this possible?

Q2 (conceptual)

In causal attention, what does the mask prevent?

Q3 (shape)

If d_model = 512 and there are 8 attention heads, what is d_head?

Q4 (reasoning)

If you removed all attention layers but kept the FFN layers and residual connections, what would the model lose?

Q5 (conceptual)

What is the correct order of operations in a standard decoder block?