Quiz: Dense Transformer Block

10 min

This quiz covers everything from M4: layer structure, residuals, normalization, Q/K/V projections, positional encoding, causal attention, multi-head attention, FFN, and the full decoder block.

You need 80% or better to proceed.

Shape Trace

Before the multiple-choice questions, trace the tensor shapes through one attention sub-layer. The input sequence has 20 tokens, and the model uses d_model=1024, n_heads=16, d_head=64.

Attention sub-layer (20 tokens, d_model=1024, 16 heads)

Input hidden states

[ ___ , ___ ]

→ After RMSNorm

[ ___ , ___ ]

→ Q per head (d_head=64)

[ ___ , ___ ]

→ K per head (d_head=64)

[ ___ , ___ ]

→ Scores per head (Q·Kᵀ)

[ ___ , ___ ]

→ After W_O projection (all heads → d_model)

[ ___ , ___ ]

→ After residual add

[ ___ , ___ ]
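
To check your shape reasoning on a different configuration, the same trace can be sketched in plain Python. This is a minimal sketch rather than a reference implementation: it assumes a pre-norm sub-layer in the order shown above, uses random weights, omits the learned RMSNorm gain, and uses hypothetical toy sizes (6 tokens, d_model=32, 4 heads) that differ from the quiz. The helper names, including `trace_attention_shapes`, are made up for illustration.

```python
import math
import random

# Hypothetical toy sizes, chosen to differ from the quiz configuration.
N_TOKENS, D_MODEL, N_HEADS = 6, 32, 4
D_HEAD = D_MODEL // N_HEADS  # per-head width


def shape(m):
    """Return (rows, cols) of a list-of-lists matrix."""
    return (len(m), len(m[0]))


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def rms_norm(x, eps=1e-6):
    # Divide each token vector by its root-mean-square (learned gain omitted).
    out = []
    for row in x:
        rms = math.sqrt(sum(v * v for v in row) / len(row) + eps)
        out.append([v / rms for v in row])
    return out


def causal_softmax(scores):
    # Row i is normalized over columns 0..i only; later columns get weight 0.
    out = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]
        peak = max(visible)
        exps = [math.exp(v - peak) for v in visible]
        total = sum(exps)
        out.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return out


def trace_attention_shapes(seed=0):
    """Run one attention sub-layer and record the shape at each stage."""
    rng = random.Random(seed)
    shapes = {}

    x = rand_matrix(N_TOKENS, D_MODEL, rng)
    shapes["input"] = shape(x)
    h = rms_norm(x)
    shapes["after_rmsnorm"] = shape(h)

    head_outputs = []
    for _ in range(N_HEADS):
        q = matmul(h, rand_matrix(D_MODEL, D_HEAD, rng))
        k = matmul(h, rand_matrix(D_MODEL, D_HEAD, rng))
        v = matmul(h, rand_matrix(D_MODEL, D_HEAD, rng))
        scores = matmul(q, transpose(k))
        scores = [[s / math.sqrt(D_HEAD) for s in row] for row in scores]
        head_outputs.append(matmul(causal_softmax(scores), v))
        shapes["q_per_head"] = shape(q)
        shapes["k_per_head"] = shape(k)
        shapes["scores_per_head"] = shape(scores)

    # Concatenate the heads along the feature axis, then project with W_O.
    concat = [sum((ho[t] for ho in head_outputs), []) for t in range(N_TOKENS)]
    attn_out = matmul(concat, rand_matrix(D_MODEL, D_MODEL, rng))
    shapes["after_w_o"] = shape(attn_out)

    # Residual add: the sub-layer output is added to its un-normalized input.
    out = [[a + b for a, b in zip(rx, ra)] for rx, ra in zip(x, attn_out)]
    shapes["after_residual"] = shape(out)
    return shapes


for name, dims in trace_attention_shapes().items():
    print(f"{name:>16}: {dims}")
```

Changing `N_TOKENS`, `D_MODEL`, and `N_HEADS` to other values is a quick way to see which dimensions depend on sequence length and which depend on the model width.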

Check Yourself

Q1 (reasoning)

A 32-layer model processes a token. After all layers, the final hidden state still contains information from the original embedding. What mechanism makes this possible?

RMSNorm preserves the original values by normalizing scale

Residual connections: each layer adds its transform to the running sum, so original information carries through

The embedding is re-read at every layer from a separate cache

Q2 (conceptual)

In causal attention, what does the mask prevent?

Tokens from attending to themselves

Tokens from attending to future positions

Tokens from attending to the first token

Q3 (shape)

If d_model = 512 and there are 8 attention heads, what is d_head?

64

512

4096

Q4 (reasoning)

If you removed all attention layers but kept the FFN layers and residual connections, what would the model lose?

The ability to change the dimension of hidden states

The ability for tokens to influence each other: each token would be processed in isolation

The ability to apply nonlinear transformations

Q5 (conceptual)

What is the correct order of operations in a standard decoder block?

FFN → Attention → Norm → Residual

Norm → Attention → Residual → Norm → FFN → Residual

Attention → FFN → Norm → Residual
