Quiz: Dense Transformer Block
This quiz covers everything from M4: layer structure, residuals, normalization, Q/K/V projections, positional encoding, causal attention, multi-head attention, FFN, and the full decoder block.
You need 80% or better to proceed.
Before the multiple-choice questions, trace the tensor shapes through one attention sub-layer. The input sequence has 20 tokens, and the model uses d_model=1024, n_heads=16, d_head=64.
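To check a hand trace, the shape bookkeeping can be sketched in a few lines of plain Python. This is a minimal sketch, not a model implementation. It assumes a single sequence with no batch dimension, and a common layout in which Q, K, and V are each projected from d_model to n_heads * d_head and then split into heads. The function name `trace_attention_shapes` and the step labels are hypothetical and chosen for illustration only.

```python
def trace_attention_shapes(seq_len, d_model, n_heads, d_head):
    """Return (step, shape) pairs for one multi-head self-attention sub-layer.

    Assumptions (not stated in the quiz): no batch dimension, and Q, K, V
    are each projected from d_model to n_heads * d_head before the split.
    """
    inner = n_heads * d_head  # total width across all heads
    return [
        ("input x", (seq_len, d_model)),
        ("Q, K, V after projection", (seq_len, inner)),
        ("Q, K, V split into heads", (n_heads, seq_len, d_head)),
        ("attention scores Q @ K^T", (n_heads, seq_len, seq_len)),
        ("softmax weights", (n_heads, seq_len, seq_len)),
        ("weights @ V", (n_heads, seq_len, d_head)),
        ("heads concatenated", (seq_len, inner)),
        ("output projection", (seq_len, d_model)),
    ]


# Print the trace for the configuration given in the exercise.
for step, shape in trace_attention_shapes(seq_len=20, d_model=1024, n_heads=16, d_head=64):
    print(f"{step:28s} {shape}")
```

The sub-layer's output has the same shape as its input, which is what allows it to be added back onto the residual stream.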
A 32-layer model processes a token. After all layers, the final hidden state still contains information from the original embedding. What mechanism makes this possible?
In causal attention, what does the mask prevent?
If d_model = 512 and there are 8 attention heads, what is d_head?
If you removed all attention layers but kept the FFN layers and residual connections, what would the model lose?
What is the correct order of operations in a standard decoder block?