Topic 10
Encoder self-attention, masked decoder self-attention, encoder–decoder attention
Figure 1 neatly annotates the overlapping blocks: in inference code the three attention types branch only on the mask tensors they receive while sharing identical scaled dot-product kernels.
Math & statistics used here
- Causal mask enforces a lower-triangular attention pattern; in the matrix view, softmax is applied only over the allowed columns.
- Padding masks exclude keys that are not real tokens, equivalent to setting the logits of those columns to −∞ before the softmax (see the mask sketch after this list).
- Cross-attention takes Q from the decoder and K/V from the encoder; shapes must align on d_k for QKᵀ.
- Autoregressive likelihood factorisation p(y|x) = ∏_t p(y_t | y_<t, x) is strict probability: no peeking at future y_t when training with teacher forcing.
- Pre-norm Transformer update (conceptual): x ← x + Sublayer(LayerNorm(x)), once for MHSA and again for the FFN; this keeps softmax logits in range while residuals carry gradients across depth.
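A minimal PyTorch sketch of both mask types applied as −∞ logits before the softmax; the single head, the missing projections, and `pad_id` are illustrative simplifications, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def causal_mask(T: int) -> torch.Tensor:
    # Lower-triangular boolean matrix: query i may attend to keys j <= i.
    return torch.tril(torch.ones(T, T, dtype=torch.bool))

def padding_mask(tokens: torch.Tensor, pad_id: int) -> torch.Tensor:
    # True where the key is a real token; shape (B, 1, T_k) so it broadcasts over queries.
    return (tokens != pad_id).unsqueeze(1)

def masked_attention(q, k, v, mask):
    # Disallowed positions get -inf logits, so softmax assigns them exactly zero weight.
    d_k = q.size(-1)
    logits = (q @ k.transpose(-2, -1)) / d_k ** 0.5         # (B, T_q, T_k)
    logits = logits.masked_fill(~mask, float("-inf"))
    return F.softmax(logits, dim=-1) @ v

B, T, d = 2, 5, 8
tokens = torch.tensor([[3, 4, 5, 0, 0], [7, 8, 9, 1, 2]])   # pad_id = 0 (illustrative)
x = torch.randn(B, T, d)
# Decoder-style self-attention: causal AND padding constraints combined by broadcasting.
mask = causal_mask(T) & padding_mask(tokens, pad_id=0)      # (B, T, T)
out = masked_attention(x, x, x, mask)                       # (B, T, d)
```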
Checklist you can map to code
- Encoder self-attention attends left-right freely over source tokens (subject to padding masks).
- Decoder self-attention masks future targets to preserve autoregressive likelihood during training.
- Encoder–decoder attention lets decoder queries inspect the final encoder representations; alignment lives here, analogous to the attention in older RNN encoder–decoder hybrids.
- Layer norm placements (pre/post) subtly shift optimisation; implementations matter for mixed precision convergence.
- KV caching during autoregressive inference stores the frozen encoder K/V once and grows the decoder K/V incrementally across steps, so nothing is recomputed; it is an engineering detail absent from the paper yet vital for latency (sketched below).
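A sketch of the cache mechanics for one decoder self-attention step, assuming single-head attention and a plain dict as the cache; `w_q`, `w_k`, `w_v` are hypothetical projection matrices.

```python
import torch

def cached_self_attention_step(x_new, w_q, w_k, w_v, cache):
    """One decoder self-attention step for a single new token.

    x_new: (B, 1, d) representation of the latest token only.
    cache: dict holding previously computed "k" and "v", each (B, T_past, d), or empty.
    """
    q = x_new @ w_q                          # query only for the new position
    k_new, v_new = x_new @ w_k, x_new @ w_v
    if "k" in cache:                         # append instead of recomputing history
        k = torch.cat([cache["k"], k_new], dim=1)
        v = torch.cat([cache["v"], v_new], dim=1)
    else:
        k, v = k_new, v_new
    cache["k"], cache["v"] = k, v
    # No causal mask needed here: the cache only ever contains positions <= current.
    attn = torch.softmax(q @ k.transpose(-2, -1) / k.size(-1) ** 0.5, dim=-1)
    return attn @ v, cache
```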
Encoder stacks build deeply contextual token representations free of any causal restriction, which is beneficial for encoding the semantics of an entire French sentence before emitting English left-to-right. Each layer alternates multi-head self-attention with a position-wise two-layer feed-forward network plus residuals; think ‘token mixer’ then ‘channel mixer’, analogous to the ConvNeXt narration.
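A pre-norm encoder layer sketched with stock `torch.nn` modules (the paper itself uses post-norm); the hyperparameter defaults mirror the base model, everything else is an assumption.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, x, key_padding_mask=None):
        # Token mixer: bidirectional self-attention over the source.
        # key_padding_mask follows the torch convention: True marks padding keys to ignore.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, key_padding_mask=key_padding_mask)
        x = x + attn_out
        # Channel mixer: position-wise two-layer MLP, applied independently per token.
        x = x + self.ffn(self.norm2(x))
        return x
```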
Decoder stacks intertwine causal masks with cross-attention: the causal mask ensures token i cannot peek at targets j > i when maximising the teacher-forced likelihood (equivalently, minimising the negative log-likelihood); cross-attention attends over the full encoder output because the source context is entirely observable when translating.
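A matching pre-norm decoder layer under the same assumptions, showing where the causal mask and the cross-attention over the encoder output (`memory`) enter; note that `nn.MultiheadAttention` expects `True` in `attn_mask` to mark positions that are *not* allowed to attend.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm3 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

    def forward(self, y, memory, tgt_causal_mask, memory_padding_mask=None):
        # Masked self-attention: target position i may not attend to positions j > i.
        h = self.norm1(y)
        y = y + self.self_attn(h, h, h, attn_mask=tgt_causal_mask)[0]
        # Cross-attention: queries from the decoder, keys/values from the encoder output.
        h = self.norm2(y)
        y = y + self.cross_attn(h, memory, memory, key_padding_mask=memory_padding_mask)[0]
        return y + self.ffn(self.norm3(y))

# Causal mask in the form nn.MultiheadAttention expects: True marks disallowed positions.
T = 7
tgt_causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
```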
Framework code typically composes boolean masks that broadcast over the attention logits; the same scaled dot-product CUDA kernels ship in causal and full-pattern specialisations, worth studying when optimising GPT-style versus BERT-style workloads.
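A sketch of that kernel-selection angle using `torch.nn.functional.scaled_dot_product_attention` (PyTorch ≥ 2.0 assumed); the same call serves full attention, causal attention, and arbitrary broadcast boolean masks.

```python
import torch
import torch.nn.functional as F

B, H, T, d = 2, 8, 16, 64
q = torch.randn(B, H, T, d)
k = torch.randn(B, H, T, d)
v = torch.randn(B, H, T, d)

full_out = F.scaled_dot_product_attention(q, k, v)                    # BERT-style, no mask
causal_out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # GPT-style decoder

# Arbitrary boolean masks broadcast over the (B, H, T_q, T_k) logit shape instead.
# For this function, True means the position may take part in attention.
key_padding = torch.ones(B, 1, 1, T, dtype=torch.bool)
mixed_out = F.scaled_dot_product_attention(q, k, v, attn_mask=key_padding)
```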
Inference diverges dramatically: greedy decoding feeds back previously generated tokens one step at a time; KV caches store prior keys/values to avoid quadratic recomputation, a concept already implicit in the architecture's behaviour even though later papers formalise cache paging for long chats.
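A greedy-decoding skeleton showing where the cache threads through the loop; `decoder_step`, `embed`, and `lm_head` are hypothetical callables standing in for a full decoder stack, with `decoder_step` behaving like the cached attention step sketched earlier.

```python
import torch

def greedy_decode(decoder_step, embed, lm_head, encoder_out, bos_id, eos_id, max_len=64):
    """decoder_step(x_new, encoder_out, cache) -> (hidden for the new position, cache)."""
    cache = {}
    token = torch.tensor([[bos_id]])
    generated = [bos_id]
    for _ in range(max_len):
        x_new = embed(token)                              # (1, 1, d_model): newest token only
        hidden, cache = decoder_step(x_new, encoder_out, cache)
        next_id = lm_head(hidden[:, -1]).argmax(dim=-1)   # greedy: pick the top logit
        generated.append(int(next_id))
        if int(next_id) == eos_id:
            break
        token = next_id.view(1, 1)                        # only the new token is re-embedded
    return generated
```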
Residual plus layer-norm interplay is what lets deep stacks converge; mishandling half-precision casts around the softmax sometimes creates NaNs that first surface in decoder masking corner cases, an engineering war story mirroring the gradient-clipping days of recurrence.
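A toy reproduction of one such corner case: a row whose every logit is −∞ (for example a pure-padding query position) turns the softmax into 0/0 regardless of precision, and half precision makes finite-fill workarounds trickier because its minimum is only −65504. The mitigations shown are common practice, not from the paper.

```python
import torch

logits = torch.full((1, 4), float("-inf"), dtype=torch.float16)  # entire row masked out
print(torch.softmax(logits, dim=-1))        # tensor of NaNs: exp(-inf) / sum == 0 / 0

# Common mitigations: keep the softmax in float32, fill with a large finite negative
# value such as torch.finfo(torch.float16).min, or zero the offending rows afterwards.
safe = torch.full((1, 4), torch.finfo(torch.float16).min, dtype=torch.float16)
print(torch.softmax(safe.float(), dim=-1))  # uniform weights instead of NaN
```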
Bridge to speech or vision hybrids: encoder–decoder survives in Whisper-like speech translation where acoustic encoders mimic text encoders—with identical attention plumbing sans recurrence.