Coding-first primer
Read “Attention Is All You Need” with twelve grounded topics
When we say “Transformer”, we mean the architecture—the repeatable pattern governing how tensors move through attention blocks, stacked layers and projections, residual paths through depth, dense feed-forward sublayers, and normalisation—not training objectives, scaling folklore, or any single product lineage.
Each topic explains why this architecture succeeded—not as a string of jargon, but as a stack you can sketch on paper: layered linear projections, softmax attention, summed mixes, residual adds, and nonlinearities, all evaluated on tensors. Parameter counts, FLOPs, and loss under your training objective remain as measurable here as in any differentiable model you already optimize. Readers who routinely move tensors should see attention, positional encodings, and encoder–decoder masks as explicit statistical machinery—not magic.
Twelve topic cards—each opens the deep dive
Topic 1
Language as tensors & order
How did sequence-to-sequence machine translation set up the Transformer problem?
Encoder–decoder frames map source sentences to a latent memory that decoders consume while generating targets.
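A minimal NumPy sketch of that contract in shapes only; the sizes and the tanh "encoder" below are toy stand-ins, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 2 sentences per batch, 7 source tokens, 5 target tokens
# generated so far, model width 16. Random weights are stand-ins, not a model.
B, T_src, T_tgt, d = 2, 7, 5, 16

src = rng.normal(size=(B, T_src, d))             # embedded source tokens
memory = np.tanh(src @ rng.normal(size=(d, d)))  # "encoder": leaves a (B, T_src, d) memory

tgt = rng.normal(size=(B, T_tgt, d))             # decoder states while generating
scores = tgt @ memory.transpose(0, 2, 1)         # (B, T_tgt, T_src) alignment scores
print(scores.shape)                              # every target step can read all of memory
```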
Open deep dive →
Topic 2
Language as tensors & order
Why do models still begin with token + position vectors?
Unicode normalisation, byte-pair encoding, and SentencePiece models determine which atomic units get ids.
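Once the tokeniser has assigned ids, the model only sees integers indexing an embedding table. A minimal sketch with a made-up four-word vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary: the ids are whatever the tokeniser (BPE, SentencePiece, ...)
# assigned; the model never sees the strings again.
vocab = {"the": 0, "cat": 1, "sat": 2, "<pad>": 3}
ids = np.array([0, 1, 2])               # "the cat sat"

d = 8
E = rng.normal(size=(len(vocab), d))    # embedding table E ∈ ℝ^{|V|×d}

X = E[ids]                              # row lookup: one dense vector per token
print(X.shape)                          # (3, 8) — the X fed into attention/FFN stacks
```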
Open deep dive →
Topic 3
Language as tensors & order
Why was word order historically hard?
Bag-of-words destroys syntax: permutations become identical inputs unless you augment features.
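You can watch the destruction directly: two sentences with opposite meanings collapse to the same count vector. A tiny sketch with a hypothetical three-word vocabulary:

```python
import numpy as np

vocab = {"dog": 0, "bites": 1, "man": 2}

def bow(tokens):
    """Count vector over the vocabulary: all order information is discarded."""
    v = np.zeros(len(vocab))
    for t in tokens:
        v[vocab[t]] += 1
    return v

a = bow(["dog", "bites", "man"])
b = bow(["man", "bites", "dog"])
print(np.array_equal(a, b))  # True: opposite meanings, identical features
```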
Open deep dive →
Topic 4
Recurrence, depth, and convolutions
Why did gated RNNs precede Transformers?
Backprop through time unfolds the graph for T steps; Jacobian spectra multiply across the unroll.
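A toy demonstration of why that product is dangerous, using a linear recurrence h_t = W h_{t−1} (so the T-step gradient is literally Wᵀ applied T times); the scales 0.9 and 1.1 are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Gradient through T steps of h_t = W h_{t-1} is governed by W's spectral radius.
d, T = 32, 100
for scale in (0.9, 1.1):                                  # contractive vs expansive
    W = scale * np.linalg.qr(rng.normal(size=(d, d)))[0]  # orthogonal matrix * scale
    g = np.eye(d)
    for _ in range(T):
        g = W @ g                                         # accumulate the Jacobian product
    print(scale, np.linalg.norm(g))                       # ~0.9^100 vanishes, ~1.1^100 explodes
```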
Open deep dive →
Topic 5
Recurrence, depth, and convolutions
Convolutions stacked depth to widen context—what was missing?
Convolutional filters see only the k nearest tokens unless you deepen the network or dilate kernels.
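The receptive-field arithmetic is short enough to sketch; each 1-D conv layer with kernel size k and dilation d adds (k − 1)·d positions of context:

```python
# Receptive field of stacked 1-D convolutions: grows linearly with depth
# unless dilations grow, whereas attention reaches everything in one step.
def receptive_field(kernel, dilations):
    return 1 + sum((kernel - 1) * d for d in dilations)

k = 3
print(receptive_field(k, [1] * 6))        # 13 tokens after 6 plain layers
print(receptive_field(k, [1, 2, 4, 8]))   # 31 tokens with doubling dilation
```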
Open deep dive →
Topic 6
Attention machinery
Attention as differentiable, sparse-ish information retrieval
Alignment scores decide how strongly each encoder position participates in updating the decoder context.
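The whole mechanism fits in a few lines: score every encoder position, softmax into a distribution, take the expected value. A minimal sketch with toy sizes and random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

T, d = 5, 4
memory = rng.normal(size=(T, d))      # encoder states, one per source token
query = rng.normal(size=(d,))         # current decoder context

scores = memory @ query               # one alignment score per position
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                  # softmax: a distribution over positions

context = alpha @ memory              # Σ αᵢ vᵢ: a differentiable, soft read
print(alpha.round(2), context.shape)
```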
Open deep dive →
Topic 7
Attention machinery
Q, K, V: organising matmul-friendly attention
Queries index; keys advertise content addresses; values carry payloads mixed by weights.
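A minimal single-head sketch of softmax(Q Kᵀ / √d_k) V, with an optional boolean mask (False = forbidden, implemented as −∞ before the softmax); all sizes are toy:

```python
import numpy as np

def sdpa(Q, K, V, mask=None):
    """softmax(Q Kᵀ / √d_k) V with an optional boolean mask (False = forbidden)."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        logits = np.where(mask, logits, -np.inf)   # −∞ rows zero out under softmax
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                                   # weighted mix of value payloads

rng = np.random.default_rng(0)
T, d_k = 6, 8
Q, K, V = (rng.normal(size=(T, d_k)) for _ in range(3))
out = sdpa(Q, K, V, mask=np.tril(np.ones((T, T), dtype=bool)))  # causal mask
print(out.shape)
```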
Open deep dive →
Topic 8
Attention machinery
Why replicate attention in parallel instead of widening one head?
Attention heads specialise on syntax, lexical repetition, positional bias, and pronoun linkage—empirically not guaranteed, but frequently observed.
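Mechanically, multi-head attention is just parallel projections into narrower d_k = d_model/h subspaces, followed by concatenation and a re-projection. A minimal sketch with toy sizes and random weights (the loop stands in for what runs as one batched matmul in practice):

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_model, h = 6, 16, 4
d_k = d_model // h                             # each head works in a narrower slice

X = rng.normal(size=(T, d_model))
W_qkv = rng.normal(size=(3, h, d_model, d_k))  # per-head Q/K/V projections
W_O = rng.normal(size=(d_model, d_model))

heads = []
for i in range(h):                             # heads run in parallel in practice
    Q, K, V = (X @ W_qkv[j, i] for j in range(3))
    logits = Q @ K.T / np.sqrt(d_k)
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    heads.append(w @ V)                        # (T, d_k) per head

out = np.concatenate(heads, axis=-1) @ W_O     # concat, then re-project
print(out.shape)                               # (T, d_model)
```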
Open deep dive →
Topic 9
Full encoder–decoder shell
Adding order without resurrecting recurrence
Additive encodings broadcast vectors with distinct frequency bands across sequence positions.
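The sinusoidal table from the paper is easy to build directly; each dimension pair oscillates at its own frequency, so nearby positions get similar but distinguishable codes:

```python
import numpy as np

def sinusoidal(T, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(T)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / 10000 ** (2 * i / d)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal(T=50, d=16)
# Added (broadcast) onto the token embeddings before the first block.
print(pe.shape, pe[0, :4].round(3))
```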
Open deep dive →
Topic 10
Full encoder–decoder shell
Three attention flavours in one stack diagram
Encoder self-attention attends freely in both directions over source tokens (subject to padding masks).
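A minimal sketch of the padding-mask half of that sentence, assuming a toy batch of two sentences padded to the same length; pad positions get −∞ before the row-wise softmax, so they receive zero weight:

```python
import numpy as np

rng = np.random.default_rng(0)

# Batch of 2 source sentences, padded to T = 5; true lengths 5 and 3.
T, d = 5, 4
lengths = np.array([5, 3])
keep = np.arange(T)[None, :] < lengths[:, None]    # (2, 5) padding mask

X = rng.normal(size=(2, T, d))
logits = X @ X.transpose(0, 2, 1) / np.sqrt(d)     # encoder self-attention scores

# Forbid attending *to* pad positions: −∞ before the row-wise softmax.
logits = np.where(keep[:, None, :], logits, -np.inf)
w = np.exp(logits - logits.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)
print(w[1, 0].round(2))                            # last two weights are 0
```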
Open deep dive →
Topic 11
From the paper forward
BERT, GPT, T5… same atoms, swapped training recipes
Encoder-only Transformer stacks discard the decoder but keep bidirectional self-attention for Cloze-like denoising (BERT lineage).
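The lineages differ in what each position is allowed to see, which is just a boolean visibility matrix. A toy sketch contrasting the two patterns:

```python
import numpy as np

T = 5
# Encoder-only (BERT-style): every token may attend to every other token;
# the objective hides *inputs* (masked tokens), not attention edges.
bert_visibility = np.ones((T, T), dtype=bool)

# Decoder-only (GPT-style): position t may only attend to positions ≤ t,
# keeping next-token cross-entropy a valid autoregressive factorisation.
gpt_visibility = np.tril(np.ones((T, T), dtype=bool))

print(bert_visibility.astype(int), gpt_visibility.astype(int), sep="\n\n")
```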
Open deep dive →
Topic 12
From the paper forward
Costs, hybrids, multimodal workloads
Attention matrix materialisation dominates memory—not just FLOPs—motivating block-sparse kernels and FlashAttention tiling.
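Back-of-envelope arithmetic shows why: materialising T × T scores per head and batch element grows quadratically. A sketch with illustrative (not benchmarked) head and batch counts, at 2 bytes per score (fp16/bf16):

```python
# Memory to materialise the full attention map: T*T scores per head per batch
# element. FlashAttention avoids this by tiling and never storing the map.
def attn_map_gib(T, heads, batch, bytes_per=2):
    return T * T * heads * batch * bytes_per / 2**30

for T in (1_024, 8_192, 65_536):
    print(f"T={T:>6}: {attn_map_gib(T, heads=16, batch=8):10.2f} GiB")
# 0.25 GiB at 1k tokens, 16 GiB at 8k, 1024 GiB at 64k — memory, not FLOPs, bites first.
```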
Open deep dive →
Where the building blocks from the overview enter the roadmap
The introductory paragraphs sketch one architecture story: tensors percolate through stacked attention sublayers plus feed-forward slabs, residuals, normalisation; softmax turns scores into mixtures; positional encodings and masks inject order and causal structure. Cross-check that story here (a concept's primary topic is not the only place it appears).
| Idea | Primary topic | What to read there |
|---|---|---|
| Tensors over batch, time, width — minibatches as (batch, time, dim) grids | Topic 1: Seq2seq encoders, decoders, and why recurrence blocked parallel training | Frames why Transformers crave wide matmuls over padded sequences vs serial RNN unfolds. |
| Linear projections from vocab space — embeddings, lookup tables tied to softmax | Topic 2: Tokenisation, one-hot bottlenecks, and dense embeddings that feed every layer | Token ids → rows of E ∈ ℝ^{|V|×d}; stacking rows yields X fed into attention / FFN stacks. |
| Depth, gradients, residuals, normalisation — stabilising stacks before attention dominates | Topic 4: Gradient highways, exploding norms, and the limits recurrence hit before 2017 | Explains why additive skips / layer norms recur when depth grows—directly echoed in Transformer blocks. |
| Softmax weights as distributions; summed mixes Σ αᵢ vᵢ as differentiable pooling | Topic 6: From alignment weights to expected-value reads out of encoder memory | Turns logits into stochastic weights — the “summed mixes” intuition before Q/K/V matmul notation. |
| Scaled dot-product attention softmax(Q Kᵀ / √d_k) V + causal/padding logits | Topic 7: Scaled dot-product attention as the algebraic heart of Transformer blocks | Core attention algebra; masking uses −∞ on forbidden positions prior to softmax rows. |
| Multi-head stacking, narrower d_k = d_model/h, concatenate + W_O re-projection | Topic 8: Subspaces per head diversify relational patterns learnt jointly | Where parallel projections diversify heads before the mixer hands off to dense FFN sublayers. |
| Positional encoding — injecting order additively (learned vs sinusoidal) | Topic 9: Sinusoidal encodings versus learned embeddings and relative extensions | Break permutation symmetry so softmax attention distinguishes token order explicitly. |
| Encoder–decoder shell — MHSA · FFN · LayerNorm/residual repeats; causal + cross-attn masks | Topic 10: Encoder self-attention, masked decoder self-attention, encoder–decoder attention | Glue from Figure 1 made concrete: causal decoder masks preserve AR likelihood while cross-attn aligns languages. |
| Training objectives vs architecture — masking patterns, logits, CE gradients | Topic 11: Encoder-only masking objectives versus decoder-only next-token rollout | Same tensor stack specialised by objective (BERT-style MLM vs GPT-style causal CE vs seq2seq denoise). |
| Parameter counts · FLOPs · memory footprints — quadratic T² attention realism | Topic 12: O(n²) attention maps, approximation research, state-space resurgence, multimodal routing | Quantifies the computational story behind the “measurable loss / profiled matmul” language in the overview. |
Earlier topics usually mention these ideas too; rows point to the richest treatment in this guide.