Coding-first primer

Read Attention Is All You Need with twelve grounded topics

When we say “Transformer”, we mean the architecture—the repeatable pattern governing how tensors move through attention blocks, stacked layers and projections, residual paths through depth, dense feed-forward sublayers, and normalisation—not training objectives, scaling folklore, or any single product lineage.

Each topic explains why this architecture succeeded—not as a string of jargon, but as a stack you can sketch on paper: layered linear projections, softmax attention, summed mixes, residual adds, and nonlinearities, all evaluated on tensors. Parameter counts, FLOPs, and loss under your training objective remain as measurable here as in any differentiable model you already optimize. Readers who routinely move tensors should see attention, positional encodings, and encoder–decoder masks as explicit statistical machinery—not magic.
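To keep that sketch honest, here is a minimal numpy forward pass through one block in the paper's post-norm arrangement, LayerNorm(x + Sublayer(x)). The sizes, random weights, and helper names (layer_norm, softmax) are illustrative assumptions, not the paper's 512-wide configuration; the point is only that every step is a plain tensor operation.

```python
import numpy as np

# Illustrative sizes (assumptions, not the paper's d_model=512 / d_ff=2048).
T, d_model, d_ff = 5, 8, 32
rng = np.random.default_rng(0)

def layer_norm(x, eps=1e-5):
    """Normalise each position's feature vector to zero mean, unit variance."""
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(-1, keepdims=True)   # subtract row max for numerical stability
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

# One (T, d_model) "sentence" of activations plus random projection weights.
x = rng.normal(size=(T, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * d_model**-0.5 for _ in range(4))
W1, b1 = rng.normal(size=(d_model, d_ff)) * d_model**-0.5, np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)) * d_ff**-0.5, np.zeros(d_model)

# Single-head self-attention: linear projections, scaled dot-product, softmax mix.
Q, K, V = x @ Wq, x @ Wk, x @ Wv
alpha = softmax(Q @ K.T / np.sqrt(d_model))     # (T, T); each row is a distribution
attn = (alpha @ V) @ Wo                          # summed mix of value rows, then output projection

# Post-norm residual wiring, as in the original paper: LayerNorm(x + Sublayer(x)).
x = layer_norm(x + attn)
ffn = np.maximum(0.0, x @ W1 + b1) @ W2 + b2     # position-wise FFN with ReLU
x = layer_norm(x + ffn)

print(x.shape)  # (5, 8): the same (time, width) grid in and out
```

Swapping the toy sizes for the paper's d_model = 512 and d_ff = 2048 changes nothing structural, only the parameter count.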

Twelve topic cards—each opens the deep dive

Where the building blocks from the overview enter the roadmap

The introductory paragraphs sketch one architecture story: tensors percolate through stacked attention sublayers and feed-forward slabs, with residuals and normalisation; softmax turns scores into mixtures; positional encodings and masks inject order and causal structure. Cross-check that story against the table below; a concept's primary topic is not the only place it appears later.

| Idea | Primary topic | What to read there |
| --- | --- | --- |
| Tensors over batch, time, width: minibatches as (batch, time, dim) grids | Topic 1: Seq2seq encoders, decoders, and why recurrence blocked parallel training | Frames why Transformers crave wide matmuls over padded sequences instead of serial RNN unrolls. |
| Linear projections from vocab space: embeddings, lookup tables tied to softmax | Topic 2: Tokenisation, one-hot bottlenecks, and dense embeddings that feed every layer | Token ids select rows of E ∈ ℝ^{\|V\|×d}; stacking those rows yields the X fed into the attention/FFN stacks (sketched below). |
| Depth, gradients, residuals, normalisation: stabilising stacks before attention dominates | Topic 4: Gradient highways, exploding norms, and the limits recurrence hit before 2017 | Explains why additive skips and layer norms recur as depth grows; directly echoed in Transformer blocks. |
| Softmax weights as distributions; summed mixes Σ αᵢ vᵢ as differentiable pooling | Topic 6: From alignment weights to expected-value reads out of encoder memory | Turns logits into stochastic weights: the "summed mixes" intuition before Q/K/V matmul notation (sketched below). |
| Scaled dot-product attention softmax(QKᵀ/√d_k)V, with causal/padding masks added to the logits | Topic 7: Scaled dot-product attention as the algebraic heart of Transformer blocks | Core attention algebra; masking sets forbidden positions to −∞ before the softmax rows (sketched below). |
| Multi-head stacking: narrower d_k = d_model/h per head, concatenate, re-project with W_O | Topic 8: Subspaces per head diversify relational patterns learnt jointly | Where parallel projections diversify heads before the mixer hands off to dense FFN sublayers (sketched below). |
| Positional encoding: injecting order additively (learned vs sinusoidal) | Topic 9: Sinusoidal encodings versus learned embeddings and relative extensions | Breaks permutation symmetry so softmax attention distinguishes token order explicitly (sketched below). |
| Encoder–decoder shell: MHSA, FFN, LayerNorm/residual repeats; causal and cross-attention masks | Topic 10: Encoder self-attention, masked decoder self-attention, encoder–decoder attention | The glue from Figure 1 made concrete: causal decoder masks preserve the autoregressive likelihood while cross-attention aligns languages (sketched below). |
| Training objectives vs architecture: masking patterns, logits, CE gradients | Topic 11: Encoder-only masking objectives versus decoder-only next-token rollout | The same tensor stack specialised by objective: BERT-style MLM vs GPT-style causal CE vs seq2seq denoising (sketched below). |
| Parameter counts, FLOPs, memory footprints: quadratic T² attention realism | Topic 12: O(T²) attention maps, approximation research, state-space resurgence, multimodal routing | Quantifies the overview's claim that parameter counts, FLOPs, and loss stay measurable (sketched below). |

Earlier topics usually mention these ideas too; each row points to the richest treatment in this guide.
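The sketches flagged in the table follow, in topic order. First, the embedding lookup from the Topic 2 row: the toy vocabulary, width, and seed are assumptions, but the mechanism is exactly row indexing into E.

```python
import numpy as np

# Toy vocabulary and embedding table E with shape (|V|, d); the values are
# random stand-ins for learned parameters.
vocab = {"<pad>": 0, "the": 1, "cat": 2, "sat": 3}
d = 4
E = np.random.default_rng(1).normal(size=(len(vocab), d))

token_ids = [vocab["the"], vocab["cat"], vocab["sat"]]
X = E[token_ids]           # row lookup: integer ids index rows of E
print(X.shape)             # (3, 4): one d-dimensional row per token
```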
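Next, the Topic 6 and Topic 7 rows in one piece: softmax turns scaled dot-product scores into per-row distributions, the output is the summed mix Σ αᵢ vᵢ, and a causal mask is just −∞ added to forbidden logits before the softmax. The sizes and the softmax helper are assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

T, d_k = 4, 8
rng = np.random.default_rng(2)
Q, K, V = (rng.normal(size=(T, d_k)) for _ in range(3))

scores = Q @ K.T / np.sqrt(d_k)             # (T, T) similarity logits
causal = np.triu(np.ones((T, T), bool), 1)  # True above the diagonal = the future
scores = np.where(causal, -np.inf, scores)  # forbid attending to future positions
alpha = softmax(scores)                     # each row sums to 1
out = alpha @ V                             # sum_i alpha_i * v_i for each query
print(alpha[0])   # row 0 can only look at position 0: [1, 0, 0, 0]
```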
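The Topic 8 row's multi-head recipe: split d_model into h narrower d_k subspaces, attend in each in parallel, concatenate, and re-project with W_O. The split_heads helper and all sizes are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

T, d_model, h = 6, 16, 4
d_k = d_model // h                          # narrower subspace per head
rng = np.random.default_rng(3)
x = rng.normal(size=(T, d_model))
Wq, Wk, Wv, Wo = (rng.normal(size=(d_model, d_model)) * d_model**-0.5 for _ in range(4))

def split_heads(M):                         # (T, d_model) -> (h, T, d_k)
    return M.reshape(T, h, d_k).transpose(1, 0, 2)

Q, K, V = split_heads(x @ Wq), split_heads(x @ Wk), split_heads(x @ Wv)
alpha = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d_k))   # (h, T, T)
heads = alpha @ V                                          # (h, T, d_k)
concat = heads.transpose(1, 0, 2).reshape(T, d_model)      # concatenate the heads
out = concat @ Wo                                          # re-project with W_O
print(out.shape)   # (6, 16)
```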
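The Topic 9 row's sinusoidal encoding, following the paper's PE(pos, 2i) = sin(pos / 10000^{2i/d}) and PE(pos, 2i+1) = cos(pos / 10000^{2i/d}) formulas; the function name and the even width are assumptions.

```python
import numpy as np

def sinusoidal_pe(T, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos of the same angle."""
    pos = np.arange(T)[:, None]
    i = np.arange(0, d, 2)[None, :]
    angles = pos / np.power(10000.0, i / d)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(T=50, d=16)
# Order enters additively: X = token_embeddings + pe breaks permutation symmetry.
print(pe.shape, pe[0, :4])   # position 0 alternates sin(0)=0, cos(0)=1
```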
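The Topic 10 row's cross-attention, sketched without the usual Q/K/V projections for brevity: decoder positions query the encoder memory, and a padding mask blanks out the padded source slots. The lengths and the pad pattern are made up.

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

S, T, d = 5, 3, 8                        # source length, target length, width
rng = np.random.default_rng(4)
memory = rng.normal(size=(S, d))         # encoder output ("memory")
dec = rng.normal(size=(T, d))            # decoder state after masked self-attention
src_pad = np.array([False, False, False, True, True])  # last two source slots are padding

# Cross-attention: queries from the decoder, keys/values from the encoder.
scores = dec @ memory.T / np.sqrt(d)                   # (T, S)
scores = np.where(src_pad[None, :], -np.inf, scores)   # padding mask on the source side
alpha = softmax(scores)
context = alpha @ memory                               # each target position reads the source
print(alpha.sum(-1))   # rows still sum to 1 over the real source tokens
```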
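The Topic 11 row's point that objectives, not architecture, differ: the same logits feed either a causal next-token cross-entropy or an MLM loss restricted to masked-out positions. The logits and token ids here are random stand-ins.

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

T, V = 4, 10
rng = np.random.default_rng(5)
logits = rng.normal(size=(T, V))        # stand-in for the stack's output projection
tokens = np.array([3, 7, 1, 4])         # a toy id sequence

# Decoder-style causal objective: position t predicts token t+1.
probs = softmax(logits[:-1])            # predictions at positions 0..T-2
nll = -np.log(probs[np.arange(T - 1), tokens[1:]])
print("causal CE:", nll.mean())

# Encoder-style MLM: only the masked-out positions contribute to the loss.
mlm_mask = np.array([False, True, False, True])
probs = softmax(logits)
nll = -np.log(probs[np.arange(T), tokens])
print("MLM CE:", nll[mlm_mask].mean())
```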
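Finally, the Topic 12 row's arithmetic, using the paper's base sizes (d_model = 512, d_ff = 2048) and an assumed context length T: per-block parameters run to a few million, while the (T, T) score map is what makes attention quadratic.

```python
# Back-of-envelope sizes for one block, using the paper's base configuration
# (d_model=512, d_ff=2048); T is a context length we choose for illustration.
d_model, d_ff, T = 512, 2048, 1024

attn_params = 4 * d_model * d_model          # W_Q, W_K, W_V, W_O
ffn_params = 2 * d_model * d_ff              # W_1 and W_2 (biases ignored)
print(f"params/block ≈ {attn_params + ffn_params:,}")   # ≈ 3.1M

# Attention cost: QK^T and alpha @ V are each O(T^2 * d_model) multiply-adds,
# and the (T, T) score map alone holds T*T floats; this is the quadratic wall.
score_flops = 2 * T * T * d_model
print(f"QK^T multiply-adds ≈ {score_flops:,}; score map holds {T*T:,} floats")
```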