Coding-first primer
Read “Attention Is All You Need” with twelve grounded topics
When we say “Transformer”, we mean the architecture—the repeatable pattern governing how tensors move through attention blocks, stacked layers and projections, residual paths through depth, dense feed-forward sublayers, and normalisation—not training objectives, scaling folklore, or any single product lineage.
Each topic explains why this architecture succeeded—not as a string of jargon, but as a stack you can sketch on paper: layered linear projections, softmax attention, summed mixes, residual adds, and nonlinearities, all evaluated on tensors. Parameter counts, FLOPs, and loss under your training objective remain as measurable here as in any differentiable model you already optimize. Readers who routinely move tensors should see attention, positional encodings, and encoder–decoder masks as explicit statistical machinery—not magic.
Twelve topic cards—each opens the deep dive
Topic 1
Language as tensors & order
How did sequence-to-sequence machine translation set up the Transformer problem?
Encoder–decoder frames map source sentences to a latent memory that decoders consume while generating targets.
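A minimal NumPy sketch of that contract in shapes only; the sizes and the tanh "encoder" below are toy stand-ins, not anything from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 2 sentences per batch, 7 source tokens, 5 target tokens
# generated so far, model width 16. Random weights are stand-ins, not a model.
B, T_src, T_tgt, d = 2, 7, 5, 16

src = rng.normal(size=(B, T_src, d))             # embedded source tokens
memory = np.tanh(src @ rng.normal(size=(d, d)))  # "encoder": leaves a (B, T_src, d) memory

tgt = rng.normal(size=(B, T_tgt, d))             # decoder states while generating
scores = tgt @ memory.transpose(0, 2, 1)         # (B, T_tgt, T_src) alignment scores
print(scores.shape)                              # every target step can read all of memory
```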
Open deep dive →
Topic 2
Language as tensors & order
Why do models still begin with token + position vectors?
Unicode normalisation, byte-pair encoding, and SentencePiece models determine which atomic units get ids.
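Once the tokeniser has assigned ids, the model only sees integers indexing an embedding table. A minimal sketch with a made-up four-word vocabulary:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary: the ids are whatever the tokeniser (BPE, SentencePiece, ...)
# assigned; the model never sees the strings again.
vocab = {"the": 0, "cat": 1, "sat": 2, "<pad>": 3}
ids = np.array([0, 1, 2])               # "the cat sat"

d = 8
E = rng.normal(size=(len(vocab), d))    # embedding table E ∈ ℝ^{|V|×d}

X = E[ids]                              # row lookup: one dense vector per token
print(X.shape)                          # (3, 8) — the X fed into attention/FFN stacks
```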
Open deep dive →
Topic 3
Language as tensors & order
Why was word order historically hard?
Bag-of-words destroys syntax: permutations become identical inputs unless you augment features.
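You can watch the destruction directly: two sentences with opposite meanings collapse to the same count vector. A tiny sketch with a hypothetical three-word vocabulary:

```python
import numpy as np

vocab = {"dog": 0, "bites": 1, "man": 2}

def bow(tokens):
    """Count vector over the vocabulary: all order information is discarded."""
    v = np.zeros(len(vocab))
    for t in tokens:
        v[vocab[t]] += 1
    return v

a = bow(["dog", "bites", "man"])
b = bow(["man", "bites", "dog"])
print(np.array_equal(a, b))  # True: opposite meanings, identical features
```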
Open deep dive →
Topic 4
Recurrence, depth, and convolutions
Why did gated RNNs precede Transformers?
Backprop through time unfolds the graph for T steps; Jacobian spectra multiply across the unroll.
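A toy demonstration of why that product is dangerous, using a linear recurrence h_t = W h_{t−1} (so the T-step gradient is literally Wᵀ applied T times); the scales 0.9 and 1.1 are arbitrary illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Gradient through T steps of h_t = W h_{t-1} is governed by W's spectral radius.
d, T = 32, 100
for scale in (0.9, 1.1):                                  # contractive vs expansive
    W = scale * np.linalg.qr(rng.normal(size=(d, d)))[0]  # orthogonal matrix * scale
    g = np.eye(d)
    for _ in range(T):
        g = W @ g                                         # accumulate the Jacobian product
    print(scale, np.linalg.norm(g))                       # ~0.9^100 vanishes, ~1.1^100 explodes
```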
Open deep dive →
Topic 5
Recurrence, depth, and convolutions
Convolutions stacked depth to widen context—what was missing?
Convolutional filters see only the k nearest tokens unless you deepen the network or dilate kernels.
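The receptive-field arithmetic is short enough to sketch; each 1-D conv layer with kernel size k and dilation d adds (k − 1)·d positions of context:

```python
# Receptive field of stacked 1-D convolutions: grows linearly with depth
# unless dilations grow, whereas attention reaches everything in one step.
def receptive_field(kernel, dilations):
    return 1 + sum((kernel - 1) * d for d in dilations)

k = 3
print(receptive_field(k, [1] * 6))        # 13 tokens after 6 plain layers
print(receptive_field(k, [1, 2, 4, 8]))   # 31 tokens with doubling dilation
```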
Open deep dive →
Topic 6
Attention machinery
Attention as differentiable, sparse-ish information retrieval
Alignment scores decide how strongly each encoder position participates in updating the decoder context.
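The whole mechanism fits in a few lines: score every encoder position, softmax into a distribution, take the expected value. A minimal sketch with toy sizes and random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

T, d = 5, 4
memory = rng.normal(size=(T, d))      # encoder states, one per source token
query = rng.normal(size=(d,))         # current decoder context

scores = memory @ query               # one alignment score per position
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                  # softmax: a distribution over positions

context = alpha @ memory              # Σ αᵢ vᵢ: a differentiable, soft read
print(alpha.round(2), context.shape)
```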
Open deep dive →
Topic 7
Attention machinery
Q, K, V: organising matmul-friendly attention
Queries index; keys advertise content addresses; values carry payloads mixed by weights.
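A minimal single-head sketch of softmax(Q Kᵀ / √d_k) V, with an optional boolean mask (False = forbidden, implemented as −∞ before the softmax); all sizes are toy:

```python
import numpy as np

def sdpa(Q, K, V, mask=None):
    """softmax(Q Kᵀ / √d_k) V with an optional boolean mask (False = forbidden)."""
    d_k = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d_k)
    if mask is not None:
        logits = np.where(mask, logits, -np.inf)   # −∞ rows zero out under softmax
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V                                   # weighted mix of value payloads

rng = np.random.default_rng(0)
T, d_k = 6, 8
Q, K, V = (rng.normal(size=(T, d_k)) for _ in range(3))
out = sdpa(Q, K, V, mask=np.tril(np.ones((T, T), dtype=bool)))  # causal mask
print(out.shape)
```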
Open deep dive →
Topic 8
Attention machinery
Why replicate attention in parallel instead of widening one head?
Attention heads specialise on syntax, lexical repetition, positional bias, and pronoun linkage—empirically not guaranteed, but frequently observed.
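Mechanically, multi-head attention is just parallel projections into narrower d_k = d_model/h subspaces, followed by concatenation and a re-projection. A minimal sketch with toy sizes and random weights (the loop stands in for what runs as one batched matmul in practice):

```python
import numpy as np

rng = np.random.default_rng(0)

T, d_model, h = 6, 16, 4
d_k = d_model // h                             # each head works in a narrower slice

X = rng.normal(size=(T, d_model))
W_qkv = rng.normal(size=(3, h, d_model, d_k))  # per-head Q/K/V projections
W_O = rng.normal(size=(d_model, d_model))

heads = []
for i in range(h):                             # heads run in parallel in practice
    Q, K, V = (X @ W_qkv[j, i] for j in range(3))
    logits = Q @ K.T / np.sqrt(d_k)
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    heads.append(w @ V)                        # (T, d_k) per head

out = np.concatenate(heads, axis=-1) @ W_O     # concat, then re-project
print(out.shape)                               # (T, d_model)
```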
Open deep dive →
Topic 9
Full encoder–decoder shell
Adding order without resurrecting recurrence
Additive encodings broadcast vectors with distinct frequency bands across sequence positions.
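The sinusoidal table from the paper is easy to build directly; each dimension pair oscillates at its own frequency, so nearby positions get similar but distinguishable codes:

```python
import numpy as np

def sinusoidal(T, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(T)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / 10000 ** (2 * i / d)
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal(T=50, d=16)
# Added (broadcast) onto the token embeddings before the first block.
print(pe.shape, pe[0, :4].round(3))
```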
Open deep dive →
Topic 10
Full encoder–decoder shell
Three attention flavours in one stack diagram
Encoder self-attention attends freely in both directions over source tokens (subject to padding masks).
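A minimal sketch of the padding-mask half of that sentence, assuming a toy batch of two sentences padded to the same length; pad positions get −∞ before the row-wise softmax, so they receive zero weight:

```python
import numpy as np

rng = np.random.default_rng(0)

# Batch of 2 source sentences, padded to T = 5; true lengths 5 and 3.
T, d = 5, 4
lengths = np.array([5, 3])
keep = np.arange(T)[None, :] < lengths[:, None]    # (2, 5) padding mask

X = rng.normal(size=(2, T, d))
logits = X @ X.transpose(0, 2, 1) / np.sqrt(d)     # encoder self-attention scores

# Forbid attending *to* pad positions: −∞ before the row-wise softmax.
logits = np.where(keep[:, None, :], logits, -np.inf)
w = np.exp(logits - logits.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)
print(w[1, 0].round(2))                            # last two weights are 0
```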
Open deep dive →
Topic 11
From the paper forward
BERT, GPT, T5… same atoms, swapped training recipes
Encoder-only Transformer stacks discard the decoder but keep bidirectional self-attention for Cloze-like denoising (BERT lineage).
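The lineages differ in what each position is allowed to see, which is just a boolean visibility matrix. A toy sketch contrasting the two patterns:

```python
import numpy as np

T = 5
# Encoder-only (BERT-style): every token may attend to every other token;
# the objective hides *inputs* (masked tokens), not attention edges.
bert_visibility = np.ones((T, T), dtype=bool)

# Decoder-only (GPT-style): position t may only attend to positions ≤ t,
# keeping next-token cross-entropy a valid autoregressive factorisation.
gpt_visibility = np.tril(np.ones((T, T), dtype=bool))

print(bert_visibility.astype(int), gpt_visibility.astype(int), sep="\n\n")
```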
Open deep dive →
Topic 12
From the paper forward
Costs, hybrids, multimodal workloads
Attention matrix materialisation dominates memory—not just FLOPs—motivating block-sparse kernels and FlashAttention tiling.
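Back-of-envelope arithmetic shows why: materialising T × T scores per head and batch element grows quadratically. A sketch with illustrative (not benchmarked) head and batch counts, at 2 bytes per score (fp16/bf16):

```python
# Memory to materialise the full attention map: T*T scores per head per batch
# element. FlashAttention avoids this by tiling and never storing the map.
def attn_map_gib(T, heads, batch, bytes_per=2):
    return T * T * heads * batch * bytes_per / 2**30

for T in (1_024, 8_192, 65_536):
    print(f"T={T:>6}: {attn_map_gib(T, heads=16, batch=8):10.2f} GiB")
# 0.25 GiB at 1k tokens, 16 GiB at 8k, 1024 GiB at 64k — memory, not FLOPs, bites first.
```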
Open deep dive →
Where the building blocks from the overview enter the roadmap
The introductory paragraphs sketch one architecture story: tensors percolate through stacked attention sublayers plus feed-forward slabs, residuals, normalisation; softmax turns scores into mixtures; positional encodings and masks inject order and causal structure. Cross-check that story here (a concept's primary topic is not the only place it appears).
| Idea | Primary topic | What to read there |
|---|---|---|
| Tensors over batch, time, width — minibatches as (batch, time, dim) grids | Topic 1: Seq2seq encoders, decoders, and why recurrence blocked parallel training | Frames why Transformers crave wide matmuls over padded sequences vs serial RNN unfolds. |
| Linear projections from vocab space — embeddings, lookup tables tied to softmax | Topic 2: Tokenisation, one-hot bottlenecks, and dense embeddings that feed every layer | Token ids → rows of E ∈ ℝ^{|V|×d}; stacking rows yields X fed into attention / FFN stacks. |
| Depth, gradients, residuals, normalisation — stabilising stacks before attention dominates | Topic 4: Gradient highways, exploding norms, and the limits recurrence hit before 2017 | Explains why additive skips / layer norms recur when depth grows—directly echoed in Transformer blocks. |
| Softmax weights as distributions; summed mixes Σ αᵢ vᵢ as differentiable pooling | Topic 6: From alignment weights to expected-value reads out of encoder memory | Turns logits into stochastic weights — the “summed mixes” intuition before Q/K/V matmul notation. |
| Scaled dot-product attention softmax(Q Kᵀ / √d_k) V + causal/padding logits | Topic 7: Scaled dot-product attention as the algebraic heart of Transformer blocks | Core attention algebra; masking uses −∞ on forbidden positions prior to softmax rows. |
| Multi-head stacking, narrower d_k = d_model/h, concatenate + W_O re-projection | Topic 8: Subspaces per head diversify relational patterns learnt jointly | Where parallel projections diversify heads before the mixer hands off to dense FFN sublayers. |
| Positional encoding — injecting order additively (learned vs sinusoidal) | Topic 9: Sinusoidal encodings versus learned embeddings and relative extensions | Break permutation symmetry so softmax attention distinguishes token order explicitly. |
| Encoder–decoder shell — MHSA · FFN · LayerNorm/residual repeats; causal + cross-attn masks | Topic 10: Encoder self-attention, masked decoder self-attention, encoder–decoder attention | Glue from Figure 1 made concrete: causal decoder masks preserve AR likelihood while cross-attn aligns languages. |
| Training objectives vs architecture — masking patterns, logits, CE gradients | Topic 11: Encoder-only masking objectives versus decoder-only next-token rollout | Same tensor stack specialised by objective (BERT-style MLM vs GPT-style causal CE vs seq2seq denoise). |
| Parameter counts · FLOPs · memory footprints — quadratic T² attention realism | Topic 12: O(n²) attention maps, approximation research, state-space resurgence, multimodal routing | Quantifies the computational story behind the “measurable loss / profiled matmul” language in the overview. |
Earlier topics usually mention these ideas too; rows point to the richest treatment in this guide.