Topic 11
Encoder-only masking objectives versus decoder-only next-token rollout
Attention Is All You Need (Vaswani et al., 2017) already trains the decoder with a vocabulary softmax and cross-entropy on the next target token for translation. Later systems (BERT, GPT-style LMs, T5-style denoising) mostly swap objectives and attention masks on the same stack; this topic sorts those recipes and names the probability tools they share.
Math & statistics used here
- Vocabulary head: final linear map h ↦ logits ∈ ℝ^{|V|}; row-wise softmax maps logits to a probability vector over next-token (or masked-position) outcomes.
- Cross-entropy (CE) on a one-hot target id equals −log p for the correct class; sums over positions give a negative log-likelihood flavour identical in spirit to the translation decoder in Attention Is All You Need and to later LM pre-training.
- Autoregressive target modelling writes P(y | source) = ∏_t P(y_t | y_{<t}, source); causal (lower-triangular) masks on the decoder keep that factorisation honest under teacher forcing.
- The seq2seq model in Attention Is All You Need is already a conditional categorical predictor: probabilities are over the next target token given the source—not an optional add-on invented by BERT or GPT branding.
- Masked LM (BERT-style): hide or corrupt tokens and predict them using bidirectional self-attention—still CE on a categorical label, but the supervised event is “fill the hole,” not a single left-to-right chain at every step.
- Causal LM (GPT-style): standard next-token CE with a causal mask so position t never attends to future targets—many positions train in parallel because the mask encodes order.
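- Masked LM and next-token CE share the same gradient form: at every supervised position the gradient with respect to the logits is softmax(logits) minus the one-hot target, and it backpropagates into embeddings and attention weights in the same way. The objectives differ in which positions emit a loss and which positions can see each other.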
- Scaling-laws shorthand: compute still tracks roughly parameters × tokens seen; auxiliary losses (NSP-era BERT, etc.) add terms but rarely replace the core softmax/CE backbone.
- Temperature scaling divides the logits by T before softmax: T < 1 sharpens the distribution and T > 1 flattens it. It is useful in distillation or sampling and uses the same vocabulary head as above.
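The vocabulary head, cross-entropy, and temperature items above can be sketched in a few lines of plain Python. This is a minimal illustration, not any library's implementation: the three-entry logit vector and the function names `softmax` and `cross_entropy` are made up for the example.

```python
import math

def softmax(logits, temperature=1.0):
    """Map logits to probabilities; logits are divided by `temperature` first."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target_id):
    """CE against a one-hot target reduces to -log p[target_id]."""
    return -math.log(softmax(logits)[target_id])

logits = [2.0, 1.0, 0.1]  # hypothetical scores over a 3-token vocabulary
probs = softmax(logits)
loss = cross_entropy(logits, target_id=0)

sharp = softmax(logits, temperature=0.5)  # T < 1 concentrates mass on the top logit
flat = softmax(logits, temperature=2.0)   # T > 1 spreads mass more evenly
```

Summing `cross_entropy` over all supervised positions gives the negative log-likelihood mentioned in the list.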
Checklist you can map to code
- Encoder-only Transformer stacks discard the decoder but keep bidirectional self-attention for Cloze-like denoising (BERT lineage).
- Decoder-only models keep causal masks for autoregressive left-to-right training (GPT lineage).
- Encoder–decoder seq2seq (T5/BART style) applies masking noise to the input and reconstructs the missing text with the decoder, while sharing parameter templates with the baseline paper.
- Pre-training corpora dictate capability; architectures decide the interface. The same dot-product primitives appear everywhere.
- Scaling laws relate compute, tokens, and parameter counts rather than specific head counts; they arrived later but rely on the optimisation-stability lessons from Section 6.3.
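To map the first two checklist items to code, the sketch below builds training pairs for the two recipes from one toy sentence. It is a simplification under stated assumptions: the helper names, the `[MASK]` string, and the use of `None` for "no loss at this position" are choices made for this example, and the corruption step always inserts the mask symbol rather than reproducing any particular paper's noise schedule.

```python
MASK = "[MASK]"  # hypothetical mask symbol for this sketch

def make_mlm_example(tokens, masked_positions):
    """Encoder-only recipe: corrupt chosen positions and supervise only those."""
    inputs = [MASK if i in masked_positions else tok
              for i, tok in enumerate(tokens)]
    # None marks positions that emit no loss term.
    labels = [tok if i in masked_positions else None
              for i, tok in enumerate(tokens)]
    return inputs, labels

def make_clm_example(tokens):
    """Decoder-only recipe: every position predicts the token to its right."""
    return tokens[:-1], tokens[1:]

tokens = ["the", "cat", "sat", "down"]
mlm_inputs, mlm_labels = make_mlm_example(tokens, masked_positions={1})
clm_inputs, clm_labels = make_clm_example(tokens)
```

Both recipes feed the same vocabulary head and CE loss; only the supervised positions and the visibility rules differ.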
In Attention Is All You Need, the flagship task is machine translation, not yet web-scale "foundation model" pre-training. Still, the decoder emits, at each target position, logits over a finite vocabulary; a row-wise softmax turns those logits into a categorical distribution; training minimises cross-entropy against the reference translation under standard teacher forcing. That is already "fit a conditional probability model for the next revealed target token," with probabilities normalised by softmax, and it predates BERT- or GPT-branded objectives.
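Teacher forcing can be made concrete with a toy example. The lookup table below is a hypothetical stand-in for a decoder: it conditions only on the previous target token, whereas a real decoder conditions on the source and the whole prefix. The point of the sketch is only that the loss sums −log P over reference tokens while the reference itself supplies each prefix.

```python
import math

# Hypothetical toy conditional model: P(next token | previous target token).
TOY_MODEL = {
    "<bos>": {"le": 0.7, "chat": 0.2, "<eos>": 0.1},
    "le":    {"le": 0.1, "chat": 0.8, "<eos>": 0.1},
    "chat":  {"le": 0.1, "chat": 0.1, "<eos>": 0.8},
}

def teacher_forced_nll(reference):
    """Sum of -log P(y_t | prefix), with the prefix taken from the reference."""
    prev = "<bos>"
    nll = 0.0
    for tok in reference:
        nll += -math.log(TOY_MODEL[prev][tok])
        prev = tok  # teacher forcing: feed the gold token, not a model sample
    return nll

reference = ["le", "chat", "<eos>"]
nll = teacher_forced_nll(reference)
sequence_prob = math.exp(-nll)  # product of the per-step probabilities
```

Here `sequence_prob` is 0.7 × 0.8 × 0.8 up to floating-point error, which is the chain-rule product written as a sum of log terms.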
What changed in later work is less “whether Transformers use probability” and more which positions you supervise, which contexts each position may attend to, and which noise process builds the input—for example masked language modelling with bidirectional encoder attention versus causal (left-to-right) language modelling, or span corruption / infilling in encoder–decoder T5-style setups. Attention masks enforce visibility rules (often by sending forbidden logits to −∞ before softmax); auxiliary heads add extra CE terms; the core vocabulary head → softmax → negative log-likelihood pattern nonetheless remains the familiar training workhorse.
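The "send forbidden logits to −∞ before softmax" trick can be sketched as follows. The function names and the uniform toy scores are assumptions for illustration; production code would apply the same idea to batched tensors rather than nested lists.

```python
import math

NEG_INF = float("-inf")

def visibility_mask(n, causal):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask):
    """Row-wise softmax after forbidden scores are replaced by -inf."""
    out = []
    for row, allowed in zip(scores, mask):
        masked = [s if ok else NEG_INF for s, ok in zip(row, allowed)]
        m = max(masked)  # finite, because each row allows at least itself
        exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

scores = [[0.0, 0.0, 0.0] for _ in range(3)]  # uniform toy attention scores
causal_weights = masked_softmax(scores, visibility_mask(3, causal=True))
bidir_weights = masked_softmax(scores, visibility_mask(3, causal=False))
```

With the causal mask, the first row puts all its weight on position 0 and the second row splits evenly over positions 0 and 1; with the bidirectional mask every row is uniform over all three positions.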
When BERT debuted, reviewers emphasised that masking random tokens forces the model to aggregate context from both directions. That is impossible, naively, inside a causal decoder stack, yet trivial once you drop the autoregressive constraint on an encoder tower.
GPT emphasised causal next-token modelling, which mirrors classical RNN LM training but parallelises massively across positions through masked attention: although the mask forbids glimpsing future tokens, GPUs still batch the GEMMs cleanly.
Parameter sharing and weight tying matter: GPT-style vocabulary projections often tie their weights to the input embeddings; encoder-only discriminative stacks sometimes route a separate CLS representation into classification heads. These are minor engineering divergences with major product impact.
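Weight tying, as mentioned above, means the output projection reuses the input embedding matrix. The sketch below uses a made-up 3-token vocabulary with 2-dimensional embeddings and a hand-picked hidden state; the function names are illustrative and bias terms are omitted.

```python
# Hypothetical 3-token vocabulary with 2-dimensional embeddings.
embedding = [
    [1.0, 0.0],  # token 0
    [0.0, 1.0],  # token 1
    [1.0, 1.0],  # token 2
]

def embed(token_id):
    """Input side: look up one row of the embedding matrix."""
    return embedding[token_id]

def tied_logits(hidden):
    """Output side: reuse the same matrix, logits[v] = hidden . embedding[v]."""
    return [sum(h * e for h, e in zip(hidden, row)) for row in embedding]

hidden = [0.5, 2.0]           # stand-in for a final-layer hidden state
logits = tied_logits(hidden)  # one score per vocabulary entry
```

Because `embed` and `tied_logits` read the same `embedding` table, a gradient step on either side updates the parameters used by both.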
Instruction tuning and reinforcement learning atop decoder-only stacks inherit the same tensor shapes as the decoder in Attention Is All You Need. Students should realise that innovation often lies in data and optimisation recipes rather than in entirely fresh ops.
Interpretability parallels: probing layers for syntax in BERT matched earlier RNN probing but with finer resolution, thanks to the short paths between positions that stacked attention provides.
Software ecosystem note: Hugging Face `AutoModel*` classes unify config flags encoding architectural flavours—internals still reflect Figure 1 abstractions rewritten in Python.