Topic 11

Encoder-only masking objectives versus decoder-only next-token rollout

Vaswani et al. delivered a permutation-friendly substrate; later teams merely chose which half of the stack to keep and which causal masks to make permanent.

Math & statistics used here

  • Masked LM and next-token prediction are both cross-entropy against categorical targets; gradients accumulate into the embedding and attention weights alike (first sketch after this list).
  • Bidirectional masks remove the strict lower-triangular structure; softmax still sums to one over the keys not marked invalid (second sketch below).
  • Scaling-laws thinking: training compute is commonly estimated as C ≈ 6·N·D for N parameters and D tokens seen; dense linear algebra dominates both the forward and backward passes (third sketch below).
  • A logits temperature T scales the softmax input by 1/T; this helps student/teacher KL matching in distillation runs that reuse the same stack (fourth sketch below).
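
A minimal PyTorch sketch of the first bullet (batch size, vocabulary size, and the 15% mask rate are illustrative placeholders, not taken from any paper): both objectives reduce to the same `F.cross_entropy` call, and only the target construction differs.

```python
import torch
import torch.nn.functional as F

batch, seq, vocab = 2, 8, 1000
logits = torch.randn(batch, seq, vocab)          # output of either stack
tokens = torch.randint(0, vocab, (batch, seq))   # input token ids

# Masked LM: loss only at corrupted positions; everything else is
# skipped via the conventional -100 ignore index.
mlm_targets = tokens.clone()
selected = torch.rand(batch, seq) < 0.15         # illustrative 15% mask rate
mlm_targets[~selected] = -100
mlm_loss = F.cross_entropy(logits.reshape(-1, vocab),
                           mlm_targets.reshape(-1),
                           ignore_index=-100)

# Next-token LM: identical cross-entropy; targets are the inputs shifted left.
ar_loss = F.cross_entropy(logits[:, :-1].reshape(-1, vocab),
                          tokens[:, 1:].reshape(-1))
```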
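
The second bullet in code: the only structural difference between the two regimes is whether a lower-triangular visibility mask is applied before the softmax (single synthetic attention head, no padding).

```python
import torch
import torch.nn.functional as F

seq = 5
scores = torch.randn(seq, seq)                   # q·kᵀ/√d_k for one head

# Decoder-only: each query may attend only to keys at its position or earlier.
causal = torch.tril(torch.ones(seq, seq, dtype=torch.bool))
causal_attn = F.softmax(scores.masked_fill(~causal, float('-inf')), dim=-1)

# Encoder-only: drop the restriction; only padding keys would ever be masked.
bidir_attn = F.softmax(scores, dim=-1)

# Each row still sums to one over the keys that were not marked invalid.
assert torch.allclose(causal_attn.sum(-1), torch.ones(seq))
assert torch.allclose(bidir_attn.sum(-1), torch.ones(seq))
```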
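
For the third bullet, a back-of-envelope estimate using the common C ≈ 6·N·D rule of thumb (roughly 2·N·D FLOPs forward and 4·N·D backward); the parameter and token counts below are placeholders, not measurements.

```python
N = 124e6            # parameters, roughly GPT-2 small scale
D = 40e9             # tokens seen during pre-training
C = 6 * N * D        # total training FLOPs under the approximation
print(f"C ≈ {C:.2e} FLOPs")   # ≈ 2.98e+19
```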
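
The fourth bullet as code: a distillation-style KL between temperature-softened teacher and student logits over the same vocabulary. The T² rescaling convention follows Hinton et al.; the tensors here are random stand-ins.

```python
import torch
import torch.nn.functional as F

T = 2.0                                   # softening temperature (illustrative)
teacher_logits = torch.randn(4, 1000)     # same stack, frozen
student_logits = torch.randn(4, 1000)     # same stack, trainable

# KL(teacher ‖ student) on 1/T-scaled logits; the T² factor keeps gradient
# magnitudes comparable across temperatures.
kd_loss = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                   F.softmax(teacher_logits / T, dim=-1),
                   reduction='batchmean') * T * T
```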

Checklist you can map to code

  • Encoder-only Transformer stacks discard the decoder but keep bidirectional self-attention for Cloze-like denoising (BERT lineage).
  • Decoder-only models keep causal masks for autoregressive left-to-right training (GPT lineage).
  • Encoder–decoder seq2seq (T5/BART style) repurposes masking noise on the input side while decoding autoregressively, but shares parameter templates with the baseline paper.
  • Pre-training corpora dictate capability; architectures decide the interface; the same dot-product primitives appear everywhere.
  • Scaling laws, which came later, correlate compute, tokens, and parameter counts (not specific head counts), but they rely on the optimisation-stability lessons of Section 6.3.

When BERT debuted, reviewers emphasised that masking random tokens forces bidirectional context aggregation, which is impossible to do naively inside a causal decoder stack yet trivial once you delete the autoregressive constraint on an encoder tower (corruption sketch below).
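
A sketch of that corruption recipe, assuming the 80/10/10 split reported in the BERT paper; the function name, shapes, and the -100 ignore-index convention are illustrative choices, not fixed API.

```python
import torch

def corrupt_for_mlm(tokens, vocab_size, mask_id, p=0.15):
    """BERT-style corruption: select ~15% of positions; of those, 80% become
    [MASK], 10% a random token, and 10% keep the original token."""
    targets = tokens.clone()
    selected = torch.rand(tokens.shape) < p
    targets[~selected] = -100                     # compute loss only where selected

    corrupted = tokens.clone()
    r = torch.rand(tokens.shape)
    corrupted[selected & (r < 0.8)] = mask_id
    swap = selected & (r >= 0.8) & (r < 0.9)
    corrupted[swap] = torch.randint(0, vocab_size, tokens.shape)[swap]
    # the remaining selected positions stay unchanged, yet still carry a target
    return corrupted, targets
```

Feeding `corrupted` through a bidirectional encoder and scoring it against `targets` with the masked cross-entropy shown earlier completes the objective.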

GPT emphasised causal next-token modelling, mirroring classical RNN LM training, but it massively parallelises across positions using masked attention: even though the mask forbids glimpsing the future, GPUs still batch the GEMMs cleanly (sketch below).
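
That parallelism in one call, using PyTorch's `F.scaled_dot_product_attention` (available since PyTorch 2.0; shapes here are illustrative): the whole sequence is processed in a single set of GEMMs, where an RNN LM would need a step-by-step loop.

```python
import torch
import torch.nn.functional as F

batch, seq, d = 2, 128, 64
q = k = v = torch.randn(batch, seq, d)    # stand-ins for projected activations

# is_causal=True applies the lower-triangular mask internally, so the
# next-token prediction for all 128 positions is trained in one batched pass.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)                          # torch.Size([2, 128, 64])
```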

Parameter sharing and weight tying matter: GPT-style vocabulary projections often tie the output matrix to the input embeddings, while encoder-only discriminative stacks sometimes decouple the CLS representation for classification heads. It is a minor engineering divergence with major product impact (sketch below).
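
A minimal sketch of the tying trick (class name and sizes are mine): assigning the embedding matrix to the output projection makes the two layers share one tensor, so both ends update the same vocab_size × d_model parameters.

```python
import torch
import torch.nn as nn

class TiedLM(nn.Module):
    """Toy head illustrating GPT-style weight tying."""
    def __init__(self, vocab_size=1000, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.embed.weight   # one tensor, shared gradients

    def forward(self, ids):
        return self.lm_head(self.embed(ids))      # transformer body elided
```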

Instruction tuning and reinforcement learning atop decoder-only stacks inherit the same tensor shapes as the 2017 decoder; students should realise the innovation often lies in data and optimisation recipes, not entirely fresh ops.

Interpretability parallels: probing layers for syntax in BERT matched earlier RNN probing, but with finer resolution because stacked attention gives short routing paths between any pair of positions (probe sketch below).
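
A linear-probe sketch on synthetic tensors (the 768-dim states and 17-tag label space are placeholders): the hidden states stay frozen, only a linear map is trained, and its accuracy bounds how linearly decodable the property is at that layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden = torch.randn(512, 768)            # frozen per-token states from one layer
labels = torch.randint(0, 17, (512,))     # e.g. 17 universal POS tags

probe = nn.Linear(768, 17)                # the probe is the only trainable part
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss = F.cross_entropy(probe(hidden), labels)
    loss.backward()
    opt.step()
```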

Software ecosystem note: Hugging Face `AutoModel*` classes unify the config flags encoding architectural flavours; the internals still reflect the Figure 1 abstractions, rewritten in Python (sketch below).
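
A quick illustration, assuming the standard `bert-base-uncased` and `gpt2` checkpoints are reachable: two architectural flavours behind one `AutoModel*` interface.

```python
from transformers import AutoModelForMaskedLM, AutoModelForCausalLM, AutoTokenizer

bert = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # encoder-only
gpt2 = AutoModelForCausalLM.from_pretrained("gpt2")               # decoder-only

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
inputs = tok("The [MASK] sat on the mat.", return_tensors="pt")
logits = bert(**inputs).logits            # bidirectional context at every position
```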