Topic 4

Gradient highways, exploding norms, and the limits recurrence hit before 2017

LSTMs and GRUs exist because vanilla RNN Jacobians, multiplied across depth-in-time, explode or vanish; Transformers cut recurrent depth to zero and relocate that depth into stacked layers instead.

Math & statistics used here

  • Vanishing gradients: long products of Jacobian norms below 1 drive ∂L/∂h_1 toward 0; exploding gradients are the converse, and clip thresholds bound the update norm (a numeric sketch follows this list).
  • Additive cell states behave like residual streams; gates form convex combinations with sigmoid coefficients in (0, 1).
  • Same variance-scaling rationale as the 1/√(d_k) factor in attention: keep pre-softmax dot products inside a stable dynamic range.
  • Layer norms standardise mean/variance per vector, keeping matmul inputs from drifting as depth grows.
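
To ground the first and third bullets, here is a small numpy sketch with illustrative values (the orthogonal-Jacobian construction and d_k = 512 are assumptions, not from any source): per-step spectral norms of 0.9 versus 1.1 drive the gradient-norm bound toward zero or toward overflow, and rescaling dot products by 1/√(d_k) keeps the softmax from collapsing to a near one-hot distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 100, 32

# (1) Product of per-step Jacobian norms: orthogonal matrices scaled to a
# chosen spectral norm make the bound exactly scale**T.
for scale in (0.9, 1.1):
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # orthogonal, norm 1
    J = scale * Q                                     # spectral norm = scale
    bound = np.linalg.norm(J, 2) ** T
    print(f"scale={scale}: product of {T} Jacobian norms ≈ {bound:.3e}")

# (2) Unscaled dot products have variance d_k, saturating the softmax;
# dividing by sqrt(d_k) restores a usable dynamic range.
d_k = 512
q = rng.standard_normal(d_k)
K = rng.standard_normal((10, d_k))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for name, logits in [("unscaled", K @ q), ("scaled  ", K @ q / np.sqrt(d_k))]:
    p = softmax(logits)
    print(f"{name} softmax entropy: {-(p * np.log(p + 1e-12)).sum():.3f}")
```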

Checklist you can map to code

  • Backprop through time unfolds the graph T steps; Jacobian spectra multiply (see the BPTT sketch after this list).
  • Gating creates additive pathways similar in spirit to residual shortcuts: signals selectively bypass nonlinear squashing.
  • Long-range information still decays through the gates; attention-based readouts became necessary for alignment-heavy tasks such as translation.
  • Exploding gradients are mitigated by clipping; vanishing gradients require architecture changes or attention.
  • Modern LLM stacks repurpose residual + layer norm lore learned from stabilising RNN/CNN hybrids.
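
A minimal BPTT sketch under assumed simplifications (tanh RNN without biases, squared loss on the final state; the 1.2 spectral norm and clip threshold are illustrative): unfold T steps forward, chain one Jacobian-transpose factor per step backwards, then clip the gradient norm.

```python
import numpy as np

rng = np.random.default_rng(1)
T, d = 50, 16
W_h = 1.2 * np.linalg.qr(rng.standard_normal((d, d)))[0]  # spectral norm 1.2
x = 0.1 * rng.standard_normal((T, d))

# Forward: unfold T steps, caching states for the backward pass.
h = [np.zeros(d)]
for t in range(T):
    h.append(np.tanh(W_h @ h[-1] + x[t]))

# Backward: for L = ||h_T||^2 / 2, chain one Jacobian factor per step.
grad_h = h[-1].copy()
grad_W = np.zeros_like(W_h)
for t in reversed(range(T)):
    delta = grad_h * (1.0 - h[t + 1] ** 2)  # tanh' at step t
    grad_W += np.outer(delta, h[t])
    grad_h = W_h.T @ delta                  # one more Jacobian-transpose factor

# Clipping: the standard mitigation for the exploding side.
clip = 1.0
norm = np.linalg.norm(grad_W)
if norm > clip:
    grad_W *= clip / norm
print(f"raw grad norm {norm:.3e} -> clipped {np.linalg.norm(grad_W):.3e}")
```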

Vanishing gradients mean early tokens stop influencing the loss once states saturate; exploding gradients blow updates past float range unless clipped. Hochreiter and Schmidhuber’s LSTM tackled this with a cell state that travels along a near-linear path, protected by adaptive gates; a software analogy is carefully placed `if` branches controlling information flow.
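
A single-step LSTM sketch in numpy, with a hypothetical packed-weight layout rather than any library's API, to make that protected path concrete: the cell update is additive, so with the forget gate near 1 the step-to-step Jacobian of `c` is near identity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step. W: (4d, d_x + d), b: (4d,), packing i, f, g, o blocks."""
    d = h.shape[0]
    z = W @ np.concatenate([x, h]) + b
    i = sigmoid(z[:d])          # input gate: how much new content to write
    f = sigmoid(z[d:2*d])       # forget gate: how much old cell to keep
    g = np.tanh(z[2*d:3*d])     # candidate content
    o = sigmoid(z[3*d:])        # output gate: how much cell to expose
    c_new = f * c + i * g       # additive update: the gradient highway
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# With f ≈ 1 and i ≈ 0, dc_new/dc ≈ I: gradients travel through unchanged.
rng = np.random.default_rng(2)
d_x, d = 8, 16
W = rng.standard_normal((4 * d, d_x + d)) / np.sqrt(d_x + d)
h, c = lstm_step(rng.standard_normal(d_x), np.zeros(d), np.zeros(d), W, np.zeros(4 * d))
```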

GRUs compress the mechanics into two gates; engineers often treat the two architectures as interchangeable unless peak accuracy demands LSTM granularity. Both still bottleneck training parallelism because each step depends on the previous one, so time remains an outer sequential dependency.
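
The matching GRU sketch under the same hypothetical conventions: two gates instead of three, and the next state is an explicit convex combination of old state and candidate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, W, b):
    """One GRU step. W: (3d, d_x + d), b: (3d,), packing z, r, candidate blocks."""
    d = h.shape[0]
    zr = W[:2*d] @ np.concatenate([x, h]) + b[:2*d]
    z = sigmoid(zr[:d])   # update gate: blend weight between old and new
    r = sigmoid(zr[d:])   # reset gate: how much old state feeds the candidate
    g = np.tanh(W[2*d:] @ np.concatenate([x, r * h]) + b[2*d:])
    return (1.0 - z) * h + z * g  # convex combination, coefficients in (0, 1)

rng = np.random.default_rng(4)
d_x, d = 8, 16
W = rng.standard_normal((3 * d, d_x + d)) / np.sqrt(d_x + d)
h = gru_step(rng.standard_normal(d_x), np.zeros(d), W, np.zeros(3 * d))
```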

When Vaswani et al. emphasise residual connections around each sublayer (Figure 1; each sublayer computes LayerNorm(x + Sublayer(x))), recognise the conceptual continuity with the LSTM’s gradient highway: residuals provide uninterrupted routes along which gradients propagate with near-identity Jacobians.
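
A tiny numpy check, illustrative rather than from the paper, that a residual connection’s Jacobian is the identity plus the sublayer’s Jacobian: for a linear sublayer f(x) = A x, d(x + f(x))/dx = I + A, so even a nearly-dead sublayer leaves the gradient route open.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
A = 0.01 * rng.standard_normal((d, d))  # a nearly-dead linear sublayer
J_plain = A                             # Jacobian without the shortcut
J_resid = np.eye(d) + A                 # Jacobian with the shortcut
print("plain min singular value:", np.linalg.svd(J_plain, compute_uv=False).min())
print("resid min singular value:", np.linalg.svd(J_resid, compute_uv=False).min())
# The residual Jacobian stays near the identity, so gradients pass through.
```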

Layer normalisation parallels the batch-free stabilisers recurrence-friendly networks already preferred, since batch statistics are awkward to collect across time steps; Transformer training stacks tens of residual blocks, and descendants moved the norm in front of each sublayer (pre-norm), reflecting hard-won lore from stabilising sequential models.
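
A per-vector layer norm sketch plus the two residual orderings the paragraph alludes to; the function names and learnable gain/bias parameters are illustrative.

```python
import numpy as np

def layer_norm(x, gain=1.0, bias=0.0, eps=1e-5):
    # Normalise each vector (last axis) to zero mean, unit variance,
    # with no dependence on the batch, then rescale.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mu) / np.sqrt(var + eps) + bias

def post_norm(x, sublayer):   # the original Transformer ordering
    return layer_norm(x + sublayer(x))

def pre_norm(x, sublayer):    # the later ordering many descendants prefer
    return x + sublayer(layer_norm(x))

x = np.random.default_rng(5).standard_normal((4, 64))
y = pre_norm(x, lambda v: 0.5 * v)  # any shape-preserving sublayer works
```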

So the pedagogical arc is: recurrence solved ordering but hurt parallelism and long-range memory; attention solved alignment but existed as a peripheral bolt-on; the Transformer integrates attention everywhere while borrowing the stabilisation idioms gated RNN stacks popularised.

This mental model prevents you from thinking Transformers magically removed optimisation issues—they reorganised depth.