Topic 1

Seq2seq encoders, decoders, and why recurrence blocked parallel training

Neural machine translation was the flagship benchmark in 2017; attention-augmented RNNs still marched token by token, which is the production pain the paper attacks head-on.

Math & statistics used here

  • Treat a minibatch as a tensor of shape (batch, time, dim); matmuls are how you advance hidden states layer by layer.
  • Conditional LM objective: maximise ∏_t P(y_t | y_<t, x), i.e. minimise −Σ_t log P(y_t | y_<t, x); log-softmax losses are sums over these per-timestep terms (see the sketch after this list).
  • Backpropagation through time: the chain-rule product grows with the number of timesteps, and that recurrence is why parallelism was scarce before pure attention stacks.
  • Residual thought experiment: stacking depth without skip connections makes the product of Jacobians harder to keep well behaved; the Transformer inherits residual connections from CNN/RNN practice.
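
A minimal NumPy sketch of that objective, assuming illustrative shapes and a hypothetical nll_loss helper: the loss is the sum over timesteps of per-position log-softmax terms, i.e. −Σ_t log P(y_t | y_<t, x).

```python
# Sketch only: logits (batch, time, vocab) and integer targets (batch, time) are assumed shapes.
import numpy as np

def nll_loss(logits, targets):
    """Negative log-likelihood of a conditional LM: -sum_t log P(y_t | y_<t, x)."""
    # numerically stable log-softmax over the vocabulary axis
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    B, T, _ = logits.shape
    # gather log P(y_t | ...) at the target token for every (batch, time) position
    picked = log_probs[np.arange(B)[:, None], np.arange(T)[None, :], targets]
    return -picked.sum(axis=1).mean()  # sum over time, mean over the batch

rng = np.random.default_rng(0)
print(nll_loss(rng.normal(size=(2, 5, 11)), rng.integers(0, 11, size=(2, 5))))
```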

Checklist you can map to code

  • Encoder–decoder framing maps source sentences to a latent memory that the decoder consumes while generating targets.
  • Recurrent cores update a hidden state h_t = f(h_{t-1}, x_t), creating an inherently sequential training graph (see the loop sketch after this list).
  • Convolutional seq2seq models increased parallelism but still needed stacked layers to grow receptive fields.
  • Transformer removes timestep recurrence by letting every position attend to every other position, with a constant number of sequential operations per layer, at the cost of O(n²) pairwise comparisons.
  • BLEU/WMT numbers in Table 2 reward stable training at scale; that is the engineering bar the architecture optimises for.
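
To make the sequential-training-graph point concrete, here is a minimal NumPy sketch of the recurrent core from the checklist; the tanh cell and the sizes are illustrative, not the paper's. The Python loop over time is the problem: each h_t needs h_{t-1}, so the T steps cannot be fused into one batched matmul.

```python
# Sketch only: a plain tanh RNN cell with illustrative dimensions.
import numpy as np

def rnn_encode(x, W_h, W_x, b):
    """x: (time, dim_in). Returns the final hidden state after T strictly sequential steps."""
    h = np.zeros(W_h.shape[0])
    for x_t in x:                              # serial over timesteps by construction
        h = np.tanh(W_h @ h + W_x @ x_t + b)   # h_t = f(h_{t-1}, x_t)
    return h

rng = np.random.default_rng(0)
T, d_in, d_h = 7, 4, 8
h_T = rnn_encode(rng.normal(size=(T, d_in)),
                 0.1 * rng.normal(size=(d_h, d_h)),
                 0.1 * rng.normal(size=(d_h, d_in)),
                 np.zeros(d_h))
print(h_T.shape)  # (8,)
```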

Early seq2seq translators compressed the entire source sentence into one or a few hidden vectors, which bottlenecked long sentences. Attention let decoders peek back at every encoder state, which is the conceptual bridge from RNN-era papers to the fully attentive design in Vaswani et al.
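
A minimal sketch of that "peek back" step (NumPy; the dot-product scoring and the names decoder_state / enc_states are illustrative, since RNN-era papers also used additive scoring): the decoder scores every encoder state and reads back a softmax-weighted sum rather than relying on a single fixed vector.

```python
# Sketch only: a single decoder query attending over all encoder states.
import numpy as np

def attention_read(decoder_state, enc_states):
    """decoder_state: (d,); enc_states: (src_len, d). Returns a context vector of shape (d,)."""
    scores = enc_states @ decoder_state        # relevance of each source position, (src_len,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over source positions
    return weights @ enc_states                # weighted sum of encoder states

rng = np.random.default_rng(0)
context = attention_read(rng.normal(size=16), rng.normal(size=(9, 16)))
print(context.shape)  # (16,)
```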

However, even with attention, the encoder and decoder RNNs still run serially across time during training. GPUs thrive on large matrix multiplies that can be issued independently; recurrence forces micro-kernels that wait for the previous timestep to finish, which is the practical ‘GPU starvation’ story behind Section 1’s motivation.

Convolutions on sequences mitigated the issue by stacking layers so that wider contexts accumulated, but very long dependencies still required many layers or dilated patterns. The Transformer instead grants global mixing in a single self-attention sublayer, repeats that block L times, and trades extra FLOPs for a shorter path between any two positions.
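
A back-of-the-envelope sketch of that layer-count trade-off (illustrative numbers, not from the paper): how many convolutional layers of kernel width k are needed before one output position can see a whole length-n sequence, versus a single globally mixing attention sublayer.

```python
# Sketch only: rough layer counts for full coverage of a length-n sequence.
import math

def conv_layers_needed(n, k):
    """Stacked convolutions: the receptive field grows by k - 1 per layer."""
    return math.ceil((n - 1) / (k - 1))

def dilated_layers_needed(n, k):
    """Dilated convolutions with doubling dilation: the receptive field grows geometrically."""
    layers, field = 0, 1
    while field < n:
        field += (k - 1) * 2 ** layers
        layers += 1
    return layers

n, k = 1024, 3
print(conv_layers_needed(n, k))     # 512 plain convolutional layers
print(dilated_layers_needed(n, k))  # 10 dilated layers
print(1)                            # one self-attention sublayer already mixes globally
```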

Engineering-wise, think about batching: RNN training pushes B sequences of length T through a loop over time; Transformer training forms (B, T, d) tensors and runs a handful of huge matmuls plus a softmax. That is the shift behind the paper’s wall-clock claims on 2017 hardware.
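
A minimal NumPy sketch of that batched shape story (single head, no masking, illustrative sizes): the whole (B, T, d) batch goes through a few large matmuls and one softmax, with no loop over timesteps.

```python
# Sketch only: single-head, unmasked self-attention over a (B, T, d) batch.
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """x: (B, T, d). Every position attends to every other position in one shot."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv                            # three large matmuls
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(x.shape[-1])    # (B, T, T) pairwise comparisons
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)              # softmax over source positions
    return weights @ v                                          # one more matmul

rng = np.random.default_rng(0)
B, T, d = 4, 10, 16
out = self_attention(rng.normal(size=(B, T, d)),
                     *(0.1 * rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # (4, 10, 16)
```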

When you read Section 6, connect those training cost remarks back here: they are not theoretical niceties but data centre economics.