Sequential RNN bottleneck
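The hidden-state recurrence h_t = f(h_{t-1}, x_t) makes every timestep depend on the one before it, so the time dimension cannot be spread across SIMD-friendly kernels. Optimised implementations such as the cuDNN LSTM speed up the work inside and around each step, but the timestep ordering remains.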
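
To make the dependency concrete, here is a minimal sketch of the recurrence in plain Python. It assumes a scalar hidden state and f = tanh(w_h * h + w_x * x); the function name rnn_forward and the weights w_h and w_x are illustrative choices, not taken from any library.

```python
import math

def rnn_forward(xs, w_h, w_x, h0=0.0):
    """Scalar RNN sketch: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    h = h0
    hs = []
    for x in xs:
        # This step reads h from the previous iteration, so the loop
        # over time cannot be run concurrently across timesteps.
        h = math.tanh(w_h * h + w_x * x)
        hs.append(h)
    return hs

xs = [1.0, 0.5, -0.25]
hs = rnn_forward(xs, w_h=0.5, w_x=1.0)
```

Note that the terms w_x * x_t do not depend on h, so they could be computed for all timesteps at once; only the recurrent part has to stay a loop.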
Self-attention flips this: the logits for all pairs of positions are computed in parallel, and ordering constraints such as causality enter only through masking rules.
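
As a rough illustration, the sketch below builds the pairwise logits with a causal mask in plain Python. It assumes scaled dot-product logits (q . k / sqrt(d)); the function name causal_attention_logits is hypothetical, and the nested comprehension stands in for what would be a single matrix product on real hardware.

```python
import math

def causal_attention_logits(queries, keys):
    """Scaled dot-product logits with a causal mask.

    queries, keys: lists of T vectors (lists of floats), all of dimension d.
    Entry [i][j] is -inf when j > i, so position i cannot attend to the future.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    T = len(queries)
    # Each (i, j) entry depends only on the inputs, never on another entry,
    # so all T * T logits could be computed at the same time.
    return [
        [
            scale * sum(q * k for q, k in zip(queries[i], keys[j]))
            if j <= i else -math.inf
            for j in range(T)
        ]
        for i in range(T)
    ]

vectors = [[1.0, 0.0], [0.0, 1.0]]
logits = causal_attention_logits(vectors, vectors)
```

Unlike the recurrence, nothing here carries state from one position to the next; the mask is the only place where sequence order appears.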