Residual pathways

Residuals implement x_{l+1}=x_l+Sublayer(Norm(x_l)) ensuring gradients propagate even when nonlinearities saturate.

They parallel CNN ResNet lore but interplay uniquely with softmax attention stability.