Residual pathways
Residual connections implement the pre-norm update x_{l+1} = x_l + Sublayer(Norm(x_l)), ensuring gradients can flow through the identity path even when the sublayer's nonlinearities saturate.
They mirror the skip connections familiar from CNN ResNets, but in Transformers they also interact distinctively with the numerical stability of softmax attention.
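The pre-norm residual update can be sketched minimally in NumPy. This is an illustrative assumption, not a specific library's implementation: the sublayer here is a hypothetical tanh projection, and the norm omits learned scale/shift parameters for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each feature vector to zero mean, unit variance
    # (learned gain/bias omitted for simplicity).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    # Pre-norm residual: x_{l+1} = x_l + Sublayer(Norm(x_l)).
    # The identity term x preserves a direct path for gradients
    # even when the sublayer's nonlinearity saturates.
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(8, 8))  # hypothetical sublayer weights
x = rng.normal(size=(4, 8))

out = residual_block(x, lambda h: np.tanh(h @ W))
print(out.shape)  # (4, 8)
```

Note that if the sublayer output collapses to zero (e.g. deep in a saturated regime), the block degenerates to the identity, which is exactly the property that keeps gradients alive.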