Layer normalisation placement
"Pre-LN" applies normalisation to the input of each sublayer, inside the residual branch, which improves optimisation in deep transformers.
It keeps logits well scaled, so attention softmaxes remain informative even in very deep stacks.
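
To make the placement concrete, here is a minimal sketch in plain Python that contrasts Pre-LN with the alternative usually called Post-LN. It is an illustration under stated assumptions, not a reference implementation: the layer norm has no learned gain or bias, and toy_sublayer is a hypothetical stand-in for an attention or feed-forward sublayer.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and (near) unit variance.
    Simplification: no learned gain or bias."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_ln_block(x, sublayer):
    """Pre-LN: normalise the sublayer input; the residual path is left untouched."""
    y = sublayer(layer_norm(x))
    return [a + b for a, b in zip(x, y)]

def post_ln_block(x, sublayer):
    """Post-LN: add the residual first, then normalise the sum."""
    y = sublayer(x)
    return layer_norm([a + b for a, b in zip(x, y)])

def toy_sublayer(x):
    # Hypothetical stand-in for attention or an MLP: fixed elementwise scaling.
    return [0.5 * v for v in x]

x = [1.0, 2.0, 3.0, 4.0]
pre_out = pre_ln_block(x, toy_sublayer)    # x + sublayer(LN(x))
post_out = post_ln_block(x, toy_sublayer)  # LN(x + sublayer(x))
```

In the Pre-LN block the input x reaches the output through an unnormalised identity path, while the sublayer always sees a normalised input. In the Post-LN block the output itself is renormalised, so there is no unnormalised path from input to output.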