Layer normalisation placement

"Pre-LN" applies normalisation prior to residual sublayers improving optimisation in deep transformers.

It keeps activations and attention logits well scaled, so the attention softmaxes remain informative even many layers deep.
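
A minimal sketch of the placement, assuming a PyTorch setting; the class name PreLNBlock and its hyperparameters are illustrative choices, not taken from any particular codebase.

```python
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    """Transformer block with Pre-LN placement: normalise, then sublayer, then residual add."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-LN: the norm sits *inside* the residual branch, before the sublayer.
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.ff(self.ln2(x))
        return x
```

For contrast, a Post-LN block would normalise after the residual addition instead, e.g. `x = ln(x + sublayer(x))`, which tends to need more careful warm-up in very deep stacks.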