Scaled dot-product attention

The core operator is softmax((Q K^⊤)/√d_k) V, where d_k is the dimension of the queries and keys. For large d_k the dot products grow in magnitude (their variance scales with d_k), so without the 1/√d_k factor the softmax saturates into extremely peaked distributions with vanishingly small gradients; the scaling keeps the logits in a well-conditioned range.

It matches the expressiveness of additive attention in practice while mapping cleanly onto the dense matrix multiplications that modern hardware accelerates.
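The formula above can be sketched directly in NumPy. This is a minimal, unbatched illustration (no masking, no multi-head projection); the function name and shapes are chosen for clarity, not taken from any particular library:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v) -> (n_q, d_v)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (n_q, n_k) scaled logits
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                          # weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 5))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (2, 5): one d_v-dimensional output per query
```

Note that the whole operation is two matrix multiplications plus an elementwise softmax, which is exactly the structure that GPUs and other accelerators execute efficiently.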