Scaled dot-product attention
The core operator is softmax((Q K^⊤)/√(d_k)) V, where Q, K, and V are the query, key, and value matrices and d_k is the key dimension. Without the 1/√(d_k) factor, the dot products grow in magnitude with d_k, pushing the softmax into extremely peaked regions where its gradients nearly vanish; the scaling keeps the score distribution well-conditioned.
It serves the same role as additive attention while mapping cleanly onto the dense matrix multiplications that hardware accelerates well.
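
As a rough illustration, here is a minimal NumPy sketch of the operator above; the function name, tensor shapes, and random toy inputs are assumptions for the example, not part of the original text.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    """
    d_k = Q.shape[-1]
    # Similarity scores, scaled so their variance stays roughly independent of d_k.
    scores = Q @ K.T / np.sqrt(d_k)
    # Row-wise softmax (subtract the row max for numerical stability).
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output row is a weighted sum of value vectors.
    return weights @ V

# Toy usage: 4 queries attending over 6 keys/values of width 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 8))
out = scaled_dot_product_attention(Q, K, V)  # shape (4, 8)
```

Note that the whole computation reduces to two matrix products and an element-wise softmax, which is what makes it easy to batch and accelerate.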