Multi-head attention
Each head learns its own low-rank attention subspace. Concatenating the heads lets a single layer represent syntactic cues, topical overlap, and rare-token alignments concurrently.
Typical Transformer configurations split the model width across many heads, balancing expressivity against parameter cost.
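As a minimal sketch of this idea, the NumPy snippet below projects the input into per-head query/key/value subspaces of size d_model / n_heads, runs scaled dot-product attention independently per head, and concatenates the results before an output projection. The function name, weight shapes, and toy dimensions (d_model = 64, 8 heads) are illustrative assumptions, not taken from the text.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """x: (seq_len, d_model); w_q, w_k, w_v, w_o: (d_model, d_model). Illustrative sketch."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads  # each head operates in a low-rank subspace

    # Project, then split the model dimension into n_heads subspaces.
    def split(t):  # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq, seq)
    weights = softmax(scores, axis=-1)
    per_head = weights @ v                                 # (n_heads, seq, d_head)

    # Concatenate the heads and mix them with the output projection.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

# Toy usage (hypothetical sizes): d_model = 64 split across 8 heads of size 8.
rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 64, 8, 10
w_q, w_k, w_v, w_o = (rng.normal(scale=d_model**-0.5, size=(d_model, d_model)) for _ in range(4))
x = rng.normal(size=(seq_len, d_model))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads)
print(out.shape)  # (10, 64)
```

Note how the parameter cost of the projections stays the same regardless of n_heads; what changes is how many independent attention patterns the layer can form at once, which is the expressivity-versus-width trade-off mentioned above.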