Multi-head attention
Each head learns its own low-rank attention subspace. Concatenating the heads lets a single layer represent syntactic cues, topical overlap, and rare-token alignments concurrently.
Typical Transformer configurations split the model width across many heads, balancing expressivity against parameter cost.
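As a minimal sketch of this idea, the NumPy snippet below projects the input into per-head query/key/value subspaces of size d_model / n_heads, runs scaled dot-product attention independently per head, and concatenates the results before an output projection. The function name, weight shapes, and toy dimensions (d_model = 64, 8 heads) are illustrative assumptions, not taken from the text.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """x: (seq_len, d_model); w_q, w_k, w_v, w_o: (d_model, d_model). Illustrative sketch."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads  # each head operates in a low-rank subspace

    # Project, then split the model dimension into n_heads subspaces.
    def split(t):  # (seq_len, d_model) -> (n_heads, seq_len, d_head)
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)

    # Scaled dot-product attention, computed independently per head.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, seq, seq)
    weights = softmax(scores, axis=-1)
    per_head = weights @ v                                 # (n_heads, seq, d_head)

    # Concatenate the heads and mix them with the output projection.
    concat = per_head.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ w_o

# Toy usage (hypothetical sizes): d_model = 64 split across 8 heads of size 8.
rng = np.random.default_rng(0)
d_model, n_heads, seq_len = 64, 8, 10
w_q, w_k, w_v, w_o = (rng.normal(scale=d_model**-0.5, size=(d_model, d_model)) for _ in range(4))
x = rng.normal(size=(seq_len, d_model))
out = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads)
print(out.shape)  # (10, 64)
```

Note how the parameter cost of the projections stays the same regardless of n_heads; what changes is how many independent attention patterns the layer can form at once, which is the expressivity-versus-width trade-off mentioned above.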