Topic 8

Subspaces per head diversify relational patterns learnt jointly

Each head executes its own softmax attention with thinner projections; concatenation restores width so feed-forward layers can mix features—Section 3.2.2 documents the reshape trick.
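 
A minimal PyTorch sketch of that reshape trick, assuming the usual split-then-merge convention; the class and variable names are illustrative, not the paper's code:

    import torch
    import torch.nn as nn

    class MultiHeadSelfAttention(nn.Module):
        def __init__(self, d_model: int, n_heads: int):
            super().__init__()
            assert d_model % n_heads == 0      # balanced split: d_k = d_model / h
            self.d_k = d_model // n_heads
            self.n_heads = n_heads
            self.w_q = nn.Linear(d_model, d_model, bias=False)
            self.w_k = nn.Linear(d_model, d_model, bias=False)
            self.w_v = nn.Linear(d_model, d_model, bias=False)
            self.w_o = nn.Linear(d_model, d_model, bias=False)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            b, n, d = x.shape
            # Project at full width, then reshape into h thinner heads:
            # (b, n, d_model) -> (b, h, n, d_k)
            def split(t):
                return t.view(b, n, self.n_heads, self.d_k).transpose(1, 2)
            q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
            # Each head runs its own scaled dot-product softmax attention.
            scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # (b, h, n, n)
            out = scores.softmax(dim=-1) @ v                     # (b, h, n, d_k)
            # Concatenate heads back to full width, then mix with W_O.
            out = out.transpose(1, 2).reshape(b, n, d)
            return self.w_o(out)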

Math & statistics used here

  • Head split amounts to block-diagonal-ish linear projections: the per-head outputs are concatenated, and W_O is one more matmul that remixes the stacked representation.
  • Per-head dimension d_k = d_model/h keeps the total parameter budget comparable while narrowing the dot-product spectra (see the sketch after this list).
  • Outer-product/low-rank view: each head's value-then-output path has rank at most d_v, so a head acts as a selective bilinear mixer before concatenation restores full width.
  • Jacobian structure: concatenating heads creates parallel gradient paths, analogous to grouped convolution branches.
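 
As referenced above, a small sketch of the block view and the parameter budget, assuming the common convention that per-head projections are column blocks of one full-width matrix; d_model = 512 and h = 8 are chosen for illustration:

    import torch

    d_model, h = 512, 8
    d_k = d_model // h                             # 64 per head when balanced

    # One full-width projection...
    w_q_full = torch.randn(d_model, d_model)
    # ...read as h per-head projections via its column blocks:
    w_q_heads = w_q_full.view(d_model, h, d_k)     # (d_model, h, d_k)

    x = torch.randn(3, d_model)                    # 3 token vectors
    q_full = x @ w_q_full                          # (3, d_model)
    q_heads = torch.einsum('td,dhk->thk', x, w_q_heads)   # (3, h, d_k)
    assert torch.allclose(q_full.view(3, h, d_k), q_heads, atol=1e-5)

    # The parameter budget therefore does not depend on h:
    params_attn = 4 * d_model * d_model            # W_Q, W_K, W_V, W_O
    print(params_attn)                             # 1_048_576 whether h = 1 or h = 8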

Checklist you can map to code

  • Attention heads specialise in syntax, lexical repetition, positional bias, and pronoun linkage; this is empirically not guaranteed but frequently observed.
  • Splitting the projections reduces pairwise dot-product saturation because each head operates in a narrower d_k-dimensional subspace.
  • Concatenation followed by W_O restores expressive mixing before the FFN nonlinearities widen the representation again.
  • Head dropout (Section 6) decorrelates reliance on any particular head (a minimal sketch follows this list).
  • Runtime trades the number of heads h against memory bandwidth; the same GEMM tiling discussion as for grouped convolutions surfaces here.
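 
The head-dropout sketch promised above. This assumes one common variant that zeroes whole heads during training and rescales the survivors (inverted dropout); the exact scheme in Section 6 may differ:

    import torch

    def drop_heads(per_head_out: torch.Tensor, p: float, training: bool) -> torch.Tensor:
        """Zero whole heads with probability p, rescaling the survivors.

        per_head_out: (batch, heads, seq, d_k), i.e. each head's output before
        concatenation and W_O.
        """
        if not training or p == 0.0:
            return per_head_out
        b, h = per_head_out.shape[:2]
        keep = (torch.rand(b, h, 1, 1, device=per_head_out.device) > p)
        return per_head_out * keep.to(per_head_out.dtype) / (1.0 - p)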

Single-head scaled dot-product attention can represent many relations, yet models empirically benefit from mixtures reminiscent of ensembled attention or mixture-of-experts; multi-head attention parallels that intuition by learning complementary subspaces glued together by concatenation and W_O.

Dimensional analysis: splitting d_model across h heads yields d_k = d_model/h when balanced; the narrower subspaces reduce the dominance of any single pairwise dot product while keeping parameter counts similar, because the separate per-head W projections together match one full-width projection.
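 
Worked numbers, taking the base configuration of the original Transformer as an example: d_model = 512 and h = 8 give d_k = d_v = 512/8 = 64, and the four projections W_Q, W_K, W_V, W_O still total 4 · 512 · 512 ≈ 1.05 M parameters whether h is 1 or 8, because the per-head matrices are slices of the same full-width weights.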

Some heads correlate with constituency-like spans; others mimic positional offsets. Probing papers formalise these behaviours beyond anecdotal heatmaps, but from an engineering vantage, treating heads as latent feature detectors suffices for implementing models.

Ablation takeaway: shrinking the head count often hurts the final perplexity plateau more than mildly shrinking depth, because heads diversify gradient routing early in training, when the optimisation landscape is roughest.

Downstream quantization sometimes couples heads asymmetrically; knowledge-distillation folklore emphasises aligning student heads indirectly via teacher logits rather than brute-force per-head mimicry.

When profiling memory, note that each head materialises its own attention map of size O(n²) unless you checkpoint or slice across sequence blocks; hardware-aware training libraries chunk flash-attention kernels per head batch.
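 
A back-of-the-envelope sketch of that per-head footprint; the configuration numbers below are illustrative assumptions, and kernels in the flash-attention family avoid the cost by never materialising the full map:

    def attn_map_bytes(batch: int, heads: int, seq_len: int, bytes_per_el: int = 2) -> int:
        """Bytes needed to materialise the full softmax attention maps for one layer."""
        return batch * heads * seq_len * seq_len * bytes_per_el

    # e.g. batch 8, 8 heads, 4096 tokens, fp16 activations:
    print(attn_map_bytes(8, 8, 4096) / 2**30, "GiB")   # 2.0 GiB per layer if materialised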