Topic 12

O(n²) attention maps, approximation research, state-space resurgence, multimodal routing

The paper acknowledges the trade-off between parameter counts and efficiency; fifteen years on, the bottleneck is attention-map materialisation over long sequences and extra modalities, an engineering reality when shipping chat models.

Math & statistics used here

  • Full pairwise attention materialises a T×T score matrix, i.e. Θ(T²) memory and time for dense maps before kernel tricks (see the dense-attention sketch after this list).
  • Big-O is asymptotic; constant factors matter, so fused FlashAttention kernels can beat naive quadratic implementations at moderate T.
  • Approximate attention exploits low-rank or sparse structure: structured matrix families instead of one dense GEMM.
  • Multimodal fusion is still tensor plumbing: patchify video frames or mel bins, then apply the identical dot-product recipe.
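A minimal sketch of the dense path described in the first bullet, in plain NumPy (names and shapes are illustrative, not from the paper): the scores array is the T×T map whose memory grows quadratically with sequence length.

    import numpy as np

    def dense_attention(Q, K, V):
        """Single-head scaled dot-product attention that materialises the full T x T score map."""
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                    # (T, T) -- the quadratic object
        scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
        return weights @ V                               # (T, d)

    T, d = 4096, 64
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
    out = dense_attention(Q, K, V)
    # scores alone: T*T*4 bytes in float32, roughly 64 MiB per head per layer at T=4096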

Checklist you can map to code

  • Attention-matrix materialisation dominates memory, not just FLOPs, motivating block-sparse kernels and FlashAttention tiling.
  • Linear-attention approximants trade expressivity for subquadratic cost by reassociating the (QKᵀ)V product; quality varies (see the linear-attention sketch after this list).
  • State-space and selective-SSM hybrids (periodic resurgence circa 2023–2024) reinterpret recurrence in hardware-friendly ways (a toy recurrence sketch also follows the list).
  • Multimodal Transformers concatenate modalities plus modality-specific embeddings; positional encodings generalise to irregularly sampled streams.
  • Production stacks mix retrieval, quantization, and speculative decoding, all orthogonal layers wrapping the same core GEMMs.
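A minimal non-causal linear-attention sketch under the common elu+1 feature-map assumption (not from the paper): reassociating (QKᵀ)V as Q(KᵀV) keeps cost linear in T, at the price of approximation quality.

    import numpy as np

    def feature_map(x):
        # elu(x) + 1, a positive feature map commonly used in linear-attention work
        return np.where(x > 0, x + 1.0, np.exp(x))

    def linear_attention(Q, K, V):
        """Non-causal linear attention: no T x T matrix is ever formed."""
        Qf, Kf = feature_map(Q), feature_map(K)   # (T, d)
        kv = Kf.T @ V                             # (d, d) summary of keys and values
        z = Kf.sum(axis=0)                        # (d,) normaliser
        return (Qf @ kv) / (Qf @ z)[:, None]      # (T, d), linear in T

    T, d = 4096, 64
    rng = np.random.default_rng(1)
    Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))
    out = linear_attention(Q, K, V)   # memory stays O(T*d + d*d)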
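And a toy diagonal state-space recurrence in the spirit of the SSM bullet (the coefficients are arbitrary placeholders, not any published parameterisation): state size is O(d) regardless of context length, which is the appeal for ultra-long sequences.

    import numpy as np

    def diagonal_ssm(x, a, b, c):
        """y_t = c * h_t with h_t = a * h_{t-1} + b * x_t, per channel (diagonal A)."""
        T, d = x.shape
        h = np.zeros(d)
        y = np.empty_like(x)
        for t in range(T):            # constant memory in T, unlike dense attention
            h = a * h + b * x[t]
            y[t] = c * h
        return y

    T, d = 4096, 64
    rng = np.random.default_rng(2)
    x = rng.standard_normal((T, d))
    a = np.full(d, 0.9)               # placeholder per-channel decay
    b = np.ones(d)
    c = np.ones(d)
    y = diagonal_ssm(x, a, b, c)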

Section 6's training notes hinted at computational constraints; production today measures how many KV-cache activations fit in HBM, not parameter counts alone. Serving frameworks page blocks of KV cache across GPU memory tiers or offload them to CPU, an engineering discipline absent from foundational papers.
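A back-of-envelope KV-cache sizing sketch; the model dimensions below are hypothetical placeholders rather than any particular deployed model.

    def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
        """Bytes needed to hold keys and values for every layer at a given context length."""
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

    # Hypothetical 32-layer model with 8 KV heads of dim 128, fp16 cache, 128k context:
    per_seq = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=128_000, batch=1)
    print(per_seq / 2**30, "GiB per sequence")   # ~16 GiB, so HBM, not parameters, caps batch size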

Research on sparse mixtures of experts adds routing overhead while keeping the same dot-product primitives, another angle on scaling that is orthogonal to asymptotic-notation simplifications.
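A minimal top-k router sketch to make the routing-overhead point concrete (gate weights and expert matrices are random placeholders): each token still ends in ordinary GEMMs, the router just decides which ones.

    import numpy as np

    def top_k_route(x, W_gate, experts, k=2):
        """Route each token to its k highest-scoring experts and mix their outputs."""
        logits = x @ W_gate                                      # (T, n_experts)
        top = np.argpartition(-logits, k - 1, axis=-1)[:, :k]    # per-token expert choices
        out = np.zeros_like(x)
        for t in range(x.shape[0]):
            sel = logits[t, top[t]]
            gates = np.exp(sel - sel.max()); gates /= gates.sum()   # softmax over chosen experts
            for g, e in zip(gates, top[t]):
                out[t] += g * (x[t] @ experts[e])                # each expert is just another GEMM
        return out

    T, d, n_experts = 16, 64, 8
    rng = np.random.default_rng(3)
    x = rng.standard_normal((T, d))
    W_gate = rng.standard_normal((d, n_experts))
    experts = rng.standard_normal((n_experts, d, d))
    y = top_k_route(x, W_gate, experts)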

Audio and vision adapters inject patch embeddings and then run the identical attention, evidence that the paper's skeletal description generalised broadly even though modality-specific preprocessing diverges.
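A sketch of that multimodal plumbing (patch size, embedding width, and the projections are placeholder assumptions): image patches and text tokens end up as one token matrix fed through the same attention stack as above.

    import numpy as np

    rng = np.random.default_rng(4)
    d_model = 64

    # Text side: token ids -> embedding lookup (random table as a stand-in).
    vocab = rng.standard_normal((1000, d_model))
    text_ids = np.array([5, 42, 7, 99])
    text_tok = vocab[text_ids]                                # (T_text, d_model)

    # Vision side: patchify a 32x32 image into 8x8 patches, then project linearly.
    image = rng.standard_normal((32, 32))
    patches = image.reshape(4, 8, 4, 8).transpose(0, 2, 1, 3).reshape(16, 64)
    W_patch = rng.standard_normal((64, d_model))
    img_tok = patches @ W_patch                               # (T_img, d_model)

    # Modality-specific embeddings, then one concatenated sequence for the usual attention.
    modality = rng.standard_normal((2, d_model))
    tokens = np.concatenate([text_tok + modality[0], img_tok + modality[1]])   # (T_text+T_img, d_model)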

Some teams interleave small recurrent layers with attention for ultra-long contexts; others chunk contexts and lean on retrieval augmentation. Both hybrid philosophies inherit the trade-offs enumerated here.

Ethical/production angle: quadratic attention incentivises shorter prompts in cost-sensitive SaaS, so pricing models connect to mathematics students rarely consider.
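A quick arithmetic sketch of that pricing point, counting only the QKᵀ score multiply-accumulates per head per layer (constants and all other terms omitted): doubling the prompt roughly quadruples the score work at prefill.

    def score_flops(T, d=128):
        # Multiply-accumulates for Q K^T alone: T * T * d per head per layer.
        return T * T * d

    for T in (2_000, 4_000, 8_000):
        print(T, score_flops(T) / score_flops(2_000))   # 1.0, 4.0, 16.0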

When reading follow-up summaries, classify proposals by whether they change computational-graph depth, approximate the pairwise interactions, or change the data pipeline; this helps cut through hype.