Efficient attention lineage
Linear attention variants approximate the softmax kernel with feature maps, replacing the quadratic attention matrix with a linear-time factorization, while sparse patterns restrict each token to its nearest neighbours via sliding windows or mixtures of local and global attention (a minimal sketch of both follows).
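To make the two families concrete, here is a minimal NumPy sketch, not a production implementation: a kernelized linear attention using the phi(Q) @ (phi(K)^T V) factorization with ELU+1 as the feature map (a common choice in this literature), and a banded softmax attention of the kind sliding-window variants use. The names `feature_map`, `linear_attention`, `sliding_window_attention`, and the window size `w` are illustrative assumptions, not from any particular library.

```python
import numpy as np

def feature_map(x):
    # ELU(x) + 1 keeps features positive, so the factorized attention
    # weights stay non-negative (a common choice in linear attention work).
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Kernelized attention in O(n * d^2): phi(Q) @ (phi(K)^T V),
    never materializing the O(n^2) matrix softmax(Q K^T)."""
    Qf, Kf = feature_map(Q), feature_map(K)   # (n, d) each
    KV = Kf.T @ V                             # (d, d), shared by all queries
    Z = Qf @ Kf.sum(axis=0)                   # (n,) per-query normalizer
    return (Qf @ KV) / Z[:, None]

def sliding_window_attention(Q, K, V, w=2):
    """Sparse softmax attention where token i only attends to tokens j
    with |i - j| <= w (a banded / sliding-window pattern)."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)             # full (n, n) scores
    idx = np.arange(n)
    band = np.abs(idx[:, None] - idx[None, :]) <= w
    scores = np.where(band, scores, -np.inf)  # mask out-of-window pairs
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (weights / weights.sum(axis=1, keepdims=True)) @ V

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8, 4))
print(linear_attention(Q, K, V).shape)           # (8, 4)
print(sliding_window_attention(Q, K, V).shape)   # (8, 4)
```

The design difference is visible in the shapes: linear attention reduces keys and values to a d-by-d summary reused by every query, while the sliding-window variant keeps exact softmax weights but zeroes out all pairs beyond the band.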
Staying on top of this literature informs the trade-off between long-context adapters and brute-force engineering of FlashAttention kernels.