Self-attention intuition

Projections derive QKV triples via learned linear maps. Compatibility scores dictate how much mass each value contributes relative to neighbours.

Stacking heads and layers repeats this mixing so higher units capture progressively abstract relations.