Topic 6

From alignment weights to expected-value readouts of encoder memory

Before self-attention became the whole backbone, additive attention pooled encoder states conditioned on the decoder hidden vector; Bahdanau-style alignment is the pedagogically simplest form of this softmax weighting.

Math & statistics used here

  • The softmax map s ↦ exp(s − max s)/∑ exp(s − max s) normalises logits into a convex combination; the weights live on the probability simplex.
  • The weighted sum ∑_j α_j v_j is the expectation E[V] under the discrete distribution α, and a linear operator in V once α is fixed.
  • The softmax Jacobian diag(α) − αα^T shrinks as the distribution saturates: when logits blow up, the weights collapse toward one-hot and gradients vanish; temperature rescales logits to control this.
  • Dot-product scoring q·k = ‖q‖‖k‖ cos θ ties the score to Euclidean geometry: normalise q and k first and it reduces to cosine similarity, which ignores norms. (All four pieces are sketched in code after this list.)
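
A minimal NumPy sketch of these four pieces under illustrative shapes and values; the names stable_softmax and temperature are my own, not from the original:

  import numpy as np

  def stable_softmax(scores, temperature=1.0):
      # Subtracting the row max before exponentiating is the standard
      # numerical-stability trick; temperature rescales the logits first.
      z = scores / temperature
      z = z - z.max(axis=-1, keepdims=True)
      w = np.exp(z)
      return w / w.sum(axis=-1, keepdims=True)

  scores = np.array([2.0, 0.5, -1.0])            # one query scored against three keys
  values = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [1.0, 1.0]])                # one value vector per key
  alpha = stable_softmax(scores)                 # weights on the probability simplex
  context = alpha @ values                       # weighted sum = E[V] under alpha

  # Softmax Jacobian diag(alpha) - alpha alpha^T: it shrinks toward zero as
  # alpha saturates toward one-hot, which is how gradients vanish.
  jacobian = np.diag(alpha) - np.outer(alpha, alpha)
  print(alpha.round(3), context.round(3), jacobian.round(3))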

Checklist you can map to code

  • Alignment scores decide how strongly each encoder position participates in updating the decoder context (a full decoder step is sketched in code after this list).
  • Softmax converts scores into stochastic weights even though no sampling occurs—deterministic expectation instead.
  • Attention lets models copy rare proper nouns from source to target—a behaviour MT engineers prized.
  • Generalising attention to triples Q/K/V abstracts ‘who asks, what is indexed, what content is fetched’.
  • Dropout on attention weights (mentioned in Section 6) behaves like stochastic sparsification during training.
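
A single decoder step with additive (Bahdanau-style) scoring, written to mirror the checklist; the parameter names W_enc, W_dec, v and the drop rate are illustrative, and the weights are random rather than trained:

  import numpy as np

  rng = np.random.default_rng(0)
  T_src, d, d_att = 5, 8, 16
  enc_states = rng.normal(size=(T_src, d))     # encoder memory, one state per source position
  dec_state = rng.normal(size=(d,))            # current decoder hidden vector

  # Additive scoring: a small MLP energy per encoder position.
  W_enc = rng.normal(size=(d, d_att))
  W_dec = rng.normal(size=(d, d_att))
  v = rng.normal(size=(d_att,))
  scores = np.tanh(enc_states @ W_enc + dec_state @ W_dec) @ v   # (T_src,)

  # Softmax turns scores into stochastic weights; no sampling happens.
  alpha = np.exp(scores - scores.max())
  alpha /= alpha.sum()

  # Dropout on attention weights = stochastic sparsification during training.
  keep = (rng.random(T_src) > 0.1).astype(float)   # illustrative drop rate 0.1
  alpha_train = alpha * keep / 0.9                  # inverted-dropout rescaling

  context = alpha_train @ enc_states               # expected-value readout of encoder memory
  print(alpha.round(3), context.shape)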

Read this block as statistical machinery spelled out numerically: every softmax row allocates probability mass across keys, turning values into convex combinations—you can sanity-check logits and entropies the same way you debug classifiers.

Early seq2seq + attention fused two intuitions: compress the source into a latent tape, then let the decoder rewind that tape selectively. Scoring each encoder timestep against the current decoder context produced attention weights, which you can visualise heat-mapped across German–English sentence pairs.
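
A quick way to eyeball those alignments, assuming you already have a row-stochastic attention matrix of shape (target length, source length); the tokens and the matrix below are made up:

  import numpy as np
  import matplotlib.pyplot as plt

  src_tokens = ["Das", "Haus", "ist", "blau"]            # illustrative source sentence
  tgt_tokens = ["The", "house", "is", "blue"]            # illustrative target sentence
  attn = np.random.dirichlet(np.ones(len(src_tokens)), size=len(tgt_tokens))

  fig, ax = plt.subplots()
  ax.imshow(attn, cmap="viridis", aspect="auto")         # one row per target token
  ax.set_xticks(range(len(src_tokens)))
  ax.set_xticklabels(src_tokens)
  ax.set_yticks(range(len(tgt_tokens)))
  ax.set_yticklabels(tgt_tokens)
  ax.set_xlabel("source position")
  ax.set_ylabel("target position")
  plt.show()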

Mathematically, nothing forbids swapping the scoring function: additive MLP energies versus dot products. Dot products simply map to efficient matmuls on accelerators once queries and keys live in a shared embedding space of dimension d_k, with scores usually scaled by 1/√d_k so the logits stay in a softmax-friendly range.
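
The dot-product variant as two plain matmuls, with the usual 1/√d_k scaling; shapes are illustrative:

  import numpy as np

  rng = np.random.default_rng(1)
  T_q, T_k, d_k, d_v = 3, 5, 8, 8
  Q = rng.normal(size=(T_q, d_k))              # queries
  K = rng.normal(size=(T_k, d_k))              # keys, same embedding dimension d_k
  V = rng.normal(size=(T_k, d_v))              # values

  scores = Q @ K.T / np.sqrt(d_k)              # one matmul gives every query-key energy
  weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
  weights /= weights.sum(axis=-1, keepdims=True)
  out = weights @ V                            # second matmul reads the values out
  print(out.shape)                             # (T_q, d_v)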

Framing softmax outputs as categorical distributions reinforces statistical thinking: gradients push probability mass toward the positions the loss rewards, loosely analogous to expectation-maximisation intuitions, although everything is learned end to end.
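
One concrete instance of that claim: with a cross-entropy loss on the categorical softmax output, the gradient with respect to the logits is α − one_hot(target), so each step moves mass toward the position the loss favours. A finite-difference check (arbitrary numbers) confirms the closed form:

  import numpy as np

  logits = np.array([1.0, -0.5, 2.0, 0.3])
  target = 1                                    # the position the loss wants mass on

  def loss(s):
      a = np.exp(s - s.max()); a /= a.sum()
      return -np.log(a[target])                 # cross-entropy on the categorical output

  alpha = np.exp(logits - logits.max()); alpha /= alpha.sum()
  analytic = alpha.copy()
  analytic[target] -= 1.0                       # alpha - one_hot(target)

  eps = 1e-6
  numeric = np.array([(loss(logits + eps * np.eye(4)[j]) -
                       loss(logits - eps * np.eye(4)[j])) / (2 * eps) for j in range(4)])
  print(np.allclose(analytic, numeric, atol=1e-5))   # True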

Self-attention extends the same pooling pattern but lets every position attend to every other position in parallel—changing who issues queries and keys from encoder–decoder asymmetry to per-token symmetry until masks impose directionality.
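
The same pooling machinery with queries, keys, and values all derived from one token matrix; a causal mask re-imposes directionality. A minimal sketch with random projections:

  import numpy as np

  rng = np.random.default_rng(2)
  T, d = 6, 8
  X = rng.normal(size=(T, d))                  # one sequence of token representations

  # Every token issues a query and offers a key and a value.
  Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
  Q, K, V = X @ Wq, X @ Wk, X @ Wv

  scores = Q @ K.T / np.sqrt(d)                # (T, T): every position against every other
  future = np.triu(np.ones((T, T), dtype=bool), k=1)
  scores = np.where(future, -np.inf, scores)   # causal mask: no attending to later positions

  weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
  weights /= weights.sum(axis=-1, keepdims=True)
  out = weights @ V
  print(out.shape)                             # (T, d)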

Engineering notebooks often print attention entropy; collapsed one-hot-ish maps may signal poor initialisation whereas diffuse maps imply uncertain alignment—still useful telemetry when aligning model behaviour with bilingual dictionaries.
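
A small helper for that telemetry, assuming attn holds row-stochastic attention maps; entropy near 0 means a collapsed, near one-hot row, entropy near log(source length) means a diffuse one:

  import numpy as np

  def attention_entropy(attn, eps=1e-12):
      # Per-row Shannon entropy (in nats) of a row-stochastic attention map.
      return -(attn * np.log(attn + eps)).sum(axis=-1)

  attn = np.array([[0.97, 0.01, 0.01, 0.01],   # collapsed, near one-hot
                   [0.25, 0.25, 0.25, 0.25]])  # maximally diffuse
  print(attention_entropy(attn).round(3))      # ~[0.168, 1.386]; log(4) ≈ 1.386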