Topic 9
Sinusoidal encodings versus learned embeddings and relative extensions
Section 3.5 injects deterministic sin/cos features, so even these fixed (non-learned) encodings extrapolate moderately to lengths unseen in training while giving every dimension its own phase signature.
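For reference, Section 3.5 defines PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)), so each pair of dimensions oscillates at its own wavelength.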
Math & statistics used here
- Additive PE is plain vector addition: PE(pos) ∈ ℝ^d is summed with the token embedding, ordinary ℝ^d arithmetic before attention ever sees X.
- sin/cos pairs create distinct phase signatures across dimensions; Fourier intuition helps—you mix frequencies into coordinates.
- Relative rotation ideas (RoPE later) are complex-number algebra in disguise; sin/cos are real/imag pairs.
- Layer norm after the addition rescales the combined signal, the same stabilisation story as elsewhere in deep nets.
Checklist you can map to code
- Additive encodings hinge on broadcasting across sequence positions with distinct frequency bands.
- Even dimensions use sine, odd dimensions use cosine; wavelengths form a geometric progression across the dimension index (see the sketch after this list).
- Learned embeddings are permissible and often used in GPT-style stacks; trade-offs hinge on extrapolation beyond training lengths.
- Relative positional schemes (later literature) refactor inductive biases but inherit motivation from Fourier-like mixing.
- Layer normalisation interacts with positional scale; initialisation plus the paper's √d_model embedding scaling keeps the combined embeddings within stable numeric bands.
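A minimal NumPy sketch of this checklist (assumes an even d_model; the name sinusoidal_pe is illustrative, not the paper's reference code):

```python
import numpy as np

def sinusoidal_pe(max_len: int, d_model: int) -> np.ndarray:
    """Section-3.5-style table of shape (max_len, d_model); d_model assumed even."""
    positions = np.arange(max_len)[:, None]               # (max_len, 1)
    pair_idx = np.arange(0, d_model, 2)[None, :]           # 2i for each sin/cos pair
    angles = positions / 10000.0 ** (pair_idx / d_model)   # geometric frequency progression
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                            # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                            # odd dimensions: cosine
    return pe

# Broadcasting across the batch: for x of shape (batch, seq_len, d_model) it is just
# x = x + sinusoidal_pe(seq_len, d_model)[None, :, :]
```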
Without positional info, self-attention is permutation-equivariant: permuting the input tokens simply permutes the outputs (dropout noise aside), so the model cannot perceive word order, which is fatal for NLP. Adding vectors p_i to the embeddings x_i stamps temporal coordinates into the embedding space before attention, which otherwise remains permutation-equivariant; the injected p_i offsets are the only thing breaking the symmetry.
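A quick numerical check of that claim, using a toy single-head attention with identity projections in NumPy (the name toy_attention and the shapes are illustrative):

```python
import numpy as np

def toy_attention(x: np.ndarray) -> np.ndarray:
    """Single-head self-attention with identity Q/K/V projections: softmax(x x^T / sqrt(d)) x."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))          # 5 tokens, d = 8
perm = rng.permutation(5)

# No positions: permuting the tokens merely permutes the output rows.
assert np.allclose(toy_attention(x[perm]), toy_attention(x)[perm])

# Additive positions p_i break the symmetry, so the same swap now changes the result.
pe = rng.normal(size=(5, 8))
assert not np.allclose(toy_attention(x[perm] + pe), toy_attention(x + pe)[perm])
```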
Sinusoidal design rationale: deterministic functions with a smooth frequency spectrum let the model learn attention patterns sensitive to displacement, because for any fixed offset k, PE(pos+k) is a linear function of PE(pos): each sin/cos pair is phase-shifted, i.e. rotated, by an angle that depends only on k. Later Rotary (RoPE) and ALiBi formalise sharper relative-position guarantees.
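That displacement property can be checked numerically for a single frequency band: advancing the position by k rotates the (sin, cos) pair by the fixed angle ω·k, regardless of pos (a minimal sketch; ω, pos, and k are arbitrary choices):

```python
import numpy as np

omega, pos, k = 1.0 / 10000 ** 0.25, 37.0, 5.0   # one frequency band, arbitrary position/offset

def pe_pair(p: float) -> np.ndarray:
    """(sin, cos) coordinates of this frequency band at position p."""
    return np.array([np.sin(omega * p), np.cos(omega * p)])

# A fixed 2x2 rotation (depending only on omega * k) maps PE(pos) to PE(pos + k) for every pos.
rot = np.array([[ np.cos(omega * k), np.sin(omega * k)],
                [-np.sin(omega * k), np.cos(omega * k)]])
assert np.allclose(rot @ pe_pair(pos), pe_pair(pos + k))
```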
Implementation detail: the sinusoids use an engineering constant (the 10,000 base in the exponent) that spreads wavelengths geometrically from 2π up to nearly 10,000·2π; sloppy copies from blogs often mishandle the dimensional pairing, so match the parity split exactly or debugging becomes agonising.
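A quick check of the resulting wavelength range, using the base model's d_model = 512:

```python
import numpy as np

d_model = 512
# Angular frequency of pair i is 1 / 10000^(2i / d_model), i = 0 .. d_model/2 - 1.
freqs = 1.0 / 10000.0 ** (np.arange(0, d_model, 2) / d_model)
wavelengths = 2 * np.pi / freqs
print(wavelengths[0], wavelengths[-1])   # ~6.28 (= 2*pi) up to just under 10000 * 2*pi
```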
Learnable positional embeddings treat positions as categorical ids analogous to tokens; they work well when the training distribution covers the target lengths but turn brittle when inference extends far beyond them. The paper chose sinusoidal encodings partly in the hope of gentler degradation on sequences longer than the WMT-ish lengths seen in training.
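A hedged PyTorch sketch of the learned alternative (the class name, max_len, and the GPT-style framing are illustrative, not the paper's recipe); the brittleness beyond max_len is explicit in the indexing:

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    """One trainable vector per absolute position index, added to the token embeddings."""
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); positions >= max_len have no learned vector,
        # so sequences longer than anything seen in training simply fail to index.
        positions = torch.arange(x.size(1), device=x.device)
        return x + self.pos_emb(positions)
```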
Downstream coder tip: positional dropout or jitter can regularise positional overfitting, much as augmentation does; it is rare but documented in wav2vec-style hybrids.
When you revisit the Section 6 training setup, note that positional-encoding dropout is subtly tied to the same augmentation philosophy: it prevents brittle reliance on any single frequency channel.
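For concreteness, the paper's regularisation applies dropout to the sum of the (√d_model-scaled) token embeddings and the positional encodings; a minimal sketch, with add_positions as an illustrative helper name:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.1)   # P_drop = 0.1 in the base Transformer configuration

def add_positions(tok_emb: torch.Tensor, pe: torch.Tensor, d_model: int) -> torch.Tensor:
    """Scale token embeddings by sqrt(d_model), add the positional encoding, dropout the sum."""
    return drop(tok_emb * d_model ** 0.5 + pe)
```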