Topic 3

Bag-of-words blind spots plus the first recurrence fix

Self-attention is permutation-equivariant without positional encodings; understanding earlier failures clarifies why order must be injected deliberately.

Math & statistics used here

  • Bag-of-words reduces a sentence to multiset counts; a linear map from counts to logits sits on top, and the whole pipeline is permutation-invariant by construction.
  • RNN recurrence h_t = tanh(W h_{t-1} + U x_t) is a nonlinear dynamical system; Jacobians ∂h_t/∂h_{t-1} multiply along time (see the sketch just after this list).
  • Self-attention without positions is permutation-equivariant: swapping rows of X leaves outputs permuted identically.
  • When you encode order with positions or recurrence, you are breaking symmetry with extra structure—not magic.
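
A minimal NumPy sketch of the Jacobian-product claim (hidden size, weight scales and the random inputs are arbitrary illustrative choices): iterate the tanh recurrence and track the spectral norm of ∂h_t/∂h_0, which typically shrinks or blows up as the one-step Jacobians multiply along time.

    import numpy as np

    rng = np.random.default_rng(0)
    d, T = 8, 30                                            # hidden size and sequence length, for illustration
    W = rng.normal(scale=0.5 / np.sqrt(d), size=(d, d))     # recurrent weights
    U = rng.normal(scale=1.0 / np.sqrt(d), size=(d, d))     # input weights
    x = rng.normal(size=(T, d))                             # made-up input sequence

    h = np.zeros(d)
    J = np.eye(d)                                           # running product of Jacobians: dh_t/dh_0
    for t in range(T):
        h = np.tanh(W @ h + U @ x[t])
        # one-step Jacobian dh_t/dh_{t-1} = diag(1 - h_t**2) @ W
        J = np.diag(1.0 - h**2) @ W @ J
        if (t + 1) % 10 == 0:
            print(f"t={t + 1:2d}   ||dh_t/dh_0||_2 = {np.linalg.norm(J, 2):.2e}")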

Checklist you can map to code

  • Bag-of-words destroys syntax: permutations become identical inputs unless you augment features.
  • N-grams add shallow locality but explode the vocabulary and still miss long-range contrasts (vocabulary growth is sketched after this list).
  • Recurrent nets maintain a compressed state trajectory h_t summarising prefixes—order enters through transition dynamics.
  • Hidden states trade interpretability for capacity: one vector must remember everything pertinent so far.
  • Teaching RNNs to copy rare tokens exposes how quickly hidden states saturate without attention.
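
A quick sketch of that vocabulary explosion (the toy corpus and whitespace tokenisation are made up; any real corpus shows the trend far more dramatically): count distinct n-gram types as n grows.

    from collections import Counter

    corpus = [
        "the cat chased the dog",
        "the dog chased the cat",
        "a cat sat on the mat",
        "the mat sat under a cat",
    ]

    def ngrams(tokens, n):
        """All contiguous n-grams of a token list."""
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    for n in (1, 2, 3):
        vocab = Counter()
        for sentence in corpus:
            vocab.update(ngrams(sentence.split(), n))
        print(f"{n}-gram feature types: {len(vocab)}")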

Classic text classifiers summed word features, which is blatantly harmful for translation: "the dog chased the cat" and "the cat chased the dog" reverse meaning yet yield the same multiset of words, and a translation has to preserve that distinction. Engineers partially remedied this with hand-crafted part-of-speech features or character convolutions, both signs that the model lacked flexible order-aware mixing.
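
A minimal sketch of that failure mode (the sentences and vocabulary are made up): two sentences with opposite meanings collapse to the same count vector, so any linear map from counts to logits must score them identically.

    from collections import Counter

    s1 = "the dog chased the cat".split()
    s2 = "the cat chased the dog".split()

    vocab = sorted(set(s1) | set(s2))

    def bow(tokens):
        """Bag-of-words count vector over a fixed vocabulary; all order information is gone."""
        counts = Counter(tokens)
        return [counts[w] for w in vocab]

    print(vocab)                 # ['cat', 'chased', 'dog', 'the']
    print(bow(s1), bow(s2))      # [1, 1, 1, 2] [1, 1, 1, 2]
    print(bow(s1) == bow(s2))    # True: permutation-invariance in action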

RNNs made order native by iterating: in theory they are universal approximators of dynamical systems; in practice, modest widths forget details. Attention emerged because even LSTMs strained on very long sentences whenever alignment had to bridge distant phrases.

From an implementation standpoint, looping over time steps is cumbersome when you batch heterogeneous lengths; pack/pad idioms dominate RNN training code. Transformers ditch the per-step recurrence loop in favour of mask-aware batched softmax, trading predictable serial structure for parallel global interactions.
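
A hedged PyTorch sketch of the two idioms side by side (shapes, lengths and random features are illustrative, and the attention path omits the learned projections a real layer would have): pad-and-pack feeds a GRU that still loops over time internally, while attention is one batched matmul plus a mask over padded positions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence

    # Three variable-length sequences of 4-dim features (illustrative shapes).
    seqs = [torch.randn(L, 4) for L in (5, 3, 2)]
    lengths = torch.tensor([5, 3, 2])

    # RNN idiom: pad to a rectangle, then pack so the GRU skips padding steps.
    padded = pad_sequence(seqs, batch_first=True)               # (3, 5, 4)
    packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=True)
    gru = nn.GRU(input_size=4, hidden_size=8, batch_first=True)
    _, h_n = gru(packed)                                        # the serial loop over time happens inside

    # Attention idiom: one batched matmul plus a mask over padded key positions.
    mask = torch.arange(5)[None, :] < lengths[:, None]          # (3, 5), True at real tokens
    q = k = v = padded                                          # unprojected self-attention, for brevity
    scores = q @ k.transpose(-2, -1) / 4 ** 0.5                 # (3, 5, 5)
    scores = scores.masked_fill(~mask[:, None, :], float("-inf"))
    attn = F.softmax(scores, dim=-1)                            # padded keys receive zero weight
    out = attn @ v                                              # (3, 5, 4), all time steps in parallel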

Read this alongside Section 5.4 on attention visualisations: positional encodings plus attention retrieve subject–verb chunks across moderate distances, behaviour that a bag-of-words model could never learn spontaneously.

The coding lesson: permutation equivariance is a symmetry you must break thoughtfully; recurrence broke it implicitly, positional encodings break it explicitly in the Transformer pipeline.
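
A closing NumPy sketch of that symmetry (unprojected single-head attention and a simplified sinusoidal layout, both illustrative): without positions, permuting the input rows just permutes the output rows; adding position vectors breaks the equivariance.

    import numpy as np

    def attention(X):
        """Single-head self-attention with no learned projections (a simplification)."""
        scores = X @ X.T / np.sqrt(X.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ X

    def positions(T, d):
        """Sinusoidal position vectors (simplified layout; the exact interleaving is irrelevant here)."""
        pos = np.arange(T)[:, None]
        i = np.arange(d // 2)[None, :]
        angles = pos / 10000 ** (2 * i / d)
        return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

    rng = np.random.default_rng(1)
    T, d = 6, 8
    X = rng.normal(size=(T, d))
    perm = np.array([2, 0, 5, 1, 4, 3])      # any non-identity permutation works

    # No positions: the layer commutes with the permutation (equivariance).
    print(np.allclose(attention(X)[perm], attention(X[perm])))             # True

    # With positions added, the same check fails: the symmetry has been broken.
    P = positions(T, d)
    print(np.allclose(attention(X + P)[perm], attention(X[perm] + P)))     # False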