Topic 2
Tokenisation, one-hot bottlenecks, and dense embeddings that feed every layer
Section 3.4’s treatment of input embeddings is tiny on the page but huge in systems: you must turn Unicode text into a finite vocabulary of ids before any matrix multiply can happen.
Math & statistics used here
- Embedding lookup is indexing rows of E ∈ ℝ^{|V|×d}: the one-hot row vector e_i times E selects row i (see the sketch after this list).
- Stacked token embeddings form X ∈ ℝ^{T×d}; every attention layer is ultimately matrix multiply on tensors like this.
- Weight tying reuses the embedding matrix, transposed, as the pre-softmax projection: two linear maps, one parameter tensor.
- Probability intuition: the softmax over the vocabulary is literally a categorical distribution; cross-entropy loss is −log p(correct token).
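A minimal numpy sketch of the first and last bullets, with toy sizes and illustrative values: multiplying a one-hot row into E selects a row of the table, and cross-entropy is −log of the softmax probability assigned to the correct id.

```python
import numpy as np

V, d = 8, 4                            # toy vocabulary size and embedding width
rng = np.random.default_rng(0)
E = rng.normal(size=(V, d))            # embedding table E ∈ ℝ^{|V|×d}

i = 3                                  # some token id
one_hot = np.zeros(V)
one_hot[i] = 1.0
assert np.allclose(one_hot @ E, E[i])  # one-hot times E selects row i

# Softmax over the vocabulary defines a categorical distribution;
# cross-entropy loss is -log p(correct id).
logits = rng.normal(size=V)
p = np.exp(logits - logits.max())
p /= p.sum()
ce_loss = -np.log(p[i])
```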
Checklist you can map to code
- Unicode normalisation, byte-pair encoding, and SentencePiece models determine which atomic units get ids.
- One-hot vectors are sparse indicators; learned embedding tables map token ids to dense rows in ℝ^{d_model}.
- Weight tying (mentioned in Section 3.4) reuses the embedding matrix as the pre-softmax projection to cut parameters.
- Semantic geometry—similar words landing nearby—is an emergent property of training, not a hand-written rule.
- Static shapes in frameworks mean padding, masks, and batching all happen at the token-id level before attention.
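To make the last item concrete, here is a minimal sketch, assuming right-padding and a pad id of 0 (both conventions, not requirements), of how ragged id lists become a rectangular batch plus a boolean mask before attention ever runs:

```python
import torch

PAD_ID = 0  # assumed pad id; real tokenisers reserve their own

def pad_batch(seqs: list[list[int]]) -> tuple[torch.Tensor, torch.Tensor]:
    """Right-pad token-id lists into a rectangle; return (ids, mask)."""
    T = max(len(s) for s in seqs)
    ids = torch.full((len(seqs), T), PAD_ID, dtype=torch.long)
    for row, seq in enumerate(seqs):
        ids[row, : len(seq)] = torch.tensor(seq, dtype=torch.long)
    return ids, ids.ne(PAD_ID)  # mask is True where a real token sits

ids, mask = pad_batch([[5, 9, 2], [7, 1]])
# ids is (2, 3); mask's second row is [True, True, False]
```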
Transformer layers never ingest characters directly; frameworks expect integer indices into an embedding table E ∈ ℝ^{|V| × d}. Each row is a trainable parameter vector; stacking the rows for a sentence forms the matrix that enters the first encoder layer in Figure 1.
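In framework terms, that lookup is a single indexing call. A sketch using PyTorch's nn.Embedding, with illustrative sizes:

```python
import torch
import torch.nn as nn

V, d_model = 32_000, 512            # illustrative |V| and d_model
emb = nn.Embedding(V, d_model)      # trainable table E ∈ ℝ^{|V|×d_model}

ids = torch.tensor([[5, 9, 2, 7]])  # (batch=1, T=4) integer token ids
X = emb(ids)                        # (1, 4, 512): stacked rows of E
```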
Tokenisation strategy changes FLOPs: smaller vocabularies yield longer sequences, and subword methods accept fragmenting rare words in exchange for a manageable |V|. That choice ripples into attention cost because complexity grows with the square of the sequence length.
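A back-of-the-envelope sketch of that ripple, with illustrative (not measured) token counts for the same sentence under two tokenisers:

```python
# Same sentence, two tokenisations (illustrative counts):
T_char = 120   # character-level: tiny |V|, long sequence
T_bpe = 30     # subword BPE: larger |V|, short sequence

# Each attention head builds a T x T score matrix, so cost scales with T^2.
ratio = (T_char / T_bpe) ** 2
print(ratio)   # 16.0 -- a 4x longer sequence costs 16x more in attention
```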
One-hot encoding is the conceptual starting point (an indicator vector with a single 1), but it is wasteful and offers no parameter sharing. Embeddings compress that space and let similar tokens share statistical strength, which is how the ‘bank’ of a river and the ‘bank’ that holds money, despite starting from the same embedding row, can acquire different contextual representations in later layers.
Weight tying is a software engineering trick with statistical motivation: tying input embeddings and decoder softmax weights regularises solutions and trims memory—Section 3.4 points this out succinctly.
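A minimal PyTorch sketch of the tying itself, in the style popularised by Press & Wolf (which the paper cites in Section 3.4): the output projection shares the embedding's parameter tensor instead of allocating its own (V, d_model) matrix. Sizes are illustrative.

```python
import torch.nn as nn

V, d_model = 32_000, 512                      # illustrative sizes
emb = nn.Embedding(V, d_model)                # input: id -> dense row
lm_head = nn.Linear(d_model, V, bias=False)   # output: hidden -> vocab logits

lm_head.weight = emb.weight                   # tie: both modules share one tensor
assert lm_head.weight.data_ptr() == emb.weight.data_ptr()
```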
When you inspect transformer checkpoints, the embedding table is among the largest single tensors, rivalled mainly by the feed-forward matrices; quantisation notebooks often shrink embeddings first.
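Counting parameters at the paper's base-model sizes (d_model = 512, d_ff = 2048, a shared BPE vocabulary of roughly 37,000 tokens) shows why:

```python
V, d_model, d_ff = 37_000, 512, 2_048  # base Transformer sizes from the paper

emb_params = V * d_model               # one tied embedding table
ffn_params = 2 * d_model * d_ff        # the two matrices of one FFN sublayer

print(emb_params)  # 18_944_000 -- roughly 19M parameters in a single table
print(ffn_params)  # 2_097_152  -- about 2.1M per feed-forward sublayer
```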
This topic sets up positional encodings immediately after: embeddings carry ‘what’, positions carry ‘where’—orthogonal concerns the paper merges by addition.
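One Section 3.4 detail worth carrying into the next topic: the paper multiplies embedding rows by √d_model before the positional signal is added. A sketch of the merge, with the positional tensor left as a placeholder for the next topic to fill in:

```python
import math
import torch
import torch.nn as nn

V, d_model, T = 32_000, 512, 4
emb = nn.Embedding(V, d_model)
ids = torch.tensor([[5, 9, 2, 7]])        # (1, T) token ids

pe = torch.zeros(T, d_model)              # placeholder: next topic defines this

# Section 3.4: scale embeddings by sqrt(d_model), then add positions.
X = emb(ids) * math.sqrt(d_model) + pe    # 'what' (scaled) + 'where'
```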