Topic 5
Local receptive fields, dilations, and the hop to fully global attention
ByteNet and ConvS2S showed you could parallelise far more than RNNs allow, but widening the context seen per layer still meant stacking depth; a Transformer layer computes global attention logits immediately, albeit at memory quadratic in sequence length.
Math & statistics used here
- Conv1d is a Toeplitz-structured sparse matmul across positions; the receptive field grows roughly linearly with depth unless dilated (see the arithmetic sketch after this list).
- FLOPs per layer scale with kernel width × channels², not with the pairwise T² term of full attention.
- Stacking nonlinearities re-raises the depth-versus-stability puzzle; the batch-norm intuition carries over to LayerNorm in Transformers.
- Thinking in tensors clarifies hybrid designs: concatenate or add axes for heads the same way you stack CNN feature maps.
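A minimal back-of-the-envelope sketch of the first two bullets (receptive field growth and per-layer FLOP scaling); the kernel size, sequence length, and model width below are illustrative assumptions, not measurements.

```python
# Rough arithmetic for receptive field growth and per-layer FLOPs.
# All concrete numbers here are illustrative assumptions, not benchmarks.

def receptive_field(kernel_size, dilations):
    """Receptive field of a Conv1d stack: 1 + sum((k - 1) * dilation per layer)."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

k = 3
print(receptive_field(k, [1] * 6))               # undilated, 6 layers -> 13 tokens (~linear in depth)
print(receptive_field(k, [1, 2, 4, 8, 16, 32]))  # dilated, 6 layers -> 127 tokens (exponential)

# Per-layer FLOP scaling with constants dropped:
T, d = 1024, 512                                 # assumed sequence length and model width
conv_flops = T * k * d * d                       # kernel width x channels^2, linear in T
attn_flops = T * T * d + T * d * d               # pairwise T^2 scores plus projections
print(f"conv ~{conv_flops:.2e}, attention ~{attn_flops:.2e}")
```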
Checklist you can map to code
- Convolutional filters see k nearby tokens unless you deepen the network or dilate kernels.
- Depth increases receptive field roughly linearly with layer count absent dilation tricks.
- GPU kernels for Conv1d are efficient, motivating CNN–Transformer hybrids later.
- Padding, stride, and dilation hyperparameters feel more like systems tuning than linguistics, which prepared teams for analogous head-count tuning in Transformers (sketched in code after this list).
- None of this removes the pairwise comparison idea now central to self-attention.
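A minimal PyTorch sketch of the checklist: a causal, dilated Conv1d stack over token embeddings with padding and dilation set explicitly. The width, kernel size, and doubling dilation schedule are illustrative assumptions, not a prescription.

```python
import torch
import torch.nn as nn

# Sketch: causal dilated Conv1d stack over token embeddings.
# d_model, kernel size, and the dilation schedule are illustrative choices.
d_model, k = 512, 3
dilations = [1, 2, 4, 8]                       # doubling schedule widens the receptive field fast

layers = nn.ModuleList(
    [nn.Conv1d(d_model, d_model, kernel_size=k, dilation=d) for d in dilations]
)

x = torch.randn(2, 16, d_model)                # (batch, T, d_model) token embeddings
h = x.transpose(1, 2)                          # Conv1d expects (batch, channels, T)
for conv, d in zip(layers, dilations):
    pad = (k - 1) * d                          # left-pad only, so position t never sees t+1..T
    h = torch.relu(conv(nn.functional.pad(h, (pad, 0))))
out = h.transpose(1, 2)                        # back to (batch, T, d_model)
print(out.shape)                               # torch.Size([2, 16, 512])
```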
Applying CNN thinking to sentences treats them as 1-D signals with d_model channels: each temporal position is a vector and filters slide across time. But capturing subject–verb agreement across twenty tokens demands either many stacked layers or aggressive dilation schedules, which sacrifice some of the locality inductive bias you might still want.
ConvS2S results sat right at the competitiveness frontier just before the Transformer; they proved non-recurrent parallelism was viable. The Transformer abstract explicitly contrasts the fully attentional approach with recurrence and convolution, dispensing with convolutional mixing entirely.
Engineering takeaway: softmax attention resembles a densely connected bipartite similarity graph updated per layer; convolution resembles a sparse graph exploiting locality. Designers still explore sparse attention patterns marrying both intuitions.
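A minimal sketch of that graph view, assuming single-head attention and a hypothetical window size w: the dense softmax matrix is the fully connected bipartite update, and a banded mask recovers the convolution-like local pattern that sparse-attention designs exploit.

```python
import torch

# Dense pairwise similarity graph (softmax attention) vs. a banded, conv-like mask.
# Sequence length, width, and the window size w are illustrative assumptions.
T, d, w = 16, 64, 3
q, k, v = torch.randn(T, d), torch.randn(T, d), torch.randn(T, d)

scores = q @ k.T / d ** 0.5                        # (T, T): every position scores every other
dense = torch.softmax(scores, dim=-1) @ v          # fully connected bipartite update

idx = torch.arange(T)
local = (idx[None, :] - idx[:, None]).abs() <= w   # band of width 2w+1, like a kernel footprint
sparse = torch.softmax(scores.masked_fill(~local, float("-inf")), dim=-1) @ v
print(dense.shape, sparse.shape)                   # both (T, d)
```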
When profiling kernels, naive attention's O(n²) cost hurts at long contexts, yet moderate lengths enjoy highly optimised GEMM-heavy paths; the trade-offs differ from depth-stacked Conv1d.
Bridging topics: sinusoidal positional encodings resemble phase patterns from signal processing; they inject multi-frequency information so that a relative displacement corresponds to a fixed phase shift of each sinusoid.
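A minimal sketch of that phase analogy using the sinusoidal encodings from the Transformer paper; max_len and d_model below are assumed for illustration. A fixed offset shifts every sinusoid's phase by a position-independent constant, so the encoding at pos + k is a fixed rotation of the encoding at pos.

```python
import torch

# Sinusoidal positional encodings: each dimension pair is a sinusoid at a
# geometrically spaced frequency; a relative offset is a pure phase shift.
# max_len and d_model are illustrative assumptions.
max_len, d_model = 128, 64
pos = torch.arange(max_len, dtype=torch.float32)[:, None]                   # (max_len, 1)
freq = 10000.0 ** (-torch.arange(0, d_model, 2, dtype=torch.float32) / d_model)

pe = torch.zeros(max_len, d_model)
pe[:, 0::2] = torch.sin(pos * freq)            # even dims: sin(pos / 10000^(2i/d))
pe[:, 1::2] = torch.cos(pos * freq)            # odd dims: cos at the same frequency

# Relative displacement as phase shift: PE[pos + k] is a rotation of PE[pos].
k = 5
angle = k * freq                               # constant per frequency, independent of pos
rot_sin = pe[:-k, 0::2] * torch.cos(angle) + pe[:-k, 1::2] * torch.sin(angle)
print(torch.allclose(rot_sin, pe[k:, 0::2], atol=1e-5))  # True
```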