Transformer stacks
Encoders ingest source tokens and produce contextual embeddings; decoders fuse those embeddings autoregressively while generating the target sequence.
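A rough sketch of that encoder-decoder flow, assuming PyTorch; the vocabulary size, model width, and layer counts are placeholders, and token embedding stands in for the usual embedding-plus-positional-encoding step.

```python
import torch
import torch.nn as nn

d_model, vocab = 512, 1000
embed = nn.Embedding(vocab, d_model)
stack = nn.Transformer(d_model=d_model, nhead=8,
                       num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randint(0, vocab, (2, 7))  # source token ids, (batch, src_len)
tgt = torch.randint(0, vocab, (2, 5))  # target ids decoded so far, (batch, tgt_len)

# Causal mask: each target position may only attend to earlier target positions.
causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))

# The encoder contextualizes the source once; the decoder re-reads that memory
# at every step while predicting the next target token.
out = stack(embed(src), embed(tgt), tgt_mask=causal)  # (batch, tgt_len, d_model)
```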
Cross-attention bridges the two towers so that each decoding step can selectively read every encoder position, a pattern later variants drop in encoder-only pre-training or causal-only GPT stacks.
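To make the "read every encoder position" step concrete, here is an illustrative single-head cross-attention computation, not any library's internal implementation: queries come from the decoder state, keys and values from the encoder output, so no causal mask is needed on the source side.

```python
import torch
import torch.nn.functional as F

d = 64
W_q, W_k, W_v = torch.randn(d, d), torch.randn(d, d), torch.randn(d, d)

enc = torch.randn(7, d)  # encoder output: one vector per source position
dec = torch.randn(5, d)  # decoder hidden states: one per target position so far

q = dec @ W_q                 # queries from the decoder side
k, v = enc @ W_k, enc @ W_v   # keys and values from the encoder side

scores = q @ k.T / d ** 0.5          # (tgt_len, src_len) attention logits
weights = F.softmax(scores, dim=-1)  # each row sums to 1 over source positions
context = weights @ v                # (tgt_len, d): per-step mixture of encoder values
```

Encoder-only and decoder-only variants simply omit this block: the former has no decoder to feed, the latter attends only over its own causal prefix.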