August 24, 2025 · 6 min read · 3/5

The Transformer in 90 Seconds (Then the Other 900)

Attention is a weighted mix. Multi-head is a filter bank. The causal mask means no spoilers. Here's the transformer architecture without the math, then with just enough of it.

#LLM #transformer #attention #GPT-2 #AI