3/5 The Transformer in 90 Seconds (Then the Other 900)
Previous: Why Can’t LLMs Spell “Raspberry”? It’s Tokenization.
The 90-Second Version
Here’s the entire GPT-2 pipeline:
Prompt
→ tokenize (Post 2)
→ token embedding + position embedding
→ 12x { Attention + MLP + residuals + LayerNorm }
→ logits (50,257 scores)
→ softmax → sample (Post 4)
→ next token
→ repeat
Input: up to 1024 tokens. Output: 50,257 logits. Pick one, append it, run again. That’s the loop.
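That loop, with a stand-in for the actual network, can be sketched in a few lines. The `model` function below is a hypothetical placeholder that just returns random scores; in reality it is the full 12-block stack:

```python
import numpy as np

VOCAB_SIZE = 50257   # GPT-2's vocabulary
MAX_CONTEXT = 1024   # GPT-2's context window

def model(tokens):
    """Stand-in for the real network: one score (logit) per vocab entry.
    Seeded from the input so the sketch is deterministic."""
    rng = np.random.default_rng(sum(tokens))
    return rng.standard_normal(VOCAB_SIZE)

def generate(prompt_tokens, n_new):
    tokens = list(prompt_tokens)
    for _ in range(n_new):
        logits = model(tokens[-MAX_CONTEXT:])  # only the last 1024 tokens fit
        tokens.append(int(np.argmax(logits)))  # greedy pick for the sketch; Post 4 covers sampling
    return tokens

out = generate([464, 3290], n_new=3)   # 2 prompt tokens + 3 generated
```

The only thing that changes between a toy and the real GPT-2 is what `model` does inside.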
Everything below is the “other 900 seconds” — what happens inside those 12 blocks.
Names Are Confusing
The transformer paper introduced terminology borrowed from signal processing, information retrieval, and physics. If the names sound weird, it’s because they are.
| Term | Sounds like… | Actually means… |
|---|---|---|
| Transformer | A power brick | A sequence model using self-attention |
| Attention | Human focus | A weighted mix based on similarity scores |
| Heads | CPU cores | Parallel pattern finders (a filter bank) |
| Encoder/Decoder | A codec | Bidirectional vs left-to-right processing |
| Temperature | Degrees Celsius | A randomness scale on logits |
| Logits | Logic levels | Pre-softmax raw scores |
If you think in electronics: attention is a crossbar switch. Multi-head attention is a filter bank. Positional embeddings are a timebase counter. Top-k is priority gating.
Embeddings: Turning Tokens into Vectors
The model can’t do math on token IDs (integers). It needs vectors — points in a high-dimensional space where similar meanings are close together.
GPT-2 small uses two learned embedding tables:
- Token embeddings: 50,257 tokens x 768 dimensions. Each token gets a 768-dimensional vector.
- Position embeddings: 1024 positions x 768 dimensions. Each position in the sequence gets its own vector.
The input to the first transformer block is:
x[i] = token_embed[token_id] + pos_embed[position]
Token identity + position. The model knows both what the token is and where it sits in the sequence.
The shape stays (batch, sequence_length, 768) throughout the entire stack. Every block takes 768-dimensional vectors in and puts 768-dimensional vectors out.
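The two lookups and the addition are exactly this simple. A sketch with random stand-ins for the learned tables (the real values come from training):

```python
import numpy as np

n_vocab, n_ctx, n_embd = 50257, 1024, 768
rng = np.random.default_rng(0)
# Learned tables in the real model; random stand-ins here (float32 to keep memory down)
token_embed = rng.standard_normal((n_vocab, n_embd), dtype=np.float32)
pos_embed   = rng.standard_normal((n_ctx, n_embd), dtype=np.float32)

token_ids = np.array([15496, 995])      # some token IDs
positions = np.arange(len(token_ids))   # 0, 1, ...
x = token_embed[token_ids] + pos_embed[positions]
print(x.shape)   # (2, 768): one 768-dimensional vector per token
```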
Attention: The Weighted Mix
Attention answers one question: “For each position, which other positions should I pay attention to?”
The mechanism:
- From the input, compute three vectors per position: Query (Q), Key (K), Value (V)
- Q and K determine how much each position attends to each other position (similarity scores)
- Those scores weight V — the actual content to mix in
Think of it as a routing network: Q says “what am I looking for?”, K says “what do I have?”, and the similarity between them determines how much information flows from V.
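A single attention head is small enough to write out directly. This sketch uses toy dimensions and random projection matrices (the real ones are learned), but the mechanics are the same: project to Q, K, V; score Q against K; softmax; mix V:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how well each query matches each key
    weights = softmax(scores)                # each row sums to 1
    return weights @ V, weights              # mix the values by those weights

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))              # 4 positions, 8 dims (toy size)
Wq, Wk, Wv = (rng.standard_normal((8, 8)) for _ in range(3))
out, weights = attention(x, Wq, Wk, Wv)
```

The division by `sqrt(d)` keeps the scores from growing with dimension, so the softmax doesn't saturate.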
Multi-Head: Parallel Pattern Finders
GPT-2 runs 12 attention heads in parallel. Each head has its own Q, K, V projections and learns to detect different patterns:
- One head might track syntax (subject-verb agreement)
- Another might track coreference (which “it” refers to)
- Another might track rhythm or style
The outputs are concatenated and projected back to 768 dimensions. Like a filter bank in signal processing — each filter sees the same input but extracts different features.
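The split-and-concatenate structure can be sketched as follows. Each head gets its own (random, stand-in) projections down to 64 dimensions, runs attention independently, and the 12 results are glued back together and projected:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head(x, n_head, rng):
    seq, d = x.shape
    hd = d // n_head                          # 768 / 12 = 64 dims per head
    head_outputs = []
    for _ in range(n_head):                   # each head: its own Q, K, V projections
        Wq, Wk, Wv = (0.02 * rng.standard_normal((d, hd)) for _ in range(3))
        Q, K, V = x @ Wq, x @ Wk, x @ Wv
        w = softmax(Q @ K.T / np.sqrt(hd))
        head_outputs.append(w @ V)            # (seq, 64) per head
    concat = np.concatenate(head_outputs, axis=-1)  # back to (seq, 768)
    Wo = 0.02 * rng.standard_normal((d, d))         # output projection
    return concat @ Wo

rng = np.random.default_rng(0)
out = multi_head(rng.standard_normal((5, 768)), n_head=12, rng=rng)
print(out.shape)   # (5, 768)
```

Real implementations do all heads in one batched matrix multiply rather than a Python loop, but the math is identical.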
The Causal Mask: No Spoilers
GPT-2 is a decoder-only model. It predicts the next token by looking only at tokens to the left. It cannot peek at future tokens.
This is enforced by a causal mask — a triangular pattern that sets the scores for positions to the right to −∞ before the softmax, so their attention weights come out as exactly zero:
Position: 1 2 3 4
Token 1: [1 0 0 0] ← can only see itself
Token 2: [1 1 0 0] ← can see tokens 1-2
Token 3: [1 1 1 0] ← can see tokens 1-3
Token 4: [1 1 1 1] ← can see tokens 1-4
No looking ahead. Each position only attends to itself and everything before it. This is what makes it a “decoder” — it generates left to right.
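The mask is a one-liner in NumPy. Using all-zero stand-in scores makes the effect easy to see: after the softmax, row i spreads its weight evenly over positions 0..i and puts exactly zero everywhere else:

```python
import numpy as np

seq = 4
scores = np.zeros((seq, seq))                    # stand-in for the Q·K scores
mask = np.tril(np.ones((seq, seq), dtype=bool))  # lower triangle = allowed
scores = np.where(mask, scores, -np.inf)         # future positions → -inf

# softmax turns -inf scores into exactly zero weight
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(weights.round(2))
```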
Why Decoder-Only?
The original Transformer had both an encoder (bidirectional, sees everything) and a decoder (left-to-right, generates output). That design was for translation: the encoder reads the source language, the decoder writes the target language.
GPT-2 dropped the encoder. The task is simpler: predict the next token in the same stream. No source/target distinction. Just left-to-right generation with a causal mask.
Most modern LLMs are decoder-only: GPT-2, GPT-4, Llama, Mistral, Claude. The architecture is simpler and scales well.
The MLP Block
After attention mixes information between positions, each position passes through a feed-forward network (MLP) independently:
MLP(x) = GELU(x * W1 + b1) * W2 + b2
This is where per-token computation happens: W1 expands each 768-dimensional vector to 3072 dimensions (a 4x expansion), GELU applies a smooth activation (like ReLU, but differentiable everywhere), and W2 projects back down to 768, transforming the mixed representation into something more useful for the next layer.
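In code, with random stand-ins for the learned weights and the tanh approximation of GELU that GPT-2 uses:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used by GPT-2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W1, b1, W2, b2):
    return gelu(x @ W1 + b1) @ W2 + b2

n_embd, hidden = 768, 4 * 768                   # expand to 3072, project back
rng = np.random.default_rng(0)
W1, b1 = 0.02 * rng.standard_normal((n_embd, hidden)), np.zeros(hidden)
W2, b2 = 0.02 * rng.standard_normal((hidden, n_embd)), np.zeros(n_embd)

x = rng.standard_normal((5, n_embd))   # 5 positions, each transformed independently
out = mlp(x, W1, b1, W2, b2)
print(out.shape)   # (5, 768)
```

Note there is no mixing between positions here; only attention moves information across the sequence.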
Residual Connections and LayerNorm
Two features keep training stable:
- Residual connections: the output of each sub-block (attention, MLP) is added to its input: output = x + block(x). This lets gradients flow directly through the network without vanishing.
- LayerNorm: normalizes activations to prevent them from growing or shrinking through 12 layers. Applied before each sub-block in GPT-2 (“pre-norm” style).
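Put together, one pre-norm transformer block is just two normalize-transform-add steps. A sketch (learned scale/shift parameters of LayerNorm omitted, and toy lambdas standing in for the real attention and MLP sub-blocks):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)   # learned scale/shift omitted for brevity

def block(x, attn, mlp):
    x = x + attn(layer_norm(x))   # pre-norm: normalize, transform, then add back
    x = x + mlp(layer_norm(x))
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 768))
out = block(x, attn=lambda h: h * 0.5, mlp=lambda h: h * 0.5)   # toy sub-blocks
print(out.shape)   # (4, 768): shape is preserved through every block
```

Because `block` maps (seq, 768) to (seq, 768), GPT-2 can stack 12 of them without any glue code.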
Training: Mid-Sentence Starts
How does GPT-2 learn? The training data is processed like this:
- Tokenize all documents, inserting <|endoftext|> between them
- Concatenate everything into one long stream
- Slice into fixed 1024-token windows
- At every position, predict the next token. Compute loss. Update weights.
A window can start mid-sentence. That’s a feature — it makes the model robust to arbitrary prompt positions. When you start a conversation with “Tell me about…”, the model doesn’t need a clean document start.
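The slicing is trivial to sketch with a toy stream and a tiny window (the real window is 1024 tokens; the document token IDs below are made up, only the <|endoftext|> ID is GPT-2's real one):

```python
# Toy stream standing in for tokenized documents joined by <|endoftext|>
EOT = 50256                          # GPT-2's actual <|endoftext|> token ID
stream = [10, 11, 12, EOT, 20, 21, 22, 23, 24, EOT, 30, 31]

WINDOW = 4                           # 1024 in the real model
windows = [stream[i:i + WINDOW] for i in range(0, len(stream) - WINDOW + 1, WINDOW)]

# At each position, the training target is simply the next token in the window:
inputs  = [w[:-1] for w in windows]
targets = [w[1:]  for w in windows]
print(windows[1])   # [20, 21, 22, 23] — starts mid-"document"
```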
GPT-2 Small: The Config
All of this is described in one file: config.json
| Parameter | Value | Meaning |
|---|---|---|
| n_layer | 12 | Transformer blocks (depth) |
| n_head | 12 | Attention heads per block |
| n_embd | 768 | Embedding dimension |
| n_positions | 1024 | Maximum context length |
| vocab_size | 50257 | Vocabulary size |
12 layers, 12 heads, 768 dimensions, 1024 context. 124 million parameters total. Small enough to run on a laptop. Big enough to demonstrate every concept that scales to GPT-4.
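The “124 million” isn't a round marketing number; you can derive it from the config. The accounting below assumes GPT-2's weight tying (the output head reuses the token embedding table, so it adds no parameters):

```python
n_layer, n_embd, n_ctx, n_vocab = 12, 768, 1024, 50257

embed = n_vocab * n_embd + n_ctx * n_embd   # token + position tables
attn  = (n_embd * 3 * n_embd + 3 * n_embd   # fused Q/K/V projection + bias
       + n_embd * n_embd + n_embd)          # output projection + bias
mlp   = (n_embd * 4 * n_embd + 4 * n_embd   # 768 → 3072 + bias
       + 4 * n_embd * n_embd + n_embd)      # 3072 → 768 + bias
norms = 2 * 2 * n_embd                      # two LayerNorms per block, scale + shift
final_norm = 2 * n_embd                     # LayerNorm after the last block

total = embed + n_layer * (attn + mlp + norms) + final_norm
print(f"{total:,}")   # 124,439,808
```

Almost a third of the parameters are the embedding table alone, which is why vocabulary size matters so much at this scale.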
Interactive
Want to see attention weights, embeddings, and logits in real time? The Transformer Explainer lets you type a prompt and watch GPT-2 process it step by step — attention patterns, layer outputs, and final token predictions.
Links
- GPT-2 on Hugging Face — model card and files
- GPT-2 config.json — architecture parameters
- Transformer Explainer — interactive visualization
- Attention Is All You Need — the 2017 paper