3/5 The Transformer in 90 Seconds (Then the Other 900)

· 6 min read

Previous: Why Can’t LLMs Spell “Raspberry”? It’s Tokenization.

The 90-Second Version

Here’s the entire GPT-2 pipeline:

Prompt
→ tokenize (Post 2)
→ token embedding + position embedding
→ 12x { Attention + MLP + residuals + LayerNorm }
→ logits (50,257 scores)
→ softmax → sample (Post 4)
→ next token
→ repeat

Input: up to 1024 tokens. Output: 50,257 logits. Pick one, append it, run again. That’s the loop.
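The loop is simple enough to sketch in a few lines. Here's a toy version in Python — `fake_model` is a stand-in that returns random logits (a real model would run the 12-block stack described below), and we pick greedily instead of sampling:

```python
import numpy as np

VOCAB_SIZE = 50_257
CONTEXT = 1024

def fake_model(token_ids):
    """Stand-in for GPT-2: returns one logit per vocabulary entry."""
    rng = np.random.default_rng(sum(token_ids))  # deterministic toy logits
    return rng.standard_normal(VOCAB_SIZE)

def generate(prompt_ids, n_new):
    ids = list(prompt_ids)
    for _ in range(n_new):
        logits = fake_model(ids[-CONTEXT:])  # feed up to 1024 tokens
        next_id = int(np.argmax(logits))     # greedy pick (sampling: Post 4)
        ids.append(next_id)                  # append it, run again
    return ids

out = generate([464, 2068], 5)               # 2 prompt tokens + 5 generated
```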

Everything below is the “other 900 seconds” — what happens inside those 12 blocks.

Names Are Confusing

The transformer paper introduced terminology borrowed from signal processing, information retrieval, and physics. If the names sound weird, it’s because they are.

| Term | Sounds like… | Actually means… |
| --- | --- | --- |
| Transformer | A power brick | A sequence model using self-attention |
| Attention | Human focus | A weighted mix based on similarity scores |
| Heads | CPU cores | Parallel pattern finders (a filter bank) |
| Encoder/Decoder | A codec | Bidirectional vs left-to-right processing |
| Temperature | Degrees Celsius | A randomness scale on logits |
| Logits | Logic levels | Pre-softmax raw scores |

If you think in electronics: attention is a crossbar switch. Multi-head attention is a filter bank. Positional embeddings are a timebase counter. Top-k is priority gating.

Embeddings: Turning Tokens into Vectors

The model can’t do math on token IDs (integers). It needs vectors — points in a high-dimensional space where similar meanings are close together.

GPT-2 small uses two learned embedding tables:

  • Token embeddings: 50,257 tokens x 768 dimensions. Each token gets a 768-dimensional vector.
  • Position embeddings: 1024 positions x 768 dimensions. Each position in the sequence gets its own vector.

The input to the first transformer block is:

x[i] = token_embed[token_id] + pos_embed[position]

Token identity + position. The model knows both what the token is and where it sits in the sequence.

The shape stays (batch, sequence_length, 768) throughout the entire stack. Every block takes 768-dimensional vectors in and puts 768-dimensional vectors out.
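The whole embedding step is two table lookups and an addition. A minimal numpy sketch (random toy weights stand in for the learned tables; the token IDs are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB, CTX, D = 50_257, 1024, 768
token_embed = rng.standard_normal((VOCAB, D)) * 0.02  # learned in real GPT-2
pos_embed   = rng.standard_normal((CTX, D)) * 0.02    # learned in real GPT-2

token_ids = [15496, 995]                   # two example token IDs
# each row: what the token is + where it sits
x = token_embed[token_ids] + pos_embed[:len(token_ids)]
```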

Attention: The Weighted Mix

Attention answers one question: “For each position, which other positions should I pay attention to?”

The mechanism:

  1. From the input, compute three vectors per position: Query (Q), Key (K), Value (V)
  2. Q and K determine how much each position attends to each other position (similarity scores)
  3. Those scores weight V — the actual content to mix in
```mermaid
graph LR
  A["Input x"] --> Q["Query (Q)"]
  A --> K["Key (K)"]
  A --> V["Value (V)"]
  Q --> S["Q * K^T / sqrt(d)"]
  K --> S
  S --> W["softmax → weights"]
  W --> O["weights * V → output"]
  V --> O
```

Think of it as a routing network: Q says “what am I looking for?”, K says “what do I have?”, and the similarity between them determines how much information flows from V.
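The three steps above fit in a few lines of numpy. This is a single-head sketch with random toy weights (no causal mask yet — that comes below):

```python
import numpy as np

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)  # stability shift
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, Wq, Wk, Wv):
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # similarity: Q·Kᵀ / √d
    weights = softmax(scores)                # each row sums to 1
    return weights @ V                       # weighted mix of values

rng = np.random.default_rng(0)
d = 64
x = rng.standard_normal((4, d))                          # 4 positions
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = attention(x, Wq, Wk, Wv)
```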

Multi-Head: Parallel Pattern Finders

GPT-2 runs 12 attention heads in parallel. Each head has its own Q, K, V projections and learns to detect different patterns:

  • One head might track syntax (subject-verb agreement)
  • Another might track coreference (which “it” refers to)
  • Another might track rhythm or style

The outputs are concatenated and projected back to 768 dimensions. Like a filter bank in signal processing — each filter sees the same input but extracts different features.
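The split-and-concatenate bookkeeping is just reshapes. A sketch of how 768 dimensions become 12 heads of 64 dimensions and back (single sequence, no batch dimension, for clarity):

```python
import numpy as np

def split_heads(x, n_head):
    # (seq, d) -> (n_head, seq, d // n_head)
    seq, d = x.shape
    return x.reshape(seq, n_head, d // n_head).transpose(1, 0, 2)

def merge_heads(h):
    # (n_head, seq, head_dim) -> (seq, n_head * head_dim)
    n_head, seq, hd = h.shape
    return h.transpose(1, 0, 2).reshape(seq, n_head * hd)

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 768))
heads = split_heads(x, 12)   # 12 heads, each seeing 64 of the 768 dims
back = merge_heads(heads)    # concatenation restores the original layout
```

In the real model the merged output then passes through one more learned projection back to 768 dimensions.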

The Causal Mask: No Spoilers

GPT-2 is a decoder-only model. It predicts the next token by looking only at tokens to the left. It cannot peek at future tokens.

This is enforced by a causal mask — a triangular matrix that blocks attention to positions on the right (masked scores are set to −∞ before the softmax, so their weights come out as exactly zero):

Position:  1  2  3  4
Token 1:  [1  0  0  0]   ← can only see itself
Token 2:  [1  1  0  0]   ← can see tokens 1-2
Token 3:  [1  1  1  0]   ← can see tokens 1-3
Token 4:  [1  1  1  1]   ← can see tokens 1-4

No looking ahead. Each position only attends to itself and everything before it. This is what makes it a “decoder” — it generates left to right.
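The mask is one line of numpy. Setting blocked scores to −∞ means they contribute zero weight after the softmax — here with uniform scores, so each row splits its weight evenly over the visible positions:

```python
import numpy as np

T = 4
mask = np.tril(np.ones((T, T), dtype=bool))  # lower triangle: self + past only

scores = np.zeros((T, T))                    # uniform toy scores
scores[~mask] = -np.inf                      # exp(-inf) = 0 after softmax
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
```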

Why Decoder-Only?

The original Transformer had both an encoder (bidirectional, sees everything) and a decoder (left-to-right, generates output). That design was for translation: the encoder reads the source language, the decoder writes the target language.

GPT-2 dropped the encoder. The task is simpler: predict the next token in the same stream. No source/target distinction. Just left-to-right generation with a causal mask.

```mermaid
graph TD
  A["Encoder-Decoder (T5, BART)"] --> B["Encoder reads source"]
  A --> C["Decoder generates target"]
  D["Decoder-Only (GPT-2, Llama)"] --> E["Single stream, left-to-right"]
  E --> F["Causal mask prevents looking ahead"]
  style D fill:#e8f5e9,stroke:#2e7d32
```

Most modern LLMs are decoder-only: GPT-2, GPT-4, Llama, Mistral, Claude. The architecture is simpler and scales well.

The MLP Block

After attention mixes information between positions, each position passes through a feed-forward network (MLP) independently:

MLP(x) = GELU(x * W1 + b1) * W2 + b2

This is where per-token computation happens — transforming the mixed representation into something more useful for the next layer. GELU is a smooth activation function (like ReLU, but differentiable everywhere).
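The formula above maps directly to code. In GPT-2 the hidden layer expands 768 → 3072 (4×) and back; this sketch uses random toy weights and the tanh approximation of GELU that GPT-2 uses:

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used by GPT-2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, W1, b1, W2, b2):
    return gelu(x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d, hidden = 768, 4 * 768        # GPT-2 expands 768 -> 3072 -> 768
W1, b1 = rng.standard_normal((d, hidden)) * 0.02, np.zeros(hidden)
W2, b2 = rng.standard_normal((hidden, d)) * 0.02, np.zeros(d)
out = mlp(rng.standard_normal((5, d)), W1, b1, W2, b2)  # applied per position
```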

Residual Connections and LayerNorm

Two features keep training stable:

  • Residual connections: The output of each sub-block (attention, MLP) is added to its input: output = x + block(x). This lets gradients flow directly through the network without vanishing.

  • LayerNorm: Normalizes activations to prevent them from growing or shrinking through 12 layers. Applied before each sub-block in GPT-2 (“pre-norm” style).
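Putting the two together, a GPT-2 block is two "normalize → sub-block → add" steps. A wiring sketch with identity functions standing in for attention and the MLP (real LayerNorm also has learned gain and bias, omitted here):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # per-position: zero mean, unit variance across the 768 dims
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def block(x, attn, mlp):
    x = x + attn(layer_norm(x))  # pre-norm: normalize, sub-block, then add
    x = x + mlp(layer_norm(x))
    return x

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 768))
identity = lambda v: v           # stand-in sub-blocks, just to show the wiring
out = block(x, identity, identity)
```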

Training: Mid-Sentence Starts

How does GPT-2 learn? The training data is processed like this:

  1. Tokenize all documents, inserting <|endoftext|> between them
  2. Concatenate everything into one long stream
  3. Slice into fixed 1024-token windows
  4. At every position, predict the next token. Compute loss. Update weights.

A window can start mid-sentence. That’s a feature — it makes the model robust to arbitrary prompt positions. When you start a conversation with “Tell me about…”, the model doesn’t need a clean document start.
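The four steps above are a few lines of Python. Toy documents and a window of 8 stand in for real data and the 1024-token window; 50256 is `<|endoftext|>` in GPT-2's vocabulary:

```python
EOT = 50256    # <|endoftext|> token id in GPT-2's vocabulary
WINDOW = 8     # real GPT-2 uses 1024

# step 1-2: tokenize documents and concatenate into one stream
docs = [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]
stream = []
for doc in docs:
    stream.extend(doc)
    stream.append(EOT)

# step 3: slice into fixed windows (may start mid-"sentence")
windows = [stream[i:i + WINDOW]
           for i in range(0, len(stream) - WINDOW + 1, WINDOW)]

# step 4: at every position, the target is simply the next token
inputs, targets = windows[0][:-1], windows[0][1:]
```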

GPT-2 Small: The Config

All of this is described in one file: config.json

| Parameter | Value | Meaning |
| --- | --- | --- |
| n_layer | 12 | Transformer blocks (depth) |
| n_head | 12 | Attention heads per block |
| n_embd | 768 | Embedding dimension |
| n_positions | 1024 | Maximum context length |
| vocab_size | 50257 | Vocabulary size |

12 layers, 12 heads, 768 dimensions, 1024 context. 124 million parameters total. Small enough to run on a laptop. Big enough to demonstrate every concept that scales to GPT-4.
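The 124 million figure falls out of the config by arithmetic. A back-of-the-envelope count (GPT-2 ties the output head to the token embedding table, so it adds no extra parameters):

```python
n_layer, n_embd, n_pos, vocab = 12, 768, 1024, 50_257

embed = vocab * n_embd + n_pos * n_embd              # token + position tables
attn  = 4 * n_embd * n_embd + 4 * n_embd             # Q,K,V,out projections + biases
mlp   = 2 * 4 * n_embd * n_embd + 4 * n_embd + n_embd  # 768->3072->768 + biases
ln    = 2 * 2 * n_embd                               # two LayerNorms, gain + bias

total = embed + n_layer * (attn + mlp + ln) + 2 * n_embd  # + final LayerNorm
print(f"{total / 1e6:.1f}M parameters")              # ≈ 124.4M
```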

Interactive

Want to see attention weights, embeddings, and logits in real time? The Transformer Explainer lets you type a prompt and watch GPT-2 process it step by step — attention patterns, layer outputs, and final token predictions.

Next: Be the Language DJ: Temperature, Top-k, and Top-p