4/5 Be the Language DJ: Temperature, Top-k, and Top-p
Previous: The Transformer in 90 Seconds
The transformer produces 50,257 logits — one raw score for every token in GPT-2’s vocabulary. Now what?
You need to pick one. That’s sampling. And the three sliders that control it — temperature, top-k, and top-p — are the difference between boring cliches and creative chaos.
Logits: What They Are
Logits are raw scores before softmax. They’re not probabilities. They can be negative. They don’t sum to anything meaningful.
What matters:
- Only differences between logits matter (shift-invariant)
- Higher logit = higher chance of being selected
- They become probabilities after softmax
Softmax: Turning Scores into Probabilities
Softmax converts logits into a probability distribution:
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
Why exponential ($e^x$)?
- Preserves ranking: if $z_i > z_j$, then $e^{z_i} > e^{z_j}$
- Magnifies differences: small logit gaps become large probability gaps
- Smooth and differentiable: critical for training
- All outputs positive, sum to 1: valid probability distribution
Example: logits [2, 3] become probabilities [0.27, 0.73]. Logits [2, 8] become [0.0025, 0.9975]. The exponential sharpens the preference.
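As a sketch, softmax fits in a few lines of plain Python. Subtracting the max logit first exploits the shift invariance noted above and keeps `exp()` from overflowing:

```python
import math

def softmax(logits):
    # Subtract the max first: softmax is shift-invariant, so the result is
    # unchanged, but large logits no longer overflow exp().
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

print([round(p, 4) for p in softmax([2, 3])])  # [0.2689, 0.7311]
print([round(p, 4) for p in softmax([2, 8])])  # [0.0025, 0.9975]
# Shift invariance: adding 100 to every logit changes nothing
print(softmax([2, 3]) == softmax([102, 103]))  # True
```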
Temperature: The Confidence Slider
Temperature ($\tau$) rescales logits before softmax:
$$\text{softmax}(z_i / \tau)$$
| Temperature | Effect | Use case |
|---|---|---|
| $\tau < 1.0$ | Sharper distribution, more confident | Safe, predictable output |
| $\tau = 1.0$ | No change (default) | Balanced |
| $\tau > 1.0$ | Flatter distribution, more random | Creative, exploratory |
| $\tau \to 0$ | Greedy (always picks the top token) | Deterministic |
The Numbers
Logits = [10, 9, 8] with three candidate tokens:
| Temperature | Probabilities | What happens |
|---|---|---|
| $\tau$ = 0.7 | [0.77, 0.18, 0.04] | Almost always picks token #1 |
| $\tau$ = 1.0 | [0.67, 0.24, 0.09] | Usually token #1, sometimes #2 |
| $\tau$ = 1.3 | [0.60, 0.28, 0.13] | More variety, token #3 has a real chance |
Temperature doesn’t change the ranking — it changes the confidence. Low = safe. High = wild.
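Temperature scaling is easy to sketch in plain Python (the helper name is illustrative). Running it on the logits above shows the distribution sharpening as $\tau$ drops and flattening as it rises:

```python
import math

def softmax_with_temperature(logits, tau):
    # Divide every logit by tau before softmax: tau < 1 sharpens the
    # distribution, tau > 1 flattens it, tau = 1 leaves it unchanged.
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [10, 9, 8]
for tau in (0.7, 1.0, 1.3):
    print(tau, [round(p, 2) for p in softmax_with_temperature(logits, tau)])
```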
Top-k: Fixed Shortlist
Top-k keeps the k highest-scoring tokens and drops everything else (sets their logits to $-\infty$).
- `top_k = 1` → greedy (same as $\tau = 0$)
- `top_k = 10` → only the 10 best candidates survive
- `top_k = 50` → moderate diversity
- `top_k = 0` → no filtering (all 50,257 tokens eligible)
Good for fixed-size control — you always know how many candidates are in the pool.
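A minimal top-k filter, assuming logits arrive as a plain Python list (the helper name is illustrative):

```python
def top_k_filter(logits, k):
    # Keep the k highest logits; everything else gets -inf so that softmax
    # assigns it probability 0. k = 0 means no filtering.
    # (Ties at the cutoff are all kept.)
    if k <= 0 or k >= len(logits):
        return list(logits)
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [z if z >= cutoff else float("-inf") for z in logits]

print(top_k_filter([10, 9, 8, 2], k=2))  # [10, 9, -inf, -inf]
print(top_k_filter([10, 9, 8, 2], k=0))  # [10, 9, 8, 2]
```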
Top-p (Nucleus Sampling): Adaptive Shortlist
Top-p keeps the smallest set of tokens whose cumulative probability is at least $p$.
The size of the shortlist adapts to the distribution:
- If the model is confident (one token has 95% probability), the shortlist might be 1-2 tokens
- If the model is uncertain (probability spread across many tokens), the shortlist might be 50+
Example
Sorted probabilities: [0.40, 0.30, 0.15, 0.10, 0.05]
| Method | Kept tokens | Why |
|---|---|---|
| `top_k = 3` | [0.40, 0.30, 0.15] | Fixed: always 3 |
| `top_p = 0.85` | [0.40, 0.30, 0.15] | 0.40 + 0.30 + 0.15 = 0.85 >= p |
| `top_p = 0.95` | [0.40, 0.30, 0.15, 0.10] | Needs 4 tokens to reach 0.95 |
Top-p adapts. Top-k doesn’t. For most content generation, top-p is the better default.
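A sketch of the nucleus cut over the sorted probabilities above (helper name is illustrative; a small epsilon guards against floating-point drift right at the threshold):

```python
def top_p_filter(probs, p):
    # Walk tokens in descending probability, keeping them until the
    # cumulative probability reaches at least p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p - 1e-9:  # epsilon absorbs float rounding error
            break
    return kept

probs = [0.40, 0.30, 0.15, 0.10, 0.05]
print(sorted(top_p_filter(probs, 0.85)))  # [0, 1, 2]
print(sorted(top_p_filter(probs, 0.95)))  # [0, 1, 2, 3]
```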
After Filtering: Renormalize
After top-k or top-p removes tokens, softmax is recomputed over only the remaining candidates. The filtered tokens get logit $-\infty$ (probability 0), and the survivors are renormalized to sum to 1.
This means the actual sampling always draws from a valid probability distribution — just a narrower one.
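The mask-then-renormalize step can be sketched directly, since `exp(-inf)` is exactly 0.0 and filtered tokens simply vanish from the sum (the helper name is illustrative):

```python
import math

def masked_softmax(logits, keep):
    # Tokens outside `keep` get logit -inf; exp(-inf) is exactly 0.0, so
    # they drop out of the sum and the survivors renormalize to 1.
    masked = [z if i in keep else float("-inf") for i, z in enumerate(logits)]
    m = max(masked)
    exps = [math.exp(z - m) for z in masked]
    total = sum(exps)
    return [e / total for e in exps]

probs = masked_softmax([10, 9, 8, 2], keep={0, 1})
print([round(p, 3) for p in probs])  # [0.731, 0.269, 0.0, 0.0]
```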
The Cheat Sheet
Recommended ranges
| Use case | Temperature | Top-p | Top-k |
|---|---|---|---|
| Drafting / factual | 0.7–0.9 | 0.9 | 0–50 |
| Creative writing | 0.9–1.2 | 0.9–0.95 | 40–100 |
| Deterministic / reproducible | 0 | 1.0 | 0 |
Failure modes
| Problem | Likely cause |
|---|---|
| Bland cliches, repetition | Temperature too low |
| Incoherent rambling | Temperature too high + top-p too high |
| Loops (same phrase repeated) | Small top-k without repetition penalty |
| Output stops making sense | All three sliders too aggressive |
Combining them
Temperature, top-k, and top-p are applied in sequence:
1. Temperature rescales the logits
2. Top-k removes all but the top k tokens
3. Softmax turns the survivors into probabilities
4. Top-p trims to the nucleus and renormalizes (it needs probabilities, so it acts after softmax)
5. A token is sampled at random from the result
You can combine all three. A common setup: temperature=0.9, top_p=0.9, top_k=50 — moderate creativity with a safety net.
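The whole sequence can be put together as one hedged sketch in pure stdlib Python. The function name and defaults are illustrative, not any particular library's API:

```python
import math
import random

def sample_token(logits, temperature=0.9, top_k=50, top_p=0.9, seed=None):
    # 1. Temperature rescales the logits
    scaled = [z / temperature for z in logits]
    # 2. Top-k: all but the k highest logits drop to -inf (0 disables)
    if 0 < top_k < len(scaled):
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [z if z >= cutoff else float("-inf") for z in scaled]
    # 3. Softmax over the survivors
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 4. Top-p: trim to the nucleus, then renormalize the survivors
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p - 1e-9:
            break
    probs = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(probs)
    probs = [p / total for p in probs]
    # 5. Sample a token index from the final distribution
    return random.Random(seed).choices(range(len(probs)), weights=probs, k=1)[0]

print(sample_token([10, 9, 8, 2], seed=0))  # an index in 0..3
```

Fixing `seed` makes the draw reproducible; with `top_k = 1` (or a tiny temperature) it degenerates to greedy decoding.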
Other Controls
Beyond the big three, most APIs expose:
| Parameter | What it does |
|---|---|
| `repetition_penalty` | Penalizes tokens that already appeared |
| `presence_penalty` | Penalizes based on whether a token appeared at all |
| `frequency_penalty` | Penalizes based on how often a token appeared |
| `stop` | Stop sequences (halt generation on specific strings) |
| `max_new_tokens` | Hard limit on output length |
| `seed` | Fix the random seed for reproducibility |
| `logprobs` | Return log-probabilities for diagnostics |
These live in the model’s generation_config.json on Hugging Face — or as API parameters on OpenAI/Groq.
Try It
- Transformer Explainer — watch GPT-2 sample in real time
- Groq Playground — try gpt-oss-20b at 1114 tokens/sec with adjustable sampling
- OpenAI Playground — compare models with different settings
Presets to try:
- Greedy: $\tau = 0$, $p = 1.0$, $k = 0$ — safe and bland
- Temperature sweep: fix $p = 0.9$, $k = 50$; sweep $\tau$ from 0.6 to 1.3
- Top-p vs top-k: fix $\tau = 0.9$; compare $p = 0.9$ (k=0) vs $k = 40$ (p=1.0)
Links
- GPT-2 generation_config.json — default sampling parameters
- GPT-2 vocabulary — the 50,257 tokens being scored
- Transformer Explainer — interactive GPT-2
- Groq Playground — fast inference with adjustable knobs