4/5 Be the Language DJ: Temperature, Top-k, and Top-p
Previous: The Transformer in 90 Seconds
The transformer produces 50,257 logits — one raw score for every token in GPT-2’s vocabulary. Now what?
You need to pick one. That’s sampling. And the three sliders that control it — temperature, top-k, and top-p — are the difference between boring cliches and creative chaos.
Logits: What They Are
Logits are raw scores before softmax. They’re not probabilities. They can be negative. They don’t sum to anything meaningful.
What matters:
- Only differences between logits matter (shift-invariant)
- Higher logit = higher chance of being selected
- They become probabilities after softmax
Softmax: Turning Scores into Probabilities
Softmax converts logits into a probability distribution:
$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
Why exponential ($e^x$)?
- Preserves ranking: if $z_i > z_j$, then $e^{z_i} > e^{z_j}$
- Magnifies differences: small logit gaps become large probability gaps
- Smooth and differentiable: critical for training
- All outputs positive, sum to 1: valid probability distribution
Example: logits [2, 3] become probabilities [0.27, 0.73]. Logits [2, 8] become [0.0025, 0.9975]. The exponential sharpens the preference.
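As a sketch, softmax fits in a few lines of plain Python. Subtracting the max logit first exploits the shift invariance noted above and keeps `exp()` from overflowing:

```python
import math

def softmax(logits):
    # Subtract the max first: softmax is shift-invariant, so the result is
    # unchanged, but large logits no longer overflow exp().
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

print([round(p, 4) for p in softmax([2, 3])])  # [0.2689, 0.7311]
print([round(p, 4) for p in softmax([2, 8])])  # [0.0025, 0.9975]
# Shift invariance: adding 100 to every logit changes nothing
print(softmax([2, 3]) == softmax([102, 103]))  # True
```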
Temperature: The Confidence Slider
Temperature ($\tau$) rescales logits before softmax:
$$\text{softmax}(z_i / \tau)$$
| Temperature | Effect | Use case |
|---|---|---|
| $\tau < 1.0$ | Sharper distribution, more confident | Safe, predictable output |
| $\tau = 1.0$ | No change (default) | Balanced |
| $\tau > 1.0$ | Flatter distribution, more random | Creative, exploratory |
| $\tau \to 0$ | Greedy (always picks the top token) | Deterministic |
The Numbers
Logits = [10, 9, 8] with three candidate tokens:
| Temperature | Probabilities | What happens |
|---|---|---|
| $\tau$ = 0.7 | [0.77, 0.18, 0.04] | Almost always picks token #1 |
| $\tau$ = 1.0 | [0.67, 0.24, 0.09] | Usually token #1, sometimes #2 |
| $\tau$ = 1.3 | [0.60, 0.28, 0.13] | More variety, token #3 has a real chance |
Temperature doesn’t change the ranking — it changes the confidence. Low = safe. High = wild.
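Temperature scaling is easy to sketch in plain Python (the helper name is illustrative). Running it on the logits above shows the distribution sharpening as $\tau$ drops and flattening as it rises:

```python
import math

def softmax_with_temperature(logits, tau):
    # Divide every logit by tau before softmax: tau < 1 sharpens the
    # distribution, tau > 1 flattens it, tau = 1 leaves it unchanged.
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [10, 9, 8]
for tau in (0.7, 1.0, 1.3):
    print(tau, [round(p, 2) for p in softmax_with_temperature(logits, tau)])
```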
Top-k: Fixed Shortlist
Top-k keeps the k highest-scoring tokens and drops everything else (sets their logits to $-\infty$).
- `top_k = 1` → greedy (same as $\tau = 0$)
- `top_k = 10` → only the 10 best candidates survive
- `top_k = 50` → moderate diversity
- `top_k = 0` → no filtering (all 50,257 tokens eligible)
Good for fixed-size control — you always know how many candidates are in the pool.
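A minimal top-k filter, assuming logits arrive as a plain Python list (the helper name is illustrative):

```python
def top_k_filter(logits, k):
    # Keep the k highest logits; everything else gets -inf so that softmax
    # assigns it probability 0. k = 0 means no filtering.
    # (Ties at the cutoff are all kept.)
    if k <= 0 or k >= len(logits):
        return list(logits)
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [z if z >= cutoff else float("-inf") for z in logits]

print(top_k_filter([10, 9, 8, 2], k=2))  # [10, 9, -inf, -inf]
print(top_k_filter([10, 9, 8, 2], k=0))  # [10, 9, 8, 2]
```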
Top-p (Nucleus Sampling): Adaptive Shortlist
Top-p keeps the smallest set of tokens whose cumulative probability is at least $p$.
The size of the shortlist adapts to the distribution:
- If the model is confident (one token has 95% probability), the shortlist might be 1-2 tokens
- If the model is uncertain (probability spread across many tokens), the shortlist might be 50+
Example
Sorted probabilities: [0.40, 0.30, 0.15, 0.10, 0.05]
| Method | Kept tokens | Why |
|---|---|---|
| `top_k = 3` | [0.40, 0.30, 0.15] | Fixed: always 3 |
| `top_p = 0.85` | [0.40, 0.30, 0.15] | 0.40 + 0.30 + 0.15 = 0.85 >= p |
| `top_p = 0.95` | [0.40, 0.30, 0.15, 0.10] | Needs 4 tokens to reach 0.95 |
Top-p adapts. Top-k doesn’t. For most content generation, top-p is the better default.
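A sketch of the nucleus cut over the sorted probabilities above (helper name is illustrative; a small epsilon guards against floating-point drift right at the threshold):

```python
def top_p_filter(probs, p):
    # Walk tokens in descending probability, keeping them until the
    # cumulative probability reaches at least p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= p - 1e-9:  # epsilon absorbs float rounding error
            break
    return kept

probs = [0.40, 0.30, 0.15, 0.10, 0.05]
print(sorted(top_p_filter(probs, 0.85)))  # [0, 1, 2]
print(sorted(top_p_filter(probs, 0.95)))  # [0, 1, 2, 3]
```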
After Filtering: Renormalize
After top-k or top-p removes tokens, softmax is recomputed over only the remaining candidates. The filtered tokens get logit $-\infty$ (probability 0), and the survivors are renormalized to sum to 1.
This means the actual sampling always draws from a valid probability distribution — just a narrower one.
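The mask-then-renormalize step can be sketched directly, since `exp(-inf)` is exactly 0.0 and filtered tokens simply vanish from the sum (the helper name is illustrative):

```python
import math

def masked_softmax(logits, keep):
    # Tokens outside `keep` get logit -inf; exp(-inf) is exactly 0.0, so
    # they drop out of the sum and the survivors renormalize to 1.
    masked = [z if i in keep else float("-inf") for i, z in enumerate(logits)]
    m = max(masked)
    exps = [math.exp(z - m) for z in masked]
    total = sum(exps)
    return [e / total for e in exps]

probs = masked_softmax([10, 9, 8, 2], keep={0, 1})
print([round(p, 3) for p in probs])  # [0.731, 0.269, 0.0, 0.0]
```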
The Cheat Sheet
Recommended ranges
| Use case | Temperature | Top-p | Top-k |
|---|---|---|---|
| Drafting / factual | 0.7–0.9 | 0.9 | 0–50 |
| Creative writing | 0.9–1.2 | 0.9–0.95 | 40–100 |
| Deterministic / reproducible | 0 | 1.0 | 0 |
Failure modes
| Problem | Likely cause |
|---|---|
| Bland cliches, repetition | Temperature too low |
| Incoherent rambling | Temperature too high + top-p too high |
| Loops (same phrase repeated) | Small top-k without repetition penalty |
| Output stops making sense | All three sliders too aggressive |
Combining them
Temperature, top-k, and top-p are applied in sequence:
1. Temperature rescales the logits
2. Top-k removes all but the top k tokens
3. Softmax turns the survivors into probabilities
4. Top-p trims to the nucleus and renormalizes (it needs probabilities, so it acts after softmax)
5. A token is sampled at random from the result
You can combine all three. A common setup: temperature=0.9, top_p=0.9, top_k=50 — moderate creativity with a safety net.
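The whole sequence can be put together as one hedged sketch in pure stdlib Python. The function name and defaults are illustrative, not any particular library's API:

```python
import math
import random

def sample_token(logits, temperature=0.9, top_k=50, top_p=0.9, seed=None):
    # 1. Temperature rescales the logits
    scaled = [z / temperature for z in logits]
    # 2. Top-k: all but the k highest logits drop to -inf (0 disables)
    if 0 < top_k < len(scaled):
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [z if z >= cutoff else float("-inf") for z in scaled]
    # 3. Softmax over the survivors
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 4. Top-p: trim to the nucleus, then renormalize the survivors
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p - 1e-9:
            break
    probs = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(probs)
    probs = [p / total for p in probs]
    # 5. Sample a token index from the final distribution
    return random.Random(seed).choices(range(len(probs)), weights=probs, k=1)[0]

print(sample_token([10, 9, 8, 2], seed=0))  # an index in 0..3
```

Fixing `seed` makes the draw reproducible; with `top_k = 1` (or a tiny temperature) it degenerates to greedy decoding.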
Other Controls
Beyond the big three, most APIs expose:
| Parameter | What it does |
|---|---|
| `repetition_penalty` | Penalizes tokens that already appeared |
| `presence_penalty` | Penalizes based on whether a token appeared at all |
| `frequency_penalty` | Penalizes based on how often a token appeared |
| `stop` | Stop sequences (halt generation on specific strings) |
| `max_new_tokens` | Hard limit on output length |
| `seed` | Fix the random seed for reproducibility |
| `logprobs` | Return log-probabilities for diagnostics |
These live in the model’s generation_config.json on Hugging Face — or as API parameters on OpenAI/Groq.
Try It
- Transformer Explainer — watch GPT-2 sample in real time
- Groq Playground — try gpt-oss-20b at 1114 tokens/sec with adjustable sampling
- OpenAI Playground — compare models with different settings
Presets to try:
- Greedy: $\tau = 0$, $p = 1.0$, $k = 0$ — safe and bland
- Temperature sweep: fix $p = 0.9$, $k = 50$; sweep $\tau$ from 0.6 to 1.3
- Top-p vs top-k: fix $\tau = 0.9$; compare $p = 0.9$ (k=0) vs $k = 40$ (p=1.0)
Links
- GPT-2 generation_config.json — default sampling parameters
- GPT-2 vocabulary — the 50,257 tokens being scored
- Transformer Explainer — interactive GPT-2
- Groq Playground — fast inference with adjustable knobs