4/5 Be the Language DJ: Temperature, Top-k, and Top-p

6 min read

Previous: The Transformer in 90 Seconds

The transformer produces 50,257 logits — one raw score for every token in GPT-2’s vocabulary. Now what?

You need to pick one. That’s sampling. And the three sliders that control it — temperature, top-k, and top-p — are the difference between boring cliches and creative chaos.

Logits: What They Are

Logits are raw scores before softmax. They’re not probabilities. They can be negative. They don’t sum to anything meaningful.

What matters:

  • Only differences between logits matter (shift-invariant)
  • Higher logit = higher chance of being selected
  • They become probabilities after softmax

Softmax: Turning Scores into Probabilities

Softmax converts logits into a probability distribution:

$$\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

Why exponential ($e^x$)?

  • Preserves ranking: if $z_i > z_j$, then $e^{z_i} > e^{z_j}$
  • Magnifies differences: small logit gaps become large probability gaps
  • Smooth and differentiable: critical for training
  • All outputs positive, sum to 1: valid probability distribution

Example: logits [2, 3] become probabilities [0.27, 0.73]. Logits [2, 8] become [0.0025, 0.9975]. The exponential sharpens the preference.
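You can verify these numbers (and the shift-invariance property) with a few lines of plain Python — a minimal sketch, not how any real library implements it:

```python
import math

def softmax(logits):
    # Subtract the max logit for numerical stability. Because only
    # differences between logits matter (shift-invariance), this does
    # not change the result.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

print([round(p, 4) for p in softmax([2, 3])])   # [0.2689, 0.7311]
print([round(p, 4) for p in softmax([2, 8])])   # [0.0025, 0.9975]
print(softmax([2, 3]) == softmax([102, 103]))   # True: shift-invariant
```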

Temperature: The Confidence Slider

Temperature ($\tau$) rescales logits before softmax:

$$\text{softmax}(z_i / \tau)$$

| Temperature | Effect | Use case |
| --- | --- | --- |
| $\tau < 1.0$ | Sharper distribution, more confident | Safe, predictable output |
| $\tau = 1.0$ | No change (default) | Balanced |
| $\tau > 1.0$ | Flatter distribution, more random | Creative, exploratory |
| $\tau \to 0$ | Greedy (always picks the top token) | Deterministic |

The Numbers

Logits = [10, 9, 8] with three candidate tokens:

| Temperature | Probabilities | What happens |
| --- | --- | --- |
| $\tau = 0.7$ | [0.77, 0.18, 0.04] | Strongly favors token #1 |
| $\tau = 1.0$ | [0.67, 0.24, 0.09] | Usually token #1, sometimes #2 |
| $\tau = 1.3$ | [0.60, 0.28, 0.13] | More variety, token #3 has a real chance |

Temperature doesn’t change the ranking — it changes the confidence. Low = safe. High = wild.
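The table above is easy to reproduce — a toy sketch of temperature scaling, assuming the `softmax` helper below:

```python
import math

def softmax(logits):
    # Max-subtraction for numerical stability; result is unchanged.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def apply_temperature(logits, tau):
    # Divide every logit by tau, then softmax.
    return softmax([z / tau for z in logits])

for tau in (0.7, 1.0, 1.3):
    probs = apply_temperature([10, 9, 8], tau)
    print(tau, [round(p, 2) for p in probs])
# 0.7 [0.77, 0.18, 0.04]
# 1.0 [0.67, 0.24, 0.09]
# 1.3 [0.6, 0.28, 0.13]
```

Note that the ranking of the three tokens never changes — only the gap between their probabilities does.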

Top-k: Fixed Shortlist

Top-k keeps the k highest-scoring tokens and drops everything else (sets their logits to $-\infty$).

  • top_k = 1 → greedy (same as $\tau = 0$)
  • top_k = 10 → only the 10 best candidates survive
  • top_k = 50 → moderate diversity
  • top_k = 0 → no filtering (all 50,257 tokens eligible)

Good for fixed-size control — you always know how many candidates are in the pool.
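A minimal sketch of the top-k filter (real implementations work on tensors, and tie-breaking details vary):

```python
def top_k_filter(logits, k):
    # Keep the k highest logits; set the rest to -inf so softmax
    # assigns them probability 0. k = 0 means "no filtering".
    # Note: on exact ties at the cutoff, this sketch may keep extras.
    if k == 0:
        return list(logits)
    threshold = sorted(logits, reverse=True)[min(k, len(logits)) - 1]
    return [z if z >= threshold else float("-inf") for z in logits]

print(top_k_filter([10, 9, 8, 7], k=2))  # [10, 9, -inf, -inf]
```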

Top-p (Nucleus Sampling): Adaptive Shortlist

Top-p keeps the smallest set of tokens whose cumulative probability is at least $p$.

The size of the shortlist adapts to the distribution:

  • If the model is confident (one token has 95% probability), the shortlist might be 1-2 tokens
  • If the model is uncertain (probability spread across many tokens), the shortlist might be 50+

Example

Sorted probabilities: [0.40, 0.30, 0.15, 0.10, 0.05]

| Method | Kept tokens | Why |
| --- | --- | --- |
| top_k = 3 | [0.40, 0.30, 0.15] | Fixed: always 3 |
| top_p = 0.85 | [0.40, 0.30, 0.15] | 0.40 + 0.30 + 0.15 = 0.85 >= p |
| top_p = 0.95 | [0.40, 0.30, 0.15, 0.10] | Needs 4 tokens to reach 0.95 |

Top-p adapts. Top-k doesn’t. For most content generation, top-p is the better default.

After Filtering: Renormalize

After top-k or top-p removes tokens, softmax is recomputed over only the remaining candidates. The filtered tokens get logit $-\infty$ (probability 0), and the survivors are renormalized to sum to 1.

This means the actual sampling always draws from a valid probability distribution — just a narrower one.
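The nucleus-then-renormalize step can be sketched in plain Python — an illustration of the idea, not the exact implementation any library uses:

```python
def top_p_filter(probs, p):
    # Walk the distribution from most to least probable, keeping the
    # smallest prefix whose cumulative probability reaches p, then
    # renormalize the survivors so they sum to 1 again.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

# Matches the table above: p = 0.85 keeps the first three tokens.
print(top_p_filter([0.40, 0.30, 0.15, 0.10, 0.05], p=0.85))
```

The returned dict maps surviving token indices to their renormalized probabilities — a valid, narrower distribution ready for sampling.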

The Cheat Sheet

| Use case | Temperature | Top-p | Top-k |
| --- | --- | --- | --- |
| Drafting / factual | 0.7–0.9 | 0.9 | 0–50 |
| Creative writing | 0.9–1.2 | 0.9–0.95 | 40–100 |
| Deterministic / reproducible | 0 | 1.0 | 0 |

Failure modes

| Problem | Likely cause |
| --- | --- |
| Bland cliches, repetition | Temperature too low |
| Incoherent rambling | Temperature too high + top-p too high |
| Loops (same phrase repeated) | Small top-k without repetition penalty |
| Output stops making sense | All three sliders too aggressive |

Combining them

Temperature, top-k, and top-p are applied in sequence:

  1. Temperature rescales the logits
  2. Top-k removes all but the top k tokens
  3. Top-p further trims to the nucleus
  4. Softmax normalizes the survivors
  5. Random sample from the result

You can combine all three. A common setup: temperature=0.9, top_p=0.9, top_k=50 — moderate creativity with a safety net.
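Putting the five steps together — a toy end-to-end sampler, not the exact pipeline any production library uses (real implementations vectorize this over tensors):

```python
import math
import random

def sample_next_token(logits, temperature=0.9, top_k=50, top_p=0.9, rng=random):
    # 1. Temperature rescales the logits.
    scaled = [z / temperature for z in logits]
    # 2. Top-k: keep the k highest logits (k = 0 disables the filter).
    if top_k:
        cutoff = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [z if z >= cutoff else float("-inf") for z in scaled]
    # Softmax over the survivors (exp(-inf) is 0, so dropped tokens
    # get probability 0).
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # 3. Top-p: trim to the nucleus.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # 4. Renormalize the nucleus, then 5. draw a random sample from it.
    total = sum(probs[i] for i in kept)
    return rng.choices(kept, weights=[probs[i] / total for i in kept])[0]

print(sample_next_token([10.0, 9.0, 8.0, 2.0]))
```

With the defaults above and logits [10, 9, 8, 2], the nucleus ends up holding only the top two tokens, so the sampler returns index 0 or 1.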

Other Controls

Beyond the big three, most APIs expose:

| Parameter | What it does |
| --- | --- |
| repetition_penalty | Penalizes tokens that already appeared |
| presence_penalty | Penalizes based on whether a token appeared at all |
| frequency_penalty | Penalizes based on how often a token appeared |
| stop | Stop sequences (halt generation on specific strings) |
| max_new_tokens | Hard limit on output length |
| seed | Fixes the random seed for reproducibility |
| logprobs | Returns log-probabilities for diagnostics |

These live in the model’s generation_config.json on Hugging Face — or as API parameters on OpenAI/Groq.

Try It

Presets to try:

  • Greedy: $\tau = 0$, $p = 1.0$, $k = 0$ — safe and bland
  • Temperature sweep: fix $p = 0.9$, $k = 50$; sweep $\tau$ from 0.6 to 1.3
  • Top-p vs top-k: fix $\tau = 0.9$; compare $p = 0.9$ (k=0) vs $k = 40$ (p=1.0)

Next: Your First Model: Navigating a Hugging Face Repo