Hey folks,

Last week we covered Tokens & Context Windows. This week: temperature and sampling, the settings that control how random or predictable your AI outputs are.

You've probably noticed this. You ask ChatGPT the same question twice and get two different answers. Or you're using an AI coding assistant and it suggests a completely different function name each time you hit tab. Sometimes that variety is great. Sometimes you just want the same reliable answer. The difference comes down to a handful of settings most people never touch.

How Models Actually Pick the Next Word

Before we get to temperature, you need to understand how LLMs generate text. They don't "think" about what to say. They predict the next token, one at a time.

For every position in the output, the model calculates a probability for every token in its vocabulary (often 50,000+ tokens). Then it picks one. Then it does it again for the next position.

Say you prompt: "The capital of France is"

The model might assign probabilities like:

  • "Paris" : 96.1%
  • "Lyon" : 1.2%
  • "the" : 0.8%
  • "Marseille" : 0.4%
  • ... thousands more tokens with tiny probabilities

The question is: how does the model choose from this list? That's where sampling comes in.
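The two basic choices can be sketched in a few lines of Python. This is a toy illustration using the made-up "Paris" numbers above, not real model output:

```python
import random

# Toy next-token distribution for "The capital of France is"
# (illustrative numbers, not real model probabilities)
tokens = ["Paris", "Lyon", "the", "Marseille"]
probs  = [0.961, 0.012, 0.008, 0.004]

# Greedy decoding: always take the highest-probability token
greedy = tokens[probs.index(max(probs))]

# Sampling: pick a token at random, weighted by probability
sampled = random.choices(tokens, weights=probs, k=1)[0]

print(greedy)   # always "Paris"
print(sampled)  # usually "Paris", occasionally something else
```

Everything that follows (temperature, top-p, top-k) is a variation on how that weighted pick is made.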

What Is Temperature?

Temperature is a number (usually between 0 and 2) that controls how much randomness the model uses when picking the next token.

Think of it like a confidence dial:

  • Temperature 0: The model always picks the highest-probability token. Deterministic. Predictable. The same input gives you the same output (or very close to it).
  • Temperature 0.7: The default for most chatbots. The model mostly picks likely tokens but occasionally takes a less obvious path. You get variety without chaos.
  • Temperature 1.0: The model samples directly from its raw probability distribution. More creative, more surprising, more likely to go off-script.
  • Temperature 1.5+: The model starts giving real weight to low-probability tokens. Outputs get weird. Sometimes interesting-weird, often broken-weird.

Here's what's actually happening mathematically. Temperature divides the raw scores (called logits) before converting them to probabilities. A low temperature makes the gap between likely and unlikely tokens bigger. A high temperature flattens everything out, giving unlikely tokens a better shot.

That "Paris" example from above? At temperature 0, the model picks "Paris" every time. At temperature 1.5, "Lyon" and "Marseille" suddenly have a real chance of showing up.
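You can see the logit-dividing trick in a small sketch. The logit values here are invented for illustration; the softmax function itself is the standard one:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities, with temperature scaling."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for "Paris", "Lyon", "Marseille"
logits = [10.0, 5.0, 4.0]

for t in (0.5, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 4) for p in probs]}")
```

Run it and you'll see the distribution sharpen toward "Paris" at low temperature and flatten out at high temperature. (Temperature 0 is handled as a special case in practice, greedy argmax, since you can't divide by zero.)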

Temperature Is Not Creativity

I've seen this misconception a lot. People crank temperature to 1.5 thinking they'll get more creative writing. What they actually get is less coherent writing.

Temperature doesn't make the model smarter or more imaginative. It just makes it more willing to pick unlikely tokens. Sometimes that produces interesting word combinations. More often it produces grammatical errors, factual mistakes, and sentences that trail off into nonsense.

If you want genuinely better creative output, you're usually better off keeping temperature moderate (0.7 to 1.0) and improving your prompt instead.

Top-p (Nucleus Sampling)

Temperature isn't the only knob. Top-p (also called nucleus sampling) is another way to control randomness, and it works differently.

Instead of scaling all probabilities, top-p sets a cutoff. The model sorts tokens by probability, adds them up from most likely to least likely, and stops when it hits the threshold you set.

With top-p = 0.9, the model only considers the smallest set of tokens whose combined probability is 90%. Everything else gets zeroed out.
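The cutoff logic looks roughly like this sketch, again using the toy "Paris" numbers (renormalizing the surviving tokens is one common convention):

```python
def top_p_filter(token_probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability >= p."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, prob in ranked:
        kept[token] = prob
        cumulative += prob
        if cumulative >= p:
            break
    # Renormalize so the surviving probabilities sum to 1
    total = sum(kept.values())
    return {t: pr / total for t, pr in kept.items()}

probs = {"Paris": 0.961, "Lyon": 0.012, "the": 0.008, "Marseille": 0.004}
print(top_p_filter(probs, p=0.9))  # only "Paris" survives the 90% cutoff
```

Notice that because "Paris" alone already covers 96% of the mass, top-p = 0.9 collapses to a single candidate here. For an open-ended prompt, the same setting would keep hundreds of tokens.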

  Setting           What It Does                              Effect
  Temperature 0.3   Sharpens the probability distribution     Very predictable, focused
  Temperature 0.7   Moderate randomness                       Balanced, default for chat
  Temperature 1.2   Flattens the distribution                 Unpredictable, sometimes incoherent
  Top-p 0.1         Only the top ~10% of probability mass     Very narrow token choices
  Top-p 0.9         The top ~90% of probability mass          Wide but still bounded

Most APIs let you set both temperature and top-p simultaneously. OpenAI's recommendation: adjust one and leave the other at its default. Tuning both at once makes it hard to predict what you'll get.

Top-k and Other Sampling Methods

Top-k is simpler than top-p. It just limits the model to the k most probable tokens, regardless of their probabilities. Set top-k to 50 and the model picks from only the top 50 candidates.

The downside: top-k doesn't adapt. For a prompt like "The capital of France is," the top 5 tokens cover 99% of the probability mass, so top-k = 50 is wasteful. For a more open-ended prompt like "Write me a poem about," the probability is spread across thousands of tokens, so top-k = 50 might be too restrictive.

Top-p adapts naturally to this. That's why most modern APIs default to top-p over top-k.
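For contrast with the top-p sketch, here's the top-k version (same toy numbers, same renormalization convention):

```python
def top_k_filter(token_probs, k=50):
    """Keep only the k most probable tokens, regardless of their mass."""
    ranked = sorted(token_probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    # Renormalize the survivors so they sum to 1
    total = sum(p for _, p in ranked)
    return {t: p / total for t, p in ranked}

probs = {"Paris": 0.961, "Lyon": 0.012, "the": 0.008, "Marseille": 0.004}
print(top_k_filter(probs, k=2))  # "Paris" and "Lyon", renormalized
```

The fixed k is the whole problem: it keeps two tokens here whether the distribution is a 96% landslide or a near-uniform spread.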

There are a few other sampling parameters you'll encounter:

  • Frequency penalty: Reduces the chance of repeating tokens that already appeared in the output. Helps avoid loops where the model says the same phrase over and over.
  • Presence penalty: Similar, but it doesn't care how many times a token appeared, just whether it appeared at all. Encourages the model to bring in new topics.
  • Repetition penalty: A broader version used by some open-source models. Same idea: stop repeating yourself.
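Both penalties work by nudging logits down before sampling. This sketch follows roughly the scheme OpenAI documents (penalty values and logits here are made up for illustration):

```python
def apply_penalties(logits, generated_counts,
                    frequency_penalty=0.5, presence_penalty=0.3):
    """Lower the logits of tokens that already appeared in the output.

    frequency_penalty scales with how often a token appeared;
    presence_penalty is a flat cost for appearing at all.
    """
    adjusted = {}
    for token, logit in logits.items():
        count = generated_counts.get(token, 0)
        adjusted[token] = (logit
                           - frequency_penalty * count
                           - presence_penalty * (1 if count > 0 else 0))
    return adjusted

logits = {"the": 3.0, "cat": 2.5, "dog": 2.4}
counts = {"the": 4}  # "the" already appeared four times in the output
print(apply_penalties(logits, counts))
```

After the penalty, "the" drops from the clear favorite to the least likely of the three, which is exactly the anti-looping effect you want.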

Which Settings for Which Job

This is where it gets practical. The right temperature depends entirely on what you're building.

  Use Case                           Temperature   Top-p        Why
  Code generation                    0 - 0.2       0.1 - 0.3    You want correct, predictable code
  Data extraction / classification   0             1.0          One right answer, no creativity needed
  Customer support chatbot           0.3 - 0.5     0.8          Consistent but not robotic
  General conversation               0.7           0.9          Balanced, natural-sounding
  Creative writing / brainstorming   0.8 - 1.0     0.95         Variety and surprise, within reason

If you're using ChatGPT through the web interface, you don't get to set these directly (OpenAI picks defaults for you). But if you're using the API, or tools like Cursor, LM Studio, or Ollama, these settings are exposed and worth tuning.

This is why the same model can feel completely different in two products. One app using GPT-4o at temperature 0 for code review will feel rigid and precise. Another using the same model at temperature 0.9 for storytelling will feel loose and surprising. Same brain, different dial settings.

A Quick Example

Here's how you'd set temperature and top-p in an OpenAI API call:

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    temperature=0.2,  # low and focused; leave top_p at its default (tune one knob, not both)
    messages=[
        {"role": "user", "content": "Summarize the key risks of this contract clause: ..."}
    ]
)

print(response.choices[0].message.content)

For a legal summarization task like this, low temperature is the right call. You want accuracy and consistency, not flair.

The Seed Parameter

One more thing worth knowing. Even at temperature 0, you might occasionally get slightly different outputs from the same prompt. This happens because of floating-point math and infrastructure differences on the provider's side.

If you need truly reproducible outputs (for testing, compliance, or debugging), some APIs offer a seed parameter. Set the same seed with the same prompt and model, and you'll get identical outputs. OpenAI supports this, though they note it's "best effort" rather than a guarantee.

Key Takeaway

Temperature and sampling settings control how your model picks from its predicted token probabilities. They don't change what the model knows. They change how it chooses.

  • Temperature 0 means greedy decoding: always pick the most likely token. Predictable but sometimes flat.
  • Temperature 0.7 is the sweet spot for most conversational use cases.
  • Going above 1.0 doesn't add creativity. It adds randomness. Those aren't the same thing.
  • Top-p dynamically limits the token pool. More adaptive than top-k.
  • Tune temperature OR top-p, not both at once.
  • Frequency and presence penalties help with repetition, not randomness.
  • For production systems, set these explicitly. Don't rely on defaults you didn't choose.

What's Next

Next week: Model Quantization, making AI faster and cheaper.

Read the full AI Learning series -> Learn AI

Thanks for reading! Got questions, feedback, or want to chat about AI? Hit reply – I read and respond to every message. And if you found this valuable, feel free to forward it to a friend who'd benefit!

Pranay
Infolia.ai
