LLMs  

Understanding LLM Generation (Decoder/Sampling) Parameters: Control, Creativity, and Output

Introduction

Large Language Models (LLMs) such as GPT and LLaMA (often run locally through tools like Ollama) generate text by predicting the next token from a probability distribution. Controlling this distribution lets developers influence the creativity, randomness, length, and coherence of outputs. This article explains the key parameters and provides example code to experiment with them.

How LLMs Generate Text

An LLM takes a prompt and produces a probability distribution over the next token:

P(token ∣ previous tokens)

The model then samples from this distribution to produce the next token. Parameters such as temperature, top-k, and top-p reshape this distribution, controlling the style of the output.
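To make this concrete, here is a minimal sketch of that sampling step in plain Python, using toy tokens and hand-picked logits rather than a real model:

```python
import math
import random

def softmax(logits):
    # Convert raw model scores (logits) into a probability distribution.
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(tokens, logits, rng=random):
    # Sample the next token according to P(token | previous tokens).
    probs = softmax(logits)
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["mat", "sofa", "keyboard"]
logits = [3.0, 1.5, 0.2]
print(sample(tokens, logits))  # most often "mat", occasionally the others
```

Every parameter in the sections below works by modifying either the logits or the probabilities before this sampling step.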

1. Temperature – "Randomness / Creativity"

  • What it does: controls how likely the model is to pick less probable tokens.

  • Low temperature: the model picks the safest, most likely word.

  • High temperature: the model may choose creative or unusual words.

  • temperature = 0 - deterministic (greedy)

  • temperature = 1 - creative

  • temperature > 1 - very creative but risky

Example:
Prompt: "The cat sat on the _______"

Temperature | Output Example
0.0 | "mat."
0.5 | "sofa."
1.0 | "keyboard."
1.5 | "spaceship."
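Mechanically, temperature divides the logits before the softmax: values below 1 sharpen the distribution, values above 1 flatten it. A small sketch with toy logits (not a real model):

```python
import math

def apply_temperature(logits, temperature):
    # temperature -> 0 approaches greedy decoding; > 1 flattens the distribution.
    if temperature == 0:
        # Greedy: put all probability mass on the single most likely token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.5, 0.2]
print(apply_temperature(logits, 0.5))  # sharper: top token dominates
print(apply_temperature(logits, 1.5))  # flatter: more spread across tokens
```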

2. Top-k – "Limit Choices"

  • What it does: Only allows the model to consider the top k most probable tokens at each step.

  • Effect: Reduces the chance of weird, very low-probability words.

  • Prevents unlikely words from being chosen.

  • Gives more control than temperature alone.

Top-k values:

  • 5 - very safe

  • 40 - balanced

  • 100 - more creative

Example:
Prompt: "I went to the __________"

Top-k | Output Example (candidate pool)
5 | "store"
50 | "store", "park", "library"
100 | "store", "park", "library", "zoo", "museum"
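The filtering step itself is simple; a sketch operating on toy probabilities (a real implementation works on logits over the full vocabulary):

```python
def top_k_filter(probs, k):
    # Keep only the k most probable tokens, then renormalize so the
    # remaining probabilities sum to 1.
    indexed = sorted(enumerate(probs), key=lambda t: t[1], reverse=True)[:k]
    total = sum(p for _, p in indexed)
    return {i: p / total for i, p in indexed}

probs = [0.5, 0.2, 0.15, 0.1, 0.05]
print(top_k_filter(probs, 2))  # only the two most likely tokens survive
```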

3. Top-p / Nucleus Sampling - "Cumulative Probability"

  • What it does: instead of a fixed k, considers only the smallest set of tokens whose cumulative probability reaches p.

  • Effect: more adaptive than top-k - the candidate pool grows or shrinks with the shape of the distribution.

Top-p values:

  • 0.5 - very focused

  • 0.9 - balanced

  • 0.95 - creative

Example:
Prompt: "She loves eating _____________"

Top-p | Output Example (candidate pool)
0.5 | "pizza"
0.8 | "pizza", "ice cream"
0.95 | "pizza", "ice cream", "sushi", "cake"
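A sketch of nucleus filtering with toy probabilities: sort by probability, accumulate until the cumulative mass reaches p, drop the rest, and renormalize:

```python
def top_p_filter(probs, p):
    # Keep the smallest set of most-probable tokens whose cumulative
    # probability reaches p, then renormalize.
    indexed = sorted(enumerate(probs), key=lambda t: t[1], reverse=True)
    kept, cum = [], 0.0
    for i, prob in indexed:
        kept.append((i, prob))
        cum += prob
        if cum >= p:
            break
    total = sum(pr for _, pr in kept)
    return {i: pr / total for i, pr in kept}

probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(probs, 0.8))   # keeps tokens 0 and 1
print(top_p_filter(probs, 0.95))  # pool grows: tokens 0, 1 and 2
```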

4. Max Tokens / Max Length

  • What it does: Limits the number of tokens generated.

  • Effect: Controls output length.

  • Prevents runaway, overly long output.

Example:
Prompt: "Explain gravity in one sentence"

Max Tokens | Output Example
5 | "Gravity pulls things."
15 | "Gravity is the force that pulls objects toward Earth."
30 | "Gravity is the universal force that attracts all objects with mass towards each other, keeping planets in orbit."
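The cutoff itself is just a loop bound around the sampling step. A sketch with a stub token generator standing in for the model (the next_token_fn callback is a made-up name for illustration):

```python
def generate(next_token_fn, max_tokens):
    # Stop unconditionally once max_tokens tokens have been produced,
    # even if the model "wants" to keep going.
    out = []
    for _ in range(max_tokens):
        out.append(next_token_fn(out))
    return out

# Stub "model" that always emits the same token.
words = generate(lambda ctx: "token", max_tokens=5)
print(len(words))  # 5
```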

5. Presence Penalty

  • What it does: Discourages the model from repeating words it has already used.

  • Effect: Encourages new topics/words.

  • Very useful for storytelling.

Example:
Prompt: "Write a story about a dragon"

Presence Penalty | Output Difference
0 | "The dragon flew. The dragon breathed fire. The dragon roared."
1.0 | "The dragon soared. It breathed fire. Its roar echoed through the mountains."

6. Frequency Penalty

  • What it does: Penalizes repeated words based on how often they appear.

  • Effect: Reduces monotonous repetition.

Example: Prompt: "Describe a forest"

Frequency Penalty | Output Example
0 | "The trees were tall. The trees were green. The trees were beautiful."
1.0 | "Tall trees stretched toward the sky, their leaves shimmering in the sunlight."
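Both penalties can be sketched together as logit adjustments applied before sampling. The formula below follows the OpenAI-style formulation (an assumption here; other APIs differ): the presence penalty is subtracted once for any token that has already appeared, while the frequency penalty is subtracted once per occurrence:

```python
from collections import Counter

def apply_penalties(logits, generated, presence_penalty, frequency_penalty):
    # generated is the list of token ids produced so far.
    counts = Counter(generated)
    adjusted = list(logits)
    for token, count in counts.items():
        adjusted[token] -= presence_penalty          # appeared at least once
        adjusted[token] -= frequency_penalty * count # scaled by repetitions
    return adjusted

logits = [2.0, 1.0, 0.5]
# Token 0 appeared twice, token 1 once, token 2 never.
print(apply_penalties(logits, [0, 0, 1], 0.5, 0.25))
# token 0: 2.0 - 0.5 - 0.25*2 = 1.0
# token 1: 1.0 - 0.5 - 0.25   = 0.25
# token 2: unchanged at 0.5
```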

7. Stop Sequences

  • What it does: Defines text patterns where the model stops generating.

Example: Prompt: "List three fruits: "

  • Stop sequences: ["\n"]

  • Output: "Apple, Banana, Orange" (generation stops at the newline)

Without stop sequence, it might continue generating more fruits endlessly.
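In practice the generated text is cut at the earliest occurrence of any stop sequence. A post-processing sketch (real APIs stop generation itself, which also saves tokens):

```python
def truncate_at_stop(text, stop_sequences):
    # Cut the output at the earliest occurrence of any stop sequence.
    cut = len(text)
    for s in stop_sequences:
        idx = text.find(s)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

print(truncate_at_stop("Apple, Banana, Orange\nGrape, Mango", ["\n"]))
# Apple, Banana, Orange
```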

8. Logit Bias

  • What it does: increases or decreases the chance of specific tokens appearing by adding a fixed offset to their logits.

Example:
Prompt: "The answer is _____________ "

  • Logit bias: {50256: -100} - a bias of -100 effectively bans the token with id 50256; it will almost never appear. (Which word a token id maps to depends on the model's tokenizer.)

  • Effect: the model avoids that token. A positive bias (up to +100) does the opposite, strongly favoring it.

Useful for forcing or blocking specific words.
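The mechanism is a simple additive adjustment to the logits before the softmax; a sketch with toy values:

```python
def apply_logit_bias(logits, bias):
    # bias maps token id -> additive adjustment; -100 effectively bans
    # a token, +100 effectively forces it.
    return [x + bias.get(i, 0.0) for i, x in enumerate(logits)]

logits = [1.0, 2.0, 3.0]
print(apply_logit_bias(logits, {2: -100.0}))  # [1.0, 2.0, -97.0]
```

After the softmax, a logit of -97 next to logits of 1 and 2 gives that token a vanishingly small probability.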

9. Seed

  • What it does: Ensures reproducible outputs when randomness is involved.

  • Same prompt + same temperature + same seed = same text.

A seed fixes the random number generator, so the model produces the same output every time for the same prompt and parameters. Note that with temperature = 0 the output is already deterministic (greedy decoding), so the seed has no effect.
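The idea is the same as seeding any random number generator; a sketch with toy tokens and weights:

```python
import random

def sample_with_seed(tokens, weights, seed):
    # A fixed seed makes the weighted draw reproducible.
    rng = random.Random(seed)
    return rng.choices(tokens, weights=weights, k=1)[0]

a = sample_with_seed(["pizza", "sushi", "cake"], [0.5, 0.3, 0.2], seed=42)
b = sample_with_seed(["pizza", "sushi", "cake"], [0.5, 0.3, 0.2], seed=42)
print(a == b)  # True: same seed, same output
```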

Recommended Parameter Settings

Task | Temperature | Top-p | Top-k
Math / logic | 0.0–0.2 | 1.0 | 20
Code | 0.1–0.3 | 0.9 | 40
Technical writing | 0.3–0.5 | 0.9 | 40
Chatbot | 0.6–0.8 | 0.9 | 50
Storytelling | 0.8–1.2 | 0.95 | 100
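For convenience, the table can be expressed as a configuration mapping (values taken from the table above; the temperatures are midpoints chosen from the recommended ranges):

```python
# Sampling presets per task, derived from the recommendations above.
PRESETS = {
    "math":      {"temperature": 0.1, "top_p": 1.0,  "top_k": 20},
    "code":      {"temperature": 0.2, "top_p": 0.9,  "top_k": 40},
    "technical": {"temperature": 0.4, "top_p": 0.9,  "top_k": 40},
    "chatbot":   {"temperature": 0.7, "top_p": 0.9,  "top_k": 50},
    "story":     {"temperature": 1.0, "top_p": 0.95, "top_k": 100},
}

print(PRESETS["code"]["temperature"])  # 0.2
```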

Ollama Implementation

import requests
import json

url = "http://localhost:11434/api/generate"

# Note: Ollama expects sampling parameters inside the "options" object;
# it ignores them if they are passed at the top level of the payload.
payload = {
    "model": "llama3",
    "prompt": "Write a short story about a robot",
    "stream": True,
    "options": {
        "temperature": 0.7,
        "top_k": 40,
        "top_p": 0.9,
        "num_predict": 100,
        "presence_penalty": 0.6,
        "frequency_penalty": 0.4,
        "seed": 42,
        "stop": ["\n"]
    }
}

response = requests.post(url, json=payload, stream=True)

for line in response.iter_lines():
    if line:
        print(json.loads(line)["response"], end="")

Key takeaway

  • Temperature, top-p, top-k - control creativity & randomness

  • Max tokens, stop sequences - control length & end of generation

  • Presence/frequency penalties - control repetition

  • Logit bias - control specific words

  • Seed - reproducibility