LLMs  

Understanding LLM Generation (Decoder/Sampling) Parameters: Control, Creativity, and Output

Introduction

Large Language Models (LLMs) such as GPT and LLaMA (often run locally through tools like Ollama) generate text by predicting the next token from a probability distribution. Controlling this distribution lets developers influence the creativity, randomness, length, and coherence of outputs. This article explains the key parameters and provides example code to experiment with them.

How LLMs Generate Text

An LLM takes a prompt and produces a probability distribution over the next token:

P(token ∣ previous tokens)

The model then samples from this distribution to produce the next token. Parameters such as temperature, top-k, and top-p reshape this distribution, controlling the style of the output.
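To make this concrete, here is a minimal sketch of that sampling step in plain Python, using toy tokens and hand-picked logits rather than a real model:

```python
import math
import random

def softmax(logits):
    # Convert raw model scores (logits) into a probability distribution.
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(tokens, logits, rng=random):
    # Sample the next token according to P(token | previous tokens).
    probs = softmax(logits)
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = ["mat", "sofa", "keyboard"]
logits = [3.0, 1.5, 0.2]
print(sample(tokens, logits))  # most often "mat", occasionally the others
```

Every parameter in the sections below works by modifying either the logits or the probabilities before this sampling step.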

1. Temperature – "Randomness / Creativity"

  • What it does: controls how likely the model is to pick less probable tokens.

  • Low temperature: the model picks the safest, most likely word.

  • High temperature: the model may choose creative or unusual words.

  • temperature = 0 - deterministic (greedy)

  • temperature = 1 - creative

  • temperature > 1 - very creative but risky

Example:
Prompt: "The cat sat on the _______"

Temperature | Output Example
0.0 | "mat."
0.5 | "sofa."
1.0 | "keyboard."
1.5 | "spaceship."
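Mechanically, temperature divides the logits before the softmax: values below 1 sharpen the distribution, values above 1 flatten it. A small sketch with toy logits (not a real model):

```python
import math

def apply_temperature(logits, temperature):
    # temperature -> 0 approaches greedy decoding; > 1 flattens the distribution.
    if temperature == 0:
        # Greedy: put all probability mass on the single most likely token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [3.0, 1.5, 0.2]
print(apply_temperature(logits, 0.5))  # sharper: top token dominates
print(apply_temperature(logits, 1.5))  # flatter: more spread across tokens
```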

2. Top-k – "Limit Choices"

  • What it does: Only allows the model to consider the top k most probable tokens at each step.

  • Effect: Reduces the chance of weird, very low-probability words.

  • Prevents unlikely words from being chosen.

  • Gives more control than temperature alone.

Top-k values:

  • 5 - very safe

  • 40 - balanced

  • 100 - more creative

Example:
Prompt: "I went to the __________"

Top-k | Output Example (candidate pool)
5 | "store"
50 | "store", "park", "library"
100 | "store", "park", "library", "zoo", "museum"
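The filtering step itself is simple; a sketch operating on toy probabilities (a real implementation works on logits over the full vocabulary):

```python
def top_k_filter(probs, k):
    # Keep only the k most probable tokens, then renormalize so the
    # remaining probabilities sum to 1.
    indexed = sorted(enumerate(probs), key=lambda t: t[1], reverse=True)[:k]
    total = sum(p for _, p in indexed)
    return {i: p / total for i, p in indexed}

probs = [0.5, 0.2, 0.15, 0.1, 0.05]
print(top_k_filter(probs, 2))  # only the two most likely tokens survive
```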

3. Top-p / Nucleus Sampling - "Cumulative Probability"

  • What it does: instead of a fixed k, considers only the smallest set of tokens whose cumulative probability reaches p.

  • Effect: more adaptive than top-k - the candidate pool grows or shrinks with the shape of the distribution.

Top-p values:

  • 0.5 - very focused

  • 0.9 - balanced

  • 0.95 - creative

Example:
Prompt: "She loves eating _____________"

Top-p | Output Example (candidate pool)
0.5 | "pizza"
0.8 | "pizza", "ice cream"
0.95 | "pizza", "ice cream", "sushi", "cake"
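A sketch of nucleus filtering with toy probabilities: sort by probability, accumulate until the cumulative mass reaches p, drop the rest, and renormalize:

```python
def top_p_filter(probs, p):
    # Keep the smallest set of most-probable tokens whose cumulative
    # probability reaches p, then renormalize.
    indexed = sorted(enumerate(probs), key=lambda t: t[1], reverse=True)
    kept, cum = [], 0.0
    for i, prob in indexed:
        kept.append((i, prob))
        cum += prob
        if cum >= p:
            break
    total = sum(pr for _, pr in kept)
    return {i: pr / total for i, pr in kept}

probs = [0.5, 0.3, 0.15, 0.05]
print(top_p_filter(probs, 0.8))   # keeps tokens 0 and 1
print(top_p_filter(probs, 0.95))  # pool grows: tokens 0, 1 and 2
```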

4. Max Tokens / Max Length

  • What it does: Limits the number of tokens generated.

  • Effect: Controls output length.

  • Prevents runaway, overly long output.

Example:
Prompt: "Explain gravity in one sentence"

Max Tokens | Output Example
5 | "Gravity pulls things."
15 | "Gravity is the force that pulls objects toward Earth."
30 | "Gravity is the universal force that attracts all objects with mass towards each other, keeping planets in orbit."
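The cutoff itself is just a loop bound around the sampling step. A sketch with a stub token generator standing in for the model (the next_token_fn callback is a made-up name for illustration):

```python
def generate(next_token_fn, max_tokens):
    # Stop unconditionally once max_tokens tokens have been produced,
    # even if the model "wants" to keep going.
    out = []
    for _ in range(max_tokens):
        out.append(next_token_fn(out))
    return out

# Stub "model" that always emits the same token.
words = generate(lambda ctx: "token", max_tokens=5)
print(len(words))  # 5
```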

5. Presence Penalty

  • What it does: Discourages the model from repeating words it has already used.

  • Effect: Encourages new topics/words.

  • Very useful for storytelling.

Example:
Prompt: "Write a story about a dragon"

Presence Penalty | Output Difference
0 | "The dragon flew. The dragon breathed fire. The dragon roared."
1.0 | "The dragon soared. It breathed fire. Its roar echoed through the mountains."

6. Frequency Penalty

  • What it does: Penalizes repeated words based on how often they appear.

  • Effect: Reduces monotonous repetition.

Example: Prompt: "Describe a forest"

Frequency Penalty | Output Example
0 | "The trees were tall. The trees were green. The trees were beautiful."
1.0 | "Tall trees stretched toward the sky, their leaves shimmering in the sunlight."
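Both penalties can be sketched together as logit adjustments applied before sampling. The formula below follows the OpenAI-style formulation (an assumption here; other APIs differ): the presence penalty is subtracted once for any token that has already appeared, while the frequency penalty is subtracted once per occurrence:

```python
from collections import Counter

def apply_penalties(logits, generated, presence_penalty, frequency_penalty):
    # generated is the list of token ids produced so far.
    counts = Counter(generated)
    adjusted = list(logits)
    for token, count in counts.items():
        adjusted[token] -= presence_penalty          # appeared at least once
        adjusted[token] -= frequency_penalty * count # scaled by repetitions
    return adjusted

logits = [2.0, 1.0, 0.5]
# Token 0 appeared twice, token 1 once, token 2 never.
print(apply_penalties(logits, [0, 0, 1], 0.5, 0.25))
# token 0: 2.0 - 0.5 - 0.25*2 = 1.0
# token 1: 1.0 - 0.5 - 0.25   = 0.25
# token 2: unchanged at 0.5
```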

7. Stop Sequences

  • What it does: Defines text patterns where the model stops generating.

Example: Prompt: "List three fruits: "

  • Stop sequences: ["\n"]

  • Output: "Apple, Banana, Orange" (generation stops at the newline)

Without stop sequence, it might continue generating more fruits endlessly.
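In practice the generated text is cut at the earliest occurrence of any stop sequence. A post-processing sketch (real APIs stop generation itself, which also saves tokens):

```python
def truncate_at_stop(text, stop_sequences):
    # Cut the output at the earliest occurrence of any stop sequence.
    cut = len(text)
    for s in stop_sequences:
        idx = text.find(s)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

print(truncate_at_stop("Apple, Banana, Orange\nGrape, Mango", ["\n"]))
# Apple, Banana, Orange
```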

8. Logit Bias

  • What it does: increases or decreases the chance of specific tokens appearing by adding a fixed offset to their logits.

Example:
Prompt: "The answer is _____________ "

  • Logit bias: {50256: -100} - a bias of -100 effectively bans the token with id 50256; it will almost never appear. (Which word a token id maps to depends on the model's tokenizer.)

  • Effect: the model avoids that token. A positive bias (up to +100) does the opposite, strongly favoring it.

Useful for forcing or blocking specific words.
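The mechanism is a simple additive adjustment to the logits before the softmax; a sketch with toy values:

```python
def apply_logit_bias(logits, bias):
    # bias maps token id -> additive adjustment; -100 effectively bans
    # a token, +100 effectively forces it.
    return [x + bias.get(i, 0.0) for i, x in enumerate(logits)]

logits = [1.0, 2.0, 3.0]
print(apply_logit_bias(logits, {2: -100.0}))  # [1.0, 2.0, -97.0]
```

After the softmax, a logit of -97 next to logits of 1 and 2 gives that token a vanishingly small probability.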

9. Seed

  • What it does: Ensures reproducible outputs when randomness is involved.

  • Same prompt + same temperature + same seed = same text.

A seed fixes the random number generator, so the model produces the same output every time for the same prompt and parameters. Note that with temperature = 0 the output is already deterministic (greedy decoding), so the seed has no effect.
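The idea is the same as seeding any random number generator; a sketch with toy tokens and weights:

```python
import random

def sample_with_seed(tokens, weights, seed):
    # A fixed seed makes the weighted draw reproducible.
    rng = random.Random(seed)
    return rng.choices(tokens, weights=weights, k=1)[0]

a = sample_with_seed(["pizza", "sushi", "cake"], [0.5, 0.3, 0.2], seed=42)
b = sample_with_seed(["pizza", "sushi", "cake"], [0.5, 0.3, 0.2], seed=42)
print(a == b)  # True: same seed, same output
```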

Recommended Parameter Settings

Task | Temperature | Top-p | Top-k
Math / logic | 0.0–0.2 | 1.0 | 20
Code | 0.1–0.3 | 0.9 | 40
Technical writing | 0.3–0.5 | 0.9 | 40
Chatbot | 0.6–0.8 | 0.9 | 50
Storytelling | 0.8–1.2 | 0.95 | 100
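For convenience, the table can be expressed as a configuration mapping (values taken from the table above; the temperatures are midpoints chosen from the recommended ranges):

```python
# Sampling presets per task, derived from the recommendations above.
PRESETS = {
    "math":      {"temperature": 0.1, "top_p": 1.0,  "top_k": 20},
    "code":      {"temperature": 0.2, "top_p": 0.9,  "top_k": 40},
    "technical": {"temperature": 0.4, "top_p": 0.9,  "top_k": 40},
    "chatbot":   {"temperature": 0.7, "top_p": 0.9,  "top_k": 50},
    "story":     {"temperature": 1.0, "top_p": 0.95, "top_k": 100},
}

print(PRESETS["code"]["temperature"])  # 0.2
```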

Ollama Implementation

import requests
import json

url = "http://localhost:11434/api/generate"

# Note: Ollama expects sampling parameters inside the "options" object;
# it ignores them if they are passed at the top level of the payload.
payload = {
    "model": "llama3",
    "prompt": "Write a short story about a robot",
    "stream": True,
    "options": {
        "temperature": 0.7,
        "top_k": 40,
        "top_p": 0.9,
        "num_predict": 100,
        "presence_penalty": 0.6,
        "frequency_penalty": 0.4,
        "seed": 42,
        "stop": ["\n"]
    }
}

response = requests.post(url, json=payload, stream=True)

for line in response.iter_lines():
    if line:
        print(json.loads(line)["response"], end="")

Key takeaway

  • Temperature, top-p, top-k - control creativity & randomness

  • Max tokens, stop sequences - control length & end of generation

  • Presence/frequency penalties - control repetition

  • Logit bias - control specific words

  • Seed - reproducibility