What is Prompt Engineering?
Prompt engineering is the process of crafting and refining prompts through repeated trial and error. While this approach can work, it is labor-intensive and relies heavily on intuition and assumptions, consuming significant time, effort, and cost. When experiments fail, which happens frequently, it leads to frustration and often forces you to start the entire process from scratch.
This method is useful for initial exploration and small-scale experiments. However, it becomes difficult to scale effectively when dealing with real-world data, diverse scenarios, and production-level requirements.
Key pain areas in prompt engineering:
Creating a relevant prompt that the AI understands
Generating a golden dataset for verification
Handling change management
What is Automated Prompt Engineering?
Automated prompt engineering, also known as prompt optimization, is a systematic technique that removes much of the guesswork from the process. It automates the creation, testing, and refinement of prompts while providing clear, data-driven insights into why a particular prompt performs best for your use case.
By integrating prompt optimization into your workflows, you can significantly reduce manual effort, accelerate development cycles, and dramatically improve the reliability and performance of your Generative AI applications.
We will explore a prompt optimization approach using a prompt adapter built on the GEPA (Genetic-Pareto) optimization algorithm. GEPA uses iterative mutation, reflection, and Pareto-aware candidate selection to improve text components such as prompts. It leverages large language models to reflect on system behavior and propose improvements. The experiment runs in MLflow, with a Mistral model (ministral-3:3b, served through Ollama) as the underlying LLM.
Now let's get hands-on with a use case: building a natural-language mathematician. We want an AI that can take any simple math question written in natural language and return the correct answer. At a high level, we will:
Start with a basic prompt
Let GEPA automatically optimize it
End up with a high-accuracy mathematician prompt
The following steps implement the prompt optimization.
Step 1: Install Required Packages
pip install mlflow openai
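Make sure you have an MLflow server running at http://localhost:5000 with a gateway endpoint (named ollama-dev in this article) that routes to your Ollama model; the code in the following steps assumes both. Depending on your MLflow version, the GEPA optimizer may also require the gepa package (pip install gepa).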
Step 2: Import Libraries
import mlflow
from mlflow.genai.optimize import GepaPromptOptimizer
from mlflow.genai.scorers import Correctness
from openai import OpenAI
Step 3: Register the Base Prompt in MLflow Prompt Registry
base_prompt = mlflow.genai.register_prompt(
    name="mathematician-prompt",
    template=(
        "You are a mathematician. For any maths question in natural language you have to return response.\n\n"
        "Question: {{ question }}\n\n"
        "Answer:"
    ),
)
This creates version 1 of our starting prompt in the registry.
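To make the `{{ question }}` placeholder concrete, here is a minimal sketch of how such a template could be filled in. The `render_template` helper is hypothetical and written only for illustration; it is not MLflow's implementation, which you would reach through the loaded prompt's `format` method as in Step 5.

```python
import re


def render_template(template: str, **variables: str) -> str:
    """Fill {{ name }} placeholders; raise KeyError if a variable is missing."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"missing template variable: {name}")
        return str(variables[name])

    # Accept optional whitespace inside the braces: {{question}} or {{ question }}
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = (
    "You are a mathematician. For any maths question in natural language "
    "you have to return response.\n\n"
    "Question: {{ question }}\n\n"
    "Answer:"
)

rendered = render_template(template, question="What is one plus one?")
print(rendered)
```

Raising on a missing variable, rather than leaving the placeholder in place, makes a mismatch between the template and the training data visible before any model call is made.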
Step 4: Prepare Training Data (Golden Dataset)
train_data = [
    {
        "inputs": {"question": "What is one plus one?"},
        "expectations": {"expected_response": "Two"},
    },
    {
        "inputs": {"question": "What is one minus one?"},
        "expectations": {"expected_response": "Zero"},
    },
    {
        "inputs": {"question": "What is three multiply by two?"},
        "expectations": {"expected_response": "Six"},
    },
    {
        "inputs": {"question": "What is fifty divided by five?"},
        "expectations": {"expected_response": "Ten"},
    },
]
These examples will be used both for optimization and scoring.
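Because a malformed record can waste an optimization run, it can help to check the dataset before starting. The sketch below is a hypothetical helper, not part of MLflow. It assumes that each record's `inputs` keys are passed to the prediction function as keyword arguments and that every record carries an `expected_response`; the stand-in `predict` function exists only so the check can run without a model.

```python
import inspect


def validate_train_data(train_data, predict_fn):
    """Return a list of human-readable problems; an empty list means the data looks fine."""
    expected_args = set(inspect.signature(predict_fn).parameters)
    problems = []
    for i, record in enumerate(train_data):
        inputs = record.get("inputs")
        expectations = record.get("expectations")
        if not isinstance(inputs, dict) or not inputs:
            problems.append(f"record {i}: missing or empty 'inputs'")
        elif set(inputs) != expected_args:
            problems.append(
                f"record {i}: input keys {sorted(inputs)} do not match "
                f"predict function arguments {sorted(expected_args)}"
            )
        if not isinstance(expectations, dict) or "expected_response" not in expectations:
            problems.append(f"record {i}: missing 'expected_response'")
    return problems


def predict(question: str) -> str:
    """Stand-in with the same signature as the real prediction function."""
    return ""


good = [
    {"inputs": {"question": "What is one plus one?"},
     "expectations": {"expected_response": "Two"}},
]
bad = [
    {"inputs": {"query": "What is one plus one?"},  # wrong key name
     "expectations": {}},                           # no expected answer
]

print(validate_train_data(good, predict))  # []
for problem in validate_train_data(bad, predict):
    print(problem)
```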
Step 5: Define the Prediction Function
This function loads the registered prompt, fills in the question, and calls your LLM (in this case, an Ollama endpoint reached through the MLflow gateway). It returns the text of the reply so that the scorer can compare it with the expected response.
def my_predict_fn(question: str) -> str:
    # Load the prompt from the registry so the optimizer can track it
    prompt = mlflow.genai.load_prompt("prompts:/mathematician-prompt/1")
    client = OpenAI(
        base_url="http://localhost:5000/gateway/mlflow/v1",
        api_key="",  # API key not needed (configured server-side)
    )
    messages = [{"role": "user", "content": prompt.format(question=question)}]
    response = client.chat.completions.create(
        model="ollama-dev",  # Your endpoint name
        messages=messages,
    )
    # Return the reply text, not the message object
    return response.choices[0].message.content
Step 6: Run GEPA Prompt Optimization
result = mlflow.genai.optimize_prompts(
    predict_fn=my_predict_fn,
    train_data=train_data,
    prompt_uris=[base_prompt.uri],
    optimizer=GepaPromptOptimizer(
        reflection_model="ollama:/ministral-3:3b",
        max_metric_calls=500,  # Maximum optimization budget
    ),
    # LLM judge that scores every candidate prompt
    scorers=[Correctness(model="ollama:/ministral-3:3b")],
    enable_tracking=True,
)
What happens behind the scenes
GEPA creates multiple prompt variations (mutations)
The reflection model analyzes performance
Pareto-aware selection keeps the best trade-offs
The process repeats until the optimization budget is exhausted
The Correctness scorer acts as the reward signal
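The Pareto-aware selection step can be illustrated with a small, simplified sketch. This is not GEPA's actual implementation: the candidate names and per-example scores below are made up, and the code only shows the general idea of keeping every candidate that no other candidate beats across all training examples.

```python
def dominates(a, b):
    """True if scores `a` are at least as good as `b` everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep candidates whose per-example scores are not dominated by any other candidate."""
    return {
        name: scores
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other_name, other_scores in candidates.items()
            if other_name != name
        )
    }


# Hypothetical per-example scores (1 = correct, 0 = wrong) on four training questions
candidates = {
    "base": [1, 0, 0, 1],
    "terse": [1, 1, 0, 1],
    "step_by_step": [0, 1, 1, 0],
    "verbose": [0, 1, 0, 0],
}

front = pareto_front(candidates)
print(sorted(front))  # ['step_by_step', 'terse']
```

Here `step_by_step` survives even though its total score is lower than that of `terse`, because it is the only candidate that solves the third example. Keeping such candidates preserves strategies that a later mutation might combine.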
Step 7: Load and View the Optimized Prompt
optimized = mlflow.genai.load_prompt(result.optimized_prompts[0].uri)
print(optimized.template)
You should now see an improved prompt template that GEPA discovered automatically.
Below is my output:
You are an expert mathematician specializing in precise arithmetic.
Rules:
- Compute the exact result of the math question.
- Respond with ONLY the final answer. Nothing else.
- Use number words for small results (e.g., Two, Zero, Six, Ten).
- Do not explain, do not show steps, do not add any extra text.
- Be extremely accurate with +, -, multiply, divide.
Question: {{ question }}
Answer:
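One way to check whether an optimized prompt like the one above really helps is to compare exact-match accuracy on questions that were not used for training. The sketch below is an assumption-laden illustration: the held-out questions are hypothetical, and the two stub predictors merely imitate a terse and a chatty model so the code runs without an LLM. In practice you would pass in prediction functions that call your model with the base and optimized prompts.

```python
def normalize(text: str) -> str:
    """Lowercase and strip surrounding whitespace and trailing punctuation."""
    return text.strip().rstrip(".!").strip().lower()


def exact_match_accuracy(predict, data) -> float:
    """Fraction of records whose normalized prediction equals the expected response."""
    hits = sum(
        normalize(predict(**record["inputs"]))
        == normalize(record["expectations"]["expected_response"])
        for record in data
    )
    return hits / len(data)


# Hypothetical held-out examples, in the same shape as the training data
held_out = [
    {"inputs": {"question": "What is two plus two?"},
     "expectations": {"expected_response": "Four"}},
    {"inputs": {"question": "What is nine minus four?"},
     "expectations": {"expected_response": "Five"}},
]

# Stub predictors standing in for real model calls
canned = {"What is two plus two?": "Four", "What is nine minus four?": "Five"}


def terse_stub(question: str) -> str:
    return canned[question]


def chatty_stub(question: str) -> str:
    return f"The answer is {canned[question]}."


print(exact_match_accuracy(terse_stub, held_out))   # 1.0
print(exact_match_accuracy(chatty_stub, held_out))  # 0.0
```

Exact match is stricter than an LLM judge such as Correctness, so the two can disagree; it is used here only because it is cheap and deterministic.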
The complete code looks like this:
import mlflow
from mlflow.genai.optimize import GepaPromptOptimizer
from mlflow.genai.scorers import Correctness
from openai import OpenAI

# Register a base prompt in the MLflow Prompt Registry
base_prompt = mlflow.genai.register_prompt(
    name="mathematician-prompt",
    template=(
        "You are a mathematician. For any maths question in natural language you have to return response.\n\n"
        "Question: {{ question }}\n\n"
        "Answer:"
    ),
)

# Prepare training data with expected outputs
train_data = [
    {
        "inputs": {"question": "What is one plus one?"},
        "expectations": {"expected_response": "Two"},
    },
    {
        "inputs": {"question": "What is one minus one?"},
        "expectations": {"expected_response": "Zero"},
    },
    {
        "inputs": {"question": "What is three multiply by two?"},
        "expectations": {"expected_response": "Six"},
    },
    {
        "inputs": {"question": "What is fifty divided by five?"},
        "expectations": {"expected_response": "Ten"},
    },
]


def my_predict_fn(question: str) -> str:
    # Load the prompt from the registry so the optimizer can track it
    prompt = mlflow.genai.load_prompt("prompts:/mathematician-prompt/1")
    client = OpenAI(
        base_url="http://localhost:5000/gateway/mlflow/v1",
        api_key="",  # API key not needed, configured server-side
    )
    messages = [{"role": "user", "content": prompt.format(question=question)}]
    response = client.chat.completions.create(
        model="ollama-dev",  # Endpoint name as model
        messages=messages,
    )
    # Return the reply text, not the message object
    return response.choices[0].message.content


# Run GEPA optimization
result = mlflow.genai.optimize_prompts(
    predict_fn=my_predict_fn,
    train_data=train_data,
    prompt_uris=[base_prompt.uri],
    optimizer=GepaPromptOptimizer(
        reflection_model="ollama:/ministral-3:3b",
        max_metric_calls=500,
    ),
    # LLM judge that scores each candidate prompt's responses;
    # the optimizer uses these scores as a reward signal
    # to guide its search and identify prompt improvements
    scorers=[Correctness(model="ollama:/ministral-3:3b")],
    enable_tracking=True,
)

# Print the optimized prompt
optimized = mlflow.genai.load_prompt(result.optimized_prompts[0].uri)
print(optimized.template)
In the MLflow UI, we can see which inputs were attempted during optimization:
![promptflow-promptoptimization]()
What You Get at the End
A production-ready prompt that consistently outperforms your manual version
Full MLflow tracking of every candidate prompt, scores, and mutations
Clear data-driven insights into why the final prompt works better
A repeatable, version-controlled optimization workflow
Prompt engineering remains one of the most time-consuming and inconsistent aspects of building Generative AI applications. Traditional manual prompt engineering relies on repeated trial-and-error, heavy human intuition, and extensive experimentation, making it labor-intensive, costly, and difficult to scale for real-world production use cases.
Automated prompt engineering addresses these challenges by systematically optimizing prompts using data-driven techniques. This article introduces MLflow's GEPA (Genetic-Pareto) Prompt Optimizer, an advanced approach that combines genetic evolution, Pareto-aware selection, LLM-based reflection, and automated scoring to discover high-performing prompts with minimal manual effort.