When I first started experimenting with large language models (LLMs), I quickly realized something: writing a “good prompt” is only half the battle. The real challenge is making sure your prompts consistently give reliable, unbiased, and cost-effective results.
That’s where Promptfoo comes into play. It’s an open-source tool designed to help developers test, evaluate, and benchmark prompts systematically—almost like unit tests for your LLM workflows.
Instead of crossing fingers and hoping the model behaves, Promptfoo lets you:
Compare multiple prompts side by side.
Run evaluations across different LLMs (OpenAI, Anthropic, local Ollama models, etc.).
Automate regression testing so future changes don’t break your carefully tuned prompts.
Think of it as your CI/CD pipeline for prompts.
Why Use Promptfoo?
Imagine you’re building an AI assistant that extracts key requirements from a job description. One day, it performs flawlessly. The next, it hallucinates, misses half the details, or outputs inconsistent formatting. Without a structured testing tool, you’ll waste time debugging by trial and error.
Promptfoo gives you a repeatable testing setup: you define your test cases (inputs + expected behavior), run them against your prompt templates, and see metrics that help you refine faster.
Getting Started with Promptfoo
1. Install
Promptfoo is a Node.js package, so you’ll need npm or yarn:
npm install -g promptfoo
Check installation:
promptfoo --version
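If you’d rather start from a scaffold than write every file by hand, promptfoo can also generate a starter config (in the versions I’ve used, this drops a promptfooconfig.yaml into the current directory for you to edit):
# optional: scaffold a starter configuration
promptfoo init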
2. Define Your Prompt Template
Let’s say we want to summarize text into 3 key bullet points. Create a file named prompt.yaml:
prompts:
  - name: "Bullet Summary"
    prompt: |
      Summarize the following text into 3 clear bullet points:

      {{text}}
Here, {{text}} is a placeholder for input data.
3. Write Test Cases
Now, create a promptfoo.yaml config file with sample inputs and expected outputs:
providers:
  - openai:gpt-4o-mini

tests:
  - vars:
      text: "Artificial Intelligence is transforming healthcare by enabling faster diagnosis, personalized treatment, and predictive analytics."
    assert:
      - type: contains
        value: "faster diagnosis"
      - type: contains
        value: "personalized treatment"
      - type: contains
        value: "predictive analytics"
  - vars:
      text: "Docker helps developers package applications with all dependencies and ship them consistently across environments."
    assert:
      - type: contains
        value: "dependencies"
      - type: contains
        value: "environments"
Here’s what’s happening:
providers → which LLM to use (OpenAI, Anthropic, Ollama, etc.).
vars → test inputs.
assert → conditions the output must meet.
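One detail this config leaves implicit is which prompt it runs against. The simplest wiring I’ve found is to define the prompt inline in the same file; referencing the separate prompt.yaml via a file:// path also works, though the exact reference syntax can vary by version. A minimal inline sketch:
prompts:
  - "Summarize the following text into 3 clear bullet points:\n\n{{text}}"

providers:
  - openai:gpt-4o-mini

# tests: ... same as above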
4. Run the Tests
Run the following command:
promptfoo eval
You’ll see a neat table of results: ✅ passed assertions vs. ❌ failed ones.
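If you want to keep results around (say, to compare runs over time), eval can also write them to a file; in the versions I’ve used, the output format follows the file extension:
# save results in addition to the terminal summary (JSON here; CSV and HTML have also worked for me)
promptfoo eval --output results.json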
5. Compare Prompts or Models
Suppose you want to test whether GPT-4o mini or a local Llama 3 model via Ollama performs better. Just add them both under providers:
providers:
  - openai:gpt-4o-mini
  - ollama:llama3
Promptfoo will run all tests against both models and generate comparison scores.
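To keep the comparison from being noisy, I also like to pin generation settings where the provider supports it; providers can be written as objects with an id and a config block. A sketch (treat the exact config keys as assumptions and check the provider docs for your model):
providers:
  - id: openai:gpt-4o-mini
    config:
      temperature: 0   # assumption: a low temperature keeps string assertions from flaking
  - ollama:llama3      # left on defaults here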
Example: Job Requirement Extractor
Let’s go back to our recruiter scenario. We want a prompt that extracts skills from job descriptions.
prompts:
  - name: "Skill Extractor"
    prompt: |
      Extract the key skills required from this job description.
      Return them as a comma-separated list.

      Job Description: {{jd}}
And test cases:
tests:
  - vars:
      jd: "We are looking for a Python developer with experience in Django, REST APIs, and AWS."
    assert:
      - type: contains
        value: "Python"
      - type: contains
        value: "Django"
      - type: contains
        value: "REST"
      - type: contains
        value: "AWS"
Run the evaluation:
promptfoo eval
Now, instead of hoping the AI extracts the right skills, you’ve got automated checks that catch it the moment it doesn’t.
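This is also where the “CI/CD pipeline for prompts” idea becomes concrete: run the eval on every change, so a prompt tweak that quietly breaks extraction fails the build. A minimal GitHub Actions sketch (the workflow file name and secret name are my own placeholders; in my runs a failed assertion makes promptfoo eval exit non-zero, which is what fails the job):
# .github/workflows/prompt-tests.yml (hypothetical)
name: prompt-tests
on: [pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npx promptfoo@latest eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}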
Adding Human Evaluation
Not everything can be auto-tested. Sometimes, the “best” output is subjective (e.g., tone, creativity). Promptfoo supports human-in-the-loop evaluation: it’ll generate results and let you score them manually inside a review dashboard.
This is especially useful for marketing copy, long summaries, or conversational agents.
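In practice, the manual pass happens in promptfoo’s local web UI: run the eval, open the viewer, and rate outputs by hand (the exact rating controls may differ between versions):
# run the eval, then open the local review dashboard in your browser
promptfoo eval
promptfoo view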
Closing Thoughts
For me, discovering Promptfoo was like moving from “cowboy coding” to “professional engineering” for prompts. It bridges the gap between experimentation and reliability.
If you’re serious about building production-grade AI systems, treat your prompts like code—version them, test them, benchmark them. Promptfoo makes that possible.
👉 Next time you tweak a prompt, don’t just eyeball the output. Run it through Promptfoo and let the results speak.