Prompt Engineering  

Exploring Promptfoo: Testing and Evaluating LLM Prompts Like a Pro

When I first started experimenting with large language models (LLMs), I quickly realized something: writing a “good prompt” is only half the battle. The real challenge is making sure your prompts consistently give reliable, unbiased, and cost-effective results.

That’s where Promptfoo comes into play. It’s an open-source tool designed to help developers test, evaluate, and benchmark prompts systematically—almost like unit tests for your LLM workflows.

Instead of crossing your fingers and hoping the model behaves, Promptfoo lets you:

  • Compare multiple prompts side by side.

  • Run evaluations across different LLMs (OpenAI, Anthropic, local Ollama models, etc.).

  • Automate regression testing so future changes don’t break your carefully tuned prompts.

Think of it as your CI/CD pipeline for prompts.

Why Use Promptfoo?

Imagine you’re building an AI assistant that extracts key requirements from a job description. One day, it performs flawlessly. The next, it hallucinates, misses half the details, or outputs inconsistent formatting. Without a structured testing tool, you’ll waste time debugging by trial and error.

Promptfoo gives you a repeatable testing setup: you define your test cases (inputs + expected behavior), run them against your prompt templates, and see metrics that help you refine faster.

Getting Started with Promptfoo

1. Install

Promptfoo is a Node.js package, so you’ll need npm or yarn:

npm install -g promptfoo

Check installation:

promptfoo --version
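If you’d rather not write the configuration from scratch (we’ll build one by hand below), promptfoo init scaffolds a sample config in the current directory that you can adapt:

promptfoo init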

2. Define Your Prompt Template

Let’s say we want to summarize text into 3 key bullet points. Create a plain-text file named prompt.txt containing the prompt template:

Summarize the following text into 3 clear bullet points:
{{text}}

Here, {{text}} is a placeholder that Promptfoo fills in with your test inputs.

3. Write Test Cases

Now, create a config file named promptfooconfig.yaml (the default file promptfoo eval looks for). Point it at your prompt and add sample inputs along with the conditions their outputs must meet:

prompts:
  - file://prompt.txt
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      text: "Artificial Intelligence is transforming healthcare by enabling faster diagnosis, personalized treatment, and predictive analytics."
    assert:
      - type: contains
        value: "faster diagnosis"
      - type: contains
        value: "personalized treatment"
      - type: contains
        value: "predictive analytics"
  - vars:
      text: "Docker helps developers package applications with all dependencies and ship them consistently across environments."
    assert:
      - type: contains
        value: "dependencies"
      - type: contains
        value: "environments"

Here’s what’s happening:

  • prompts → the prompt template(s) being evaluated.

  • providers → which LLM(s) to use (OpenAI, Anthropic, Ollama, etc.).

  • vars → test inputs substituted into the template.

  • assert → conditions the output must meet (a few more assertion types are sketched below).
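Exact contains checks are handy, but they’re brittle: the model might write “personalised treatment” or rephrase a term entirely. Promptfoo supports a range of assertion types beyond contains. As a rough sketch, icontains does case-insensitive matching and contains-any passes if any one of several phrasings appears (the alternative phrasings below are my own guesses at likely wordings):

assert:
  - type: icontains
    value: "faster diagnosis"
  - type: contains-any
    value:
      - "personalized treatment"
      - "personalised treatment"
      - "tailored treatment"

There are also regex, equals, and model-graded assertions if plain substring checks aren’t enough.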

4. Run the Tests

Make sure your OPENAI_API_KEY environment variable is set, then run the evaluation from the directory containing promptfooconfig.yaml:

promptfoo eval

You’ll see a neat table of results: ✅ passed assertions vs. ❌ failed ones.
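If you prefer a richer view than the terminal table, promptfoo view opens a local web UI where you can browse and compare the outputs:

promptfoo view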

5. Compare Prompts or Models

Suppose you want to test whether GPT-4o mini or a local Llama 3 model served via Ollama performs better. Just add them both under providers:

providers:
  - openai:gpt-4o-mini
  - ollama:llama3

Promptfoo will run all tests against both models and generate comparison scores.
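The same idea works for comparing prompts. prompts is just a list, so you can put an alternative wording next to the original (the second phrasing below is invented for illustration) and Promptfoo will evaluate every test case against every combination of prompt and provider:

prompts:
  - file://prompt.txt
  - |
    You are a careful editor. Condense the following text into exactly 3 short bullet points:
    {{text}}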

Example: Job Requirement Extractor

Let’s go back to our recruiter scenario. We want a prompt that extracts skills from job descriptions.

prompts:
  - |
    Extract the key skills required from this job description.
    Return them as a comma-separated list.
    Job Description: {{jd}}

And test cases:

tests:
  - vars:
      jd: "We are looking for a Python developer with experience in Django, REST APIs, and AWS."
    assert:
      - type: contains
        value: "Python"
      - type: contains
        value: "Django"
      - type: contains
        value: "REST"
      - type: contains
        value: "AWS"

Run the evaluation again:

promptfoo eval

Now, instead of hoping the AI extracts the right skills, you have a repeatable, structured check that fails loudly whenever it doesn’t.
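For completeness, here’s what the whole promptfooconfig.yaml for this recruiter example could look like, with the prompt, provider, and tests in one place:

prompts:
  - |
    Extract the key skills required from this job description.
    Return them as a comma-separated list.
    Job Description: {{jd}}
providers:
  - openai:gpt-4o-mini
tests:
  - vars:
      jd: "We are looking for a Python developer with experience in Django, REST APIs, and AWS."
    assert:
      - type: contains
        value: "Python"
      - type: contains
        value: "Django"
      - type: contains
        value: "REST"
      - type: contains
        value: "AWS"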

Adding Human Evaluation

Not everything can be auto-tested. Sometimes the “best” output is subjective (e.g., tone, creativity). Promptfoo supports human-in-the-loop evaluation: it generates the results, and you can then review and rate them manually in the web viewer (promptfoo view).

This is especially useful for marketing copy, long summaries, or conversational agents.
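A useful middle ground between fully automatic checks and manual review is a model-graded assertion. Promptfoo’s llm-rubric type asks an LLM to grade the output against a rubric you write in plain English; the rubric text below is just an illustration of the kind of criteria you might use:

assert:
  - type: llm-rubric
    value: "The output is written in a friendly, professional tone and stays under 100 words."

Keep in mind that the grader is itself an LLM, so it’s worth spot-checking its judgments before trusting it in a pipeline.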

Closing Thoughts

For me, discovering Promptfoo was like moving from “cowboy coding” to “professional engineering” for prompts. It bridges the gap between experimentation and reliability.

If you’re serious about building production-grade AI systems, treat your prompts like code—version them, test them, benchmark them. Promptfoo makes that possible.
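One way to make “treat your prompts like code” concrete is to run the eval in CI. Below is a minimal sketch of a GitHub Actions workflow, assuming the config lives in the repository root; the workflow name and Node version are arbitrary, and Promptfoo’s docs describe a dedicated GitHub Action that may suit you better:

name: prompt-evals
on: [push, pull_request]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      # Runs promptfoo against promptfooconfig.yaml in the repo root.
      # Check your promptfoo version's docs for how assertion failures
      # affect the exit code if you want failures to break the build.
      - run: npx promptfoo@latest eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}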

👉 Next time you tweak a prompt, don’t just eyeball the output. Run it through Promptfoo and let the results speak.