
Evaluating Generative AI Models and Agents in Azure AI Foundry

Azure AI Foundry provides a built-in evaluation system that helps you systematically test how well your generative AI models and agents perform, and how safe their outputs are, using your own datasets or synthetically generated data. It lets you run automated evaluations at scale, view detailed metrics, and manage evaluators so you can improve your app with clear evidence instead of guesswork.​

What evaluation in Foundry does

Evaluation in Azure AI Foundry means sending a test dataset through your model or agent and then scoring the outputs with a set of metrics. These metrics can cover quality (how good and accurate the responses are) and safety (whether the content is harmful or risky), and can be run in bulk for many rows of test data.​

You can evaluate:

  • A model or agent directly, by providing input data and letting Foundry generate outputs to score.

  • A dataset that already contains model or agent outputs, if you have run them before and just want to compute metrics on the stored responses (a minimal code sketch of this flow follows below).
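
If you prefer working in code, the azure-ai-evaluation Python SDK exposes the same idea: point it at a test file and a set of evaluators, and it scores every row. The following is a minimal sketch, assuming a hypothetical eval_data.jsonl file that already contains query and response columns and an Azure OpenAI chat deployment to act as the judge; the endpoint, key, and deployment name are placeholders, not values from this article.

from azure.ai.evaluation import evaluate, RelevanceEvaluator, CoherenceEvaluator

# Hypothetical judge-model configuration; fill in your own Azure OpenAI details.
model_config = {
    "azure_endpoint": "https://<your-openai-resource>.openai.azure.com",
    "api_key": "<api-key>",
    "azure_deployment": "gpt-4o",
}

# Every row of eval_data.jsonl is scored by each evaluator in this dict.
result = evaluate(
    data="eval_data.jsonl",  # JSONL with "query" and "response" columns (assumed)
    evaluators={
        "relevance": RelevanceEvaluator(model_config=model_config),
        "coherence": CoherenceEvaluator(model_config=model_config),
    },
)
print(result["metrics"])  # aggregate scores across all rows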


Prerequisites and data formats

To use evaluation, you need a test dataset in CSV or JSON Lines (JSONL) format, with columns for prompts, responses, and sometimes ground-truth answers. For AI-assisted quality metrics (where another GPT model acts as a judge), you also need an Azure OpenAI connection with a deployment of GPT‑3.5, GPT‑4, or Davinci.​
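
As a concrete illustration, the short Python snippet below writes a tiny JSONL test file in this shape; the column names query, response, and ground_truth are example choices rather than required names, since you map columns to metric fields later in the wizard.

import json

# Hypothetical test rows; "query", "response", and "ground_truth" are example column names.
rows = [
    {"query": "What is the capital of France?",
     "response": "Paris is the capital of France.",
     "ground_truth": "Paris"},
    {"query": "Who wrote Hamlet?",
     "response": "Hamlet was written by William Shakespeare.",
     "ground_truth": "William Shakespeare"},
]

# JSONL means one JSON object per line.
with open("eval_data.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")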

Foundry supports:

  • Uploading your own datasets from local files.

  • Selecting existing datasets already stored in the project.

  • Generating synthetic datasets, when enabled in your region, by describing the kind of test data you want and how many rows you need.​

Ways to start an evaluation

You can start an evaluation in multiple places in the Azure AI Foundry portal. The main entry points are:​

  • Evaluation page: Go to Evaluation → Create (or “Create a new evaluation”) to launch a guided wizard and choose whether you are testing a model, an agent, or a dataset.​

  • Model catalog: Open a model, go to the Benchmarks tab, and choose “Try with your own data” to start an evaluation for that specific model.​

  • Playgrounds: From a model or agent playground, select Evaluation → Create or Metrics → Run full evaluation to evaluate what you’ve been prototyping.​

In all cases, the wizard walks you through selecting the evaluation target, dataset, metrics, and data mapping before you submit the run.​

Choosing the evaluation target

When you begin from the Evaluation page, you first pick what you want to evaluate. The supported targets are:​

  • Model: Sends your dataset’s inputs to a chosen model and evaluates its generated responses.

  • Agent: Similar to model evaluation, but the target is an agent that might orchestrate tools and reasoning steps.

  • Dataset: Does not call a model or agent; it assumes your dataset already contains outputs and just calculates metrics on those output columns.​

Choosing the right target ensures the evaluation is aligned with how your application actually behaves in production.​
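
In SDK terms (again using azure-ai-evaluation purely as an illustration), the difference between these targets comes down to whether you pass a callable for the system under test; the my_app function and file names below are assumptions.

from azure.ai.evaluation import evaluate, RelevanceEvaluator

model_config = {
    "azure_endpoint": "https://<your-openai-resource>.openai.azure.com",
    "api_key": "<api-key>",
    "azure_deployment": "gpt-4o",
}

def my_app(query: str) -> dict:
    # Hypothetical wrapper: call your model or agent with `query` and return the fields to score.
    answer = "..."  # e.g. a chat completion or an agent run
    return {"response": answer}

# Model/Agent target: inputs come from the dataset, my_app generates the responses.
# (Some SDK versions need an explicit column mapping here; see the data-mapping sketch later.)
live_run = evaluate(
    data="prompts_only.jsonl",  # assumed to contain only a "query" column
    target=my_app,
    evaluators={"relevance": RelevanceEvaluator(model_config=model_config)},
)

# Dataset target: no callable at all; the stored "response" column is scored as-is.
stored_run = evaluate(
    data="eval_data.jsonl",
    evaluators={"relevance": RelevanceEvaluator(model_config=model_config)},
)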


Working with datasets

For model or agent targets, your dataset acts as the set of prompts or inputs that will be sent to them for evaluation. You can:​

  • Add a new dataset: Upload CSV or JSONL containing inputs and, optionally, expected outputs (ground truth).

  • Choose existing dataset: Reuse a dataset you already imported or generated in the project.

  • Generate synthetic dataset: Ask Foundry to generate test rows from a prompt description and a chosen model resource, optionally enhancing realism with uploaded files.​

Synthetic dataset generation is only available in regions that support the Azure OpenAI Responses API, and may not appear in all regions.​
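
Foundry handles synthetic generation for you in the portal, but as a rough local illustration of the idea, the sketch below asks a deployed chat model to invent test questions with the openai package and saves them in the JSONL shape used above. This is a stand-in for the concept, not the Foundry feature itself, and the deployment name, prompt, and API version are assumptions.

import json
from openai import AzureOpenAI

# Hypothetical Azure OpenAI connection details.
client = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-06-01",
)

completion = client.chat.completions.create(
    model="gpt-4o",  # your deployment name
    messages=[{
        "role": "user",
        "content": "Generate 5 realistic customer questions for a travel-booking "
                   "assistant. Return them as a JSON array of strings.",
    }],
)

# In practice you may need to strip code fences from the reply before parsing.
questions = json.loads(completion.choices[0].message.content)

with open("synthetic_prompts.jsonl", "w", encoding="utf-8") as f:
    for q in questions:
        f.write(json.dumps({"query": q}) + "\n")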

Types of evaluation metrics

Azure AI Foundry groups metrics into three main categories, plus custom evaluators that you define yourself.

  1. AI quality (AI assisted)

Uses an LLM (for example, GPT‑4) as a “judge” to score qualities like groundedness, relevance, coherence, fluency, and GPT similarity.​

Requires a model deployment to act as the evaluator and is well suited for nuanced text quality judgments that are hard to capture with simple math.​

  2. AI quality (NLP metrics)

Uses classical NLP metrics like F1, ROUGE, BLEU, GLEU, and METEOR to compare model outputs against ground-truth references.​

Usually requires a “ground truth” column in your dataset for each row so metrics can compute overlap or similarity.​

  3. Risk and safety metrics

Targets harmful or unsafe content types: self-harm, hateful or unfair content, violent content, sexual content, protected material, and indirect attacks.

Foundry automatically provisions a safety model to analyze responses and produce severity scores and rationales without requiring you to deploy your own judge model.​

  4. Custom evaluators

You can define your own evaluators to measure custom attributes like “friendliness” or “brand tone” using code or prompt-based definitions (a combined code sketch covering all four categories follows after this list).

Prompt-based evaluators use Prompty files (.prompty) that define the model, inputs, and scoring logic, and they then appear alongside Microsoft-curated evaluators in the Evaluator library.​
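
To make the categories concrete, here is a minimal azure-ai-evaluation sketch that mixes one evaluator of each flavor. The FriendlinessEvaluator class, the file name, and all resource identifiers are illustrative assumptions, and the safety evaluator's constructor arguments may vary slightly between SDK versions.

from azure.identity import DefaultAzureCredential
from azure.ai.evaluation import (
    evaluate,
    GroundednessEvaluator,  # 1. AI quality (AI assisted): needs a judge model
    F1ScoreEvaluator,       # 2. AI quality (NLP): needs ground truth
    ViolenceEvaluator,      # 3. Risk and safety: needs an Azure AI project
)

model_config = {
    "azure_endpoint": "https://<your-openai-resource>.openai.azure.com",
    "api_key": "<api-key>",
    "azure_deployment": "gpt-4o",
}

# Hypothetical project reference used by the safety evaluator.
azure_ai_project = {
    "subscription_id": "<subscription-id>",
    "resource_group_name": "<resource-group>",
    "project_name": "<project-name>",
}

# 4. A simple code-based custom evaluator: any callable that returns a dict of scores.
class FriendlinessEvaluator:
    def __call__(self, *, response: str, **kwargs):
        # Toy heuristic for illustration only.
        return {"friendliness": 1.0 if "please" in response.lower() else 0.0}

result = evaluate(
    data="eval_data.jsonl",  # assumed to include query, response, context, and ground_truth columns
    evaluators={
        "groundedness": GroundednessEvaluator(model_config=model_config),
        "f1": F1ScoreEvaluator(),
        "violence": ViolenceEvaluator(
            azure_ai_project=azure_ai_project,
            credential=DefaultAzureCredential(),
        ),
        "friendliness": FriendlinessEvaluator(),
    },
)
print(result["metrics"])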

Data mapping requirements

Before running the evaluation, you must map columns from your dataset to the fields each metric expects, such as Query, Response, Context, and Ground truth. Foundry tries to auto-map fields based on column names, but you can adjust the mappings to make sure they are correct.

Some examples from the metric requirements table:​

  1. Groundedness, Coherence, Fluency, and Relevance require a Query and a Response, and Relevance also requires Context.

  2. GPT-similarity and the NLP metrics (F1, BLEU, GLEU, METEOR, ROUGE) require a Response and a Ground truth.​

  3. Safety metrics like Self-harm, Hateful and unfair content, and Sexual content require a Query and Response but no Ground truth.​

Accurate mapping is critical, because incorrect mappings can make your evaluation scores meaningless or misleading. The sketch below shows what the same mapping looks like in code.
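
In the azure-ai-evaluation SDK, the mapping step appears as an evaluator_config entry with a column_mapping per evaluator. In the assumed example below, the dataset's columns are named question, answer, and source_docs rather than the default query, response, and context; the exact configuration keys can differ between SDK versions.

from azure.ai.evaluation import evaluate, GroundednessEvaluator

model_config = {
    "azure_endpoint": "https://<your-openai-resource>.openai.azure.com",
    "api_key": "<api-key>",
    "azure_deployment": "gpt-4o",
}

result = evaluate(
    data="support_tickets.jsonl",  # hypothetical file with question/answer/source_docs columns
    evaluators={"groundedness": GroundednessEvaluator(model_config=model_config)},
    evaluator_config={
        "groundedness": {
            # Map the evaluator's expected fields to this dataset's column names.
            "column_mapping": {
                "query": "${data.question}",
                "response": "${data.answer}",
                "context": "${data.source_docs}",
            },
        },
    },
)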

Running and reviewing an evaluation

Once you’ve selected the target, dataset, metrics, and mappings, you optionally name your evaluation and then submit it. The evaluation runs in the background, scoring each row of the dataset with the chosen metrics and storing results in the project’s storage.​

After it finishes, you can:

  1. View overall metrics and per-row details on the evaluation results page.

  2. Inspect specific examples where the model or agent performed poorly or produced risky content.

  3. Use these insights to refine prompts, adjust model settings, retrain, or change guardrails and policies.​

For model evaluations that use generated sample questions, the dataset that Foundry creates is saved to the project’s blob storage so you can reuse it or compare across runs.​
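
If you ran the evaluation from the SDK instead of the portal, the object returned by evaluate() supports the same kind of review: it contains aggregate metrics plus per-row scores you can filter. A small sketch, assuming result comes from one of the earlier snippets and that pandas is installed; the score column name and threshold are assumptions to adjust for your own run.

import pandas as pd

print(result["metrics"])             # aggregate scores across the whole dataset

rows = pd.DataFrame(result["rows"])  # one row of inputs, outputs, and scores per dataset row
print(rows.columns.tolist())         # check the exact column names produced by your run

# Example triage: rows the judge scored low on groundedness.
score_col = "outputs.groundedness.groundedness"  # assumed name; adjust to the printout above
print(rows[rows[score_col] < 3].head())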


Evaluator library and versioning

The Evaluator library is the central place to browse all Microsoft-curated and custom evaluators available in your project. It shows details such as the evaluator’s name, description, parameters, and associated files like Prompty definitions.​

You can:

  • See how Microsoft-curated evaluators are defined, including prompts for quality graders and definitions for safety metrics.

  • Manage versions of evaluators, compare versions, and roll back to earlier ones as your evaluation strategy evolves.​

This library makes it easier to standardize how teams measure quality and safety across multiple apps and agents.

Special notes and limitations

There are a few important caveats when using evaluations in Azure AI Foundry, especially for users who migrated from the older Azure OpenAI portal.​

  • Evaluations created directly through the Azure OpenAI API on oai.azure.com are not visible in the Azure AI Foundry portal (ai.azure.com); you still need to return to the old portal to see them.

  • You cannot run evaluations inside Foundry through the Azure OpenAI API itself, but you can use the built-in evaluators from Foundry’s dataset evaluation flow.

  • Fine-tuned model evaluation is not supported if the deployment was migrated from Azure OpenAI to Foundry.

  • For bring-your-own storage, the storage account must be added to the Foundry account, use Microsoft Entra ID authentication, and grant the project access via the Azure portal; otherwise you may see service errors.

By combining these capabilities (dataset management, rich metrics, safety checks, and reusable evaluators), Azure AI Foundry gives you a structured way to measure, compare, and improve generative AI applications before and after they go into production.