AI Agents  

Measure Agent Quality and Safety with Azure AI Evaluation SDK and Azure AI Foundry

A practical evaluation pipeline for GraphRAG agents with quality metrics, safety scans, and observable runs.

Introduction

In Part 4, we orchestrated multiple agents. This article (part 5) answers a harder question: can we prove that the system is reliable enough for production workloads?

For AI Engineers, answer quality alone is not enough. You also need:

  • Repeatable quality checks before release.

  • Safety evidence for security and compliance reviews.

  • Traceability when behavior changes after prompt, model, or tool updates.

This part adds an evaluation module under src/evaluation with three goals:

  • Quality: task completion, intent resolution, tool-call behavior, graph-grounded correctness.

  • Safety: adversarial probing with red team strategies and risk categories.

  • Observability: telemetry and artifacts that support debugging and regression analysis.

How the three goals are measured

GoalPrimary signalsCurrent evidence in this article
Qualitytask_adherence, intent_resolution, relevance, coherence, response_completenessFoundry quality snapshot (March 2026): 80% task adherence, 100% on other quality signals
SafetyRed team attack outcomes by risk category and strategyRed team run section and Foundry safety screenshots
ObservabilityPrompt/completion token usage, OTel traces, local artifactsQuality snapshot token counters (85,686 prompt / 5,048 completion) + OTel and JSON report references

Quality and safety runs can also be exported to Azure AI Foundry, so teams review outcomes in shared dashboards instead of only local JSON artifacts.

Why This Matters for AI Engineers

When you ship agent systems, every change can alter behavior: prompts, model versions, tool schemas, and data.

Engineering scenarioTypical failure modeWhy this pipeline helps
Prompt or model updateFluent but lower-quality answersBatch baselines expose quality regressions before release
Tool contract changesWrong tool or wrong argumentsTool-call evaluators detect routing and schema drift
Knowledge graph refreshUnsupported entities/relationships in answersCustom graph evaluators detect grounding errors
Safety hardeningUnknown risk exposure under adversarial inputsRed team runs provide repeatable safety evidence
Incident debuggingHard to explain why behavior changedOTel traces and result artifacts reduce investigation time

What You Build

LayerComponentPurpose
Datasetgolden_questions.jsonlControlled test set with expected outcomes
Generationgenerate_eval_data.pyRuns the agent and writes evaluation data
Quality Evalrun_batch_evaluation.pyRuns built-in and custom evaluators
Safety Evalrun_redteam.pyRuns red team scans against the agent/model
Reportingevaluation_report.md, JSON outputs, Foundry run linksHuman and machine-readable results

Module Layout

src/evaluation/
├── config.py
├── datasets/
│   ├── golden_questions.jsonl
│   └── eval_data.jsonl
├── evaluators/
│   ├── builtin.py
│   ├── entity_accuracy.py
│   └── relationship_validity.py
├── monitoring/
│   └── otel_setup.py
├── results/
└── scripts/
    ├── generate_eval_data.py
    ├── run_batch_evaluation.py
    └── run_redteam.py

What Is Evaluated, and Why

Built-in quality evaluators

EvaluatorWhat it checksWhy it matters
TaskAdherenceEvaluatorDoes the response complete the requested task?Detects incomplete or off-target answers
IntentResolutionEvaluatorDoes the response resolve user intent?Detects responses that are fluent but irrelevant
RelevanceEvaluatorIs the response relevant to the query?Detects answers that drift away from the user request
CoherenceEvaluatorIs the response logically consistent?Detects contradictions and weak reasoning flow
ResponseCompletenessEvaluatorDoes the response cover expected content?Detects partial answers against expected coverage

Built-in tool-behavior evaluator (conditional)

EvaluatorWhat it checksWhy it matters
ToolCallAccuracyEvaluatorWere tools/arguments appropriate?Detects wrong routing, wrong parameters, unnecessary calls

ToolCallAccuracyEvaluator is included when structured tool_call payloads exist in eval_data.jsonl.

Custom graph evaluators

EvaluatorWhat it checksWhy it matters
EntityAccuracyEvaluatorMentioned entities exist in Parquet graph dataDetects unsupported entities and weak grounding
RelationshipValidityEvaluatorCo-mentioned entity pairs match graph relationshipsDetects fabricated links between entities

Safety scanning

ComponentWhat it checksWhy it matters
RedTeam scanAttack outcomes by risk category and strategyProduces safety evidence and failure patterns

Pipeline Steps

Step 1: Start MCP Server

poetry run python run_mcp_server.py

Step 2: Generate Evaluation Data

poetry run python -m evaluation.scripts.generate_eval_data

This runs the Knowledge Captain against 10 golden questions and writes eval_data.jsonl.

Step 3: Run Batch Evaluation

poetry run python -m evaluation.scripts.run_batch_evaluation

Optional variants:

# Skip custom graph evaluators
poetry run python -m evaluation.scripts.run_batch_evaluation --no-custom

# Publish quality run to New Foundry (dashboard + report URL)
poetry run python -m evaluation.scripts.run_batch_evaluation --foundry
02-quality-run-details

Batch quality run in Azure AI Foundry, including evaluator metrics and row-level evidence.

Step 4: Run Red Team Scan

poetry run python -m evaluation.scripts.run_redteam --flow cloud-model

Use cloud-model as default for predictable behavior.

03-redteam-run-details

Red team safety run in Azure AI Foundry, showing risk-category outcomes and attack results.

Where Azure AI Foundry Fits

Part 5 uses Azure AI Foundry as the shared visualization layer for evaluation operations:

  • Step 3 (--foundry): publishes a New Foundry quality run and returns a report URL for dashboard review. The default Foundry quality set emphasizes semantic signals (relevancecoherenceresponse_completeness) plus agent checks (task_adherenceintent_resolution).

  • Step 4 (run_redteam): runs the red team scan and publishes a New Foundry reference run for safety visibility.

  • Custom graph evaluators: execute in the same batch pipeline and are persisted in local artifacts (evaluation_results.jsonevaluation_report.md) that are reviewed alongside Foundry run links.

By design, lexical overlap metrics such as F1 are not the default in Foundry export for this agent workflow, because they can under-score correct but paraphrased answers.

This gives one operational workflow: Foundry for centralized run visibility, local custom metrics for graph-specific grounding checks.

Latest Foundry quality snapshot (March 2026)

Most recent quality run summary (10 rows):

MetricScoreRows
Task adherence80%8/10
Intent resolution100%10/10
Relevance100%10/10
Coherence100%10/10
Response completeness100%10/10
Prompt tokens85,686-
Completion tokens5,048-

How to interpret this snapshot:

  • Semantic quality is stable across the full set.

  • task_adherence is the primary optimization target.

  • ToolCallAccuracyEvaluator is emitted only when eval_data.jsonl includes structured tool_call payloads.

01-evaluations-list

Azure AI Foundry evaluations list used as the central run registry for Part 5.

Key Snippets That Matter

1. Message conversion (MAF to evaluator schema)

The SDK expects OpenAI-style tool_call and tool_result. MAF internally uses function messages.

from evaluation.evaluators.builtin import convert_to_evaluator_messages

messages = convert_to_evaluator_messages(all_msgs)

Without this conversion, tool-focused evaluators are unreliable.

2. Correct evaluator_config mapping shape

evaluator_config = {
    "task_adherence": {
        "column_mapping": {
            "query": "${data.query}",
            "response": "${data.response}",
        }
    }
}

Flattened mappings break field binding.

3. Deployment compatibility guard

Some deployments reject max_tokens and require max_completion_tokens.

if "intent_resolution" in evaluators and not _supports_legacy_max_tokens(config):
    evaluators.pop("intent_resolution", None)

This keeps the run operational while preserving the rest of the evaluation set.

4. Red team semantic success guard

total_attacks = _extract_total_evaluated_attacks(result_payload)
if total_attacks == 0:
    raise RuntimeError("Red team scan completed but produced zero evaluated attacks.")

This prevents false-success runs in unsupported regions.

Interpreting Results for Release Decisions

Do not treat one score as the whole truth. Use a small gate matrix.

04-release-gate-view

Foundry evaluation evidence used as release-gate input, not only as post-run reporting.

DimensionSignal to watchPractical release question
Task qualityTaskAdherence, IntentResolutionAre answers complete and aligned with user goal?
Tool behaviorToolCallAccuracyIs orchestration stable after changes?
Graph groundingEntityAccuracy, RelationshipValidityAre claims supported by the knowledge graph?
SafetyRed team ASR and risk outcomesDid risk exposure improve, regress, or stay flat?
TraceabilityOTel traces + run artifactsCan we explain failures quickly?

Recommended practice:

  • Compare against previous baseline, not isolated absolute values.

  • Block release on clear regressions in critical metrics.

  • Keep known exceptions documented and time-boxed.

  • Treat missing ToolCallAccuracy in a run as dataset-shape not-applicable (no structured tool calls), not as an automatic failure.

Common Failure Patterns

PatternTypical causeMitigation
Missing or wrong evaluator columnsIncorrect evaluator_config shapeUse nested column_mapping
Intermittent evaluator failureDeployment incompatibility for token paramsUse evaluator-only deployment override
Red team run shows 0/0Region capability mismatchMove Foundry project to supported region
Good text but poor groundingResponse not constrained by graph evidenceAdd graph checks and update prompts
Hard-to-debug regressionsMissing traces/artifactsKeep OTel + result JSON in every run

Environment Variables

VariableRequiredPurpose
AZURE_OPENAI_ENDPOINTYesAzure OpenAI endpoint
AZURE_OPENAI_API_KEYYesAzure OpenAI key
AZURE_OPENAI_CHAT_DEPLOYMENTNoDefault evaluator/chat deployment
AZURE_OPENAI_EVAL_CHAT_DEPLOYMENTNoEvaluator-only deployment override
AZURE_AI_PROJECTStep 4New Foundry project endpoint
APPLICATIONINSIGHTS_CONNECTION_STRINGNoProduction telemetry sink
OTEL_TRACING_ENDPOINTNoLocal OTLP endpoint

Recommended Foundry Screenshots

  • Evaluations list view for portfolio-level evidence.

  • Batch quality run details with summary and row-level metrics.

  • Red team run details with risk categories and outcomes.

  • Release-gate evidence view for decision making.

  • Prefer detailed run views over reduced pages that only show token counts.

Validation Snapshot

Current module tests:

FileTests
tests/evaluation/test_config.py15
tests/evaluation/test_builtin_evaluators.py21
tests/evaluation/test_custom_evaluators.py14
tests/evaluation/test_monitoring.py6
tests/evaluation/test_run_redteam.py7
Total63

Why This Part Is a Milestone

Part 5 turns the project from a functional demo into an engineering-grade evaluable system.

  • You can compare behavior over time.

  • You can detect regressions before production.

  • You can produce safety evidence in a repeatable way.

  • You can keep a traceable path from query to metric.

This is the baseline for Part 6 (Human-in-the-Loop) and production quality gates.

Key Takeaways

  • Agent quality in production is not a single score. You need quality, safety, and traceability together.

  • Built-in evaluators and custom graph evaluators solve different problems and should be used as a combined gate.

  • Azure AI Foundry gives shared visibility for runs, while local artifacts preserve GraphRAG-specific evidence.

  • Missing ToolCallAccuracy is often a dataset-shape condition, not automatically a regression.

  • Red team outcomes should be treated as release evidence, not as an optional post-check.

What's Next

In Part 6, we will add Human-in-the-Loop controls to the same pipeline:

  • Approval gates for sensitive tool actions.

  • Explicit escalation paths for low-confidence responses.

  • Audit-friendly checkpoints connected to the same evaluation workflow.

Resources