Measuring Retrieval vs. Generation Quality Independently using LangGraph

Tuhin Paul

Introduction

When evaluating a Retrieval-Augmented Generation (RAG) system, engineers often fall into the Monolithic Evaluation Trap. They look at the final answer and ask, "Is this answer correct?"

While this tells you whether the system failed, it doesn't tell you why it failed. Did the retriever fetch the wrong documents? Or did it fetch the right documents, only for the LLM to hallucinate the final answer?

To build production-grade RAG systems, you must decouple retrieval evaluation from generation evaluation. In this end-to-end article, we will explore how to build an independent evaluation pipeline using LangGraph, allowing you to pinpoint exact bottlenecks in a real-world e-commerce customer support use case.

The Real-World Use Case: "ElectroSupport"

Imagine you are the Lead AI Engineer for TechGadgets Inc., an e-commerce retailer. You have deployed a RAG-powered customer support bot called ElectroSupport.

Customers ask complex questions such as:

"What is the return policy for an opened MacBook Pro?"
"How do I factory reset my SmartHome Hub if the LED is blinking red?"

The Problem

Customer satisfaction is dropping. When you look at the logs, some answers are completely wrong.

You need an automated, real-time evaluation pipeline that runs in the background, scoring every interaction to tell you exactly which component is degrading.

The Metrics: Decoupling the Pipeline

To evaluate independently, we divide the metrics into two distinct categories.

1. Retrieval Quality (Did We Get the Right Context?)

Context Precision

Of the documents retrieved, how many are actually relevant to the question?
Measures ranking quality and relevance.

Context Recall

Did the retriever fetch all the information needed to answer the question, as defined by the ground-truth context?
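
As a rough illustration, both retrieval metrics reduce to simple set ratios when relevance labels are available. The sketch below assumes exact matching between retrieved and ground-truth chunks, which is a simplification (the pipeline later in this article uses an LLM judge instead). The function names and chunk IDs are illustrative only.

```python
from typing import List


def context_precision(retrieved: List[str], relevant: List[str]) -> float:
    """Fraction of retrieved chunks that are relevant (0.0 if nothing was retrieved)."""
    if not retrieved:
        return 0.0
    relevant_set = set(relevant)
    hits = sum(1 for chunk in retrieved if chunk in relevant_set)
    return hits / len(retrieved)


def context_recall(retrieved: List[str], relevant: List[str]) -> float:
    """Fraction of relevant chunks that were retrieved (1.0 if nothing was needed)."""
    if not relevant:
        return 1.0
    retrieved_set = set(retrieved)
    hits = sum(1 for chunk in relevant if chunk in retrieved_set)
    return hits / len(relevant)


# Hypothetical chunk IDs standing in for document text.
retrieved = ["return-policy", "warranty", "applecare"]
relevant = ["return-policy", "accessories-required"]

precision = context_precision(retrieved, relevant)  # 1 of 3 retrieved chunks is relevant
recall = context_recall(retrieved, relevant)        # 1 of 2 relevant chunks was retrieved
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here precision is low because two of the three retrieved chunks are noise, while recall is 0.50 because one required chunk is missing entirely.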

2. Generation Quality (Did the LLM Do a Good Job with the Context?)

Faithfulness

Is the generated answer entirely derived from the retrieved context?
Measures hallucination.

Answer Relevancy

Does the final answer directly address the user's question?
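
One common way to make faithfulness concrete is to split the answer into individual claims and count how many of them the context supports. The sketch below assumes the per-claim verdicts already exist (for example, produced by an LLM judge). The `ClaimVerdict` type and the sample claims are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ClaimVerdict:
    claim: str
    supported: bool  # True if the retrieved context backs this claim


def faithfulness(verdicts: List[ClaimVerdict]) -> float:
    """Share of claims supported by the context (1.0 for an answer with no claims)."""
    if not verdicts:
        return 1.0
    return sum(v.supported for v in verdicts) / len(verdicts)


# Hypothetical verdicts for a three-claim answer.
verdicts = [
    ClaimVerdict("The standard warranty lasts 1 year.", supported=True),
    ClaimVerdict("AppleCare+ extends coverage to 3 years.", supported=True),
    ClaimVerdict("Returns are accepted within 30 days.", supported=False),
]
score = faithfulness(verdicts)
print(f"faithfulness={score:.2f}")
```

Note that this score says nothing about whether the context itself was the right one; that question belongs to the retrieval metrics.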

The LangGraph Architecture

LangGraph is ideal for this scenario because evaluation is inherently a stateful, multi-step workflow.

We will build an Evaluation Graph that takes:

User question
Retrieved context
Generated answer
Ground truth context

and routes them through independent LLM-as-a-Judge nodes.

Graph Flow

Input State → Receives question, retrieved_context, generated_answer, and ground_truth_context.
Evaluate Retrieval Node → Calculates Context Precision.
Evaluate Generation Node → Calculates Faithfulness.
Diagnostic Router → Categorizes failures such as Retrieval Failure, Hallucination, or Success.

End-to-End Implementation

Prerequisites

pip install langgraph langchain-openai langchain-core pydantic

Note: Ensure your OPENAI_API_KEY is configured in the environment.

Step 1: Define the State and Prompts

We use Pydantic for structured output so that LLM judges return clean, parseable scores.

from typing import List
from typing_extensions import TypedDict
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langgraph.graph import StateGraph, END

# Initialize LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# --- Pydantic Models for Structured Evaluation ---
class RetrievalScore(BaseModel):
    reasoning: str = Field(description="Brief reasoning for the score")
    precision_score: float = Field(description="Score from 0.0 to 1.0")

class GenerationScore(BaseModel):
    reasoning: str = Field(description="Brief reasoning for the score")
    faithfulness_score: float = Field(description="Score from 0.0 to 1.0")

# --- LangGraph State ---
class EvalState(TypedDict):
    question: str
    ground_truth_context: List[str]
    retrieved_context: List[str]
    generated_answer: str
    retrieval_score: float
    retrieval_reasoning: str
    generation_score: float
    generation_reasoning: str
    diagnosis: str

Step 2: Build the Evaluation Nodes

The retrieval evaluator only examines the question and contexts, while the generation evaluator only examines the retrieved context and generated answer.

Node 1: Evaluate Retrieval (Context Precision)

def evaluate_retrieval(state: EvalState) -> dict:
    prompt = ChatPromptTemplate.from_template(
        """You are an expert evaluator. Evaluate the RETRIEVAL quality.
        Question: {question}
        Ground Truth Context: {ground_truth}
        Retrieved Context: {retrieved}

        Score the precision from 0.0 to 1.0.
        1.0 means all retrieved docs are highly relevant.
        0.0 means none are relevant.
        """
    )

    structured_llm = llm.with_structured_output(RetrievalScore)
    chain = prompt | structured_llm

    result = chain.invoke({
        "question": state["question"],
        "ground_truth": "\n".join(state["ground_truth_context"]),
        "retrieved": "\n".join(state["retrieved_context"])
    })

    return {
        "retrieval_score": result.precision_score,
        "retrieval_reasoning": result.reasoning
    }

Node 2: Evaluate Generation (Faithfulness)

def evaluate_generation(state: EvalState) -> dict:
    prompt = ChatPromptTemplate.from_template(
        """You are an expert evaluator. Evaluate the GENERATION quality.
        Retrieved Context: {retrieved}
        Generated Answer: {answer}

        Score the faithfulness from 0.0 to 1.0.
        1.0 means every claim in the answer is supported by the context.
        0.0 means none of the claims in the answer are supported by the context.
        """
    )

    structured_llm = llm.with_structured_output(GenerationScore)
    chain = prompt | structured_llm

    result = chain.invoke({
        "retrieved": "\n".join(state["retrieved_context"]),
        "answer": state["generated_answer"]
    })

    return {
        "generation_score": result.faithfulness_score,
        "generation_reasoning": result.reasoning
    }

Node 3: Diagnostic Router

def diagnose_pipeline(state: EvalState) -> dict:
    ret_score = state["retrieval_score"]
    gen_score = state["generation_score"]

    # Thresholds for production alerting
    if ret_score < 0.6 and gen_score < 0.6:
        diagnosis = "CRITICAL: Both Retrieval and Generation failed."

    elif ret_score < 0.6:
        diagnosis = (
            "RETRIEVAL FAILURE: Context is irrelevant. "
            "Check chunking, embeddings, or vector DB."
        )

    elif gen_score < 0.6:
        diagnosis = (
            "GENERATION FAILURE: Hallucination detected. "
            "Check LLM prompt or temperature."
        )

    else:
        diagnosis = "SUCCESS: Pipeline performed well."

    return {"diagnosis": diagnosis}
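
Because the diagnostic step is plain Python, its branching can be checked without calling an LLM. The sketch below restates the same thresholds as a standalone function and exercises every branch. The `diagnose` name, the `THRESHOLD` constant, and the shortened labels are introduced here for illustration.

```python
THRESHOLD = 0.6  # same cut-off used in diagnose_pipeline


def diagnose(ret_score: float, gen_score: float, threshold: float = THRESHOLD) -> str:
    """Map a pair of scores to a short diagnosis label."""
    retrieval_ok = ret_score >= threshold
    generation_ok = gen_score >= threshold
    if not retrieval_ok and not generation_ok:
        return "CRITICAL"
    if not retrieval_ok:
        return "RETRIEVAL FAILURE"
    if not generation_ok:
        return "GENERATION FAILURE"
    return "SUCCESS"


# One case per branch, plus a boundary case sitting exactly on the threshold.
cases = [(0.1, 0.2), (0.1, 0.9), (0.9, 0.2), (0.9, 0.9), (0.6, 0.6)]
labels = [diagnose(r, g) for r, g in cases]
for (r, g), label in zip(cases, labels):
    print(f"retrieval={r:.1f} generation={g:.1f} -> {label}")
```

A score exactly at the threshold counts as a pass, matching the strict `<` comparisons in the node.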

Step 3: Compile the LangGraph

Now we connect the nodes into a linear Directed Acyclic Graph (DAG): retrieval evaluation runs first, then generation evaluation, then the diagnosis.

workflow = StateGraph(EvalState)

# Add Nodes
workflow.add_node("evaluate_retrieval", evaluate_retrieval)
workflow.add_node("evaluate_generation", evaluate_generation)
workflow.add_node("diagnose", diagnose_pipeline)

# Define Edges
workflow.set_entry_point("evaluate_retrieval")
workflow.add_edge("evaluate_retrieval", "evaluate_generation")
workflow.add_edge("evaluate_generation", "diagnose")
workflow.add_edge("diagnose", END)

# Compile
eval_graph = workflow.compile()

Step 4: Run the Real-Time Use Case

Let's simulate a real-time customer query entering the monitoring pipeline.

production_log = {
    "question": "What is the return policy for an opened MacBook Pro?",
    "ground_truth_context": [
        "Opened electronics like laptops can be returned within 14 days if in original condition.",
        "MacBook Pro returns require all original accessories and packaging."
    ],

    # Simulating a bad retrieval
    "retrieved_context": [
        "The standard warranty for a MacBook Pro is 1 year from the date of purchase.",
        "AppleCare+ extends the warranty to 3 years and covers accidental damage."
    ],

    # LLM answering based on incorrect retrieval
    "generated_answer":
        "The return policy for an opened MacBook Pro is covered by the 1-year standard warranty, which can be extended with AppleCare+."
}

print("--- Running LangGraph Evaluation Pipeline ---")

final_state = eval_graph.invoke(production_log)

print(f"\n[Question]: {final_state['question']}")

print(
    f"\n[Retrieval Evaluation] "
    f"(Score: {final_state['retrieval_score']:.2f})"
)

print(f"Reasoning: {final_state['retrieval_reasoning']}")

print(
    f"\n[Generation Evaluation] "
    f"(Score: {final_state['generation_score']:.2f})"
)

print(f"Reasoning: {final_state['generation_reasoning']}")

print(f"\n[Final Diagnosis]: {final_state['diagnosis']}")

Analyzing the Output: The "Aha!" Moment

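The output will look similar to the following. Because the scores and reasoning come from an LLM judge, exact values may vary between runs: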

--- Running LangGraph Evaluation Pipeline ---

[Question]: What is the return policy for an opened MacBook Pro?

[Retrieval Evaluation] (Score: 0.10)
Reasoning: The retrieved context discusses warranties and AppleCare+, which are unrelated to the return policy question.

[Generation Evaluation] (Score: 0.90)
Reasoning: The generated answer is fully grounded in the retrieved context and does not hallucinate.

[Final Diagnosis]: RETRIEVAL FAILURE: Context is irrelevant. Check chunking, embeddings, or vector DB.

Why This Is a Game-Changer

A monolithic evaluator would simply report:

Answer Incorrect (Score: 0.2)

That tells you the answer is wrong but provides no insight into the root cause.

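By decoupling evaluation, we learn which component failed and which one to leave alone.

What We Learned

The retrieval score is 0.10, so the retriever returned irrelevant context.
The generation score is 0.90, so the LLM stayed faithful to the context it was given.
Do not modify the LLM prompt; the fault lies upstream, in retrieval.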

What Needs to Be Fixed

Improve vector search quality.
Review chunking strategy.
Revisit embedding models.
Implement hybrid search (semantic + keyword).
Add re-ranking before generation.

Taking It to Production (Real-Time Monitoring)

In production, this evaluation graph should operate asynchronously.

Async Evaluation

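The primary RAG application streams answers to users.
At the same time, it publishes (question, context, answer) tuples to a queue.
Live traffic rarely comes with ground-truth context, so pass an empty list for ground_truth_context or reserve that field for curated test sets.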

LangGraph Worker

Consumes events from the queue.
Executes the evaluation graph.
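
The producer and worker roles described above can be sketched with an in-process `asyncio.Queue`. This is a minimal sketch under stated assumptions: a real deployment would more likely use an external message broker, and `run_eval` is a hypothetical stand-in for `eval_graph.invoke` so the example runs without an API key.

```python
import asyncio
from typing import Dict, List


def run_eval(event: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical stand-in for eval_graph.invoke(event); no LLM is called here."""
    return {**event, "diagnosis": "SUCCESS: Pipeline performed well."}


async def publish(queue: asyncio.Queue, events: List[Dict[str, str]]) -> None:
    """Simulates the RAG app pushing interactions onto the queue."""
    for event in events:
        await queue.put(event)


async def worker(queue: asyncio.Queue, results: List[Dict[str, str]]) -> None:
    """Consumes events until cancelled; the blocking evaluation runs in a thread."""
    while True:
        event = await queue.get()
        try:
            results.append(await asyncio.to_thread(run_eval, event))
        finally:
            queue.task_done()


async def main() -> List[Dict[str, str]]:
    queue: asyncio.Queue = asyncio.Queue()
    results: List[Dict[str, str]] = []
    task = asyncio.create_task(worker(queue, results))
    await publish(queue, [
        {"question": "q1", "generated_answer": "a1"},
        {"question": "q2", "generated_answer": "a2"},
    ])
    await queue.join()  # wait until every published event has been evaluated
    task.cancel()       # stop the worker once the queue is drained
    return results


results = asyncio.run(main())
print(len(results), "interactions evaluated")
```

Running the evaluation in a thread keeps the event loop free while a slow judge call is in flight. Compiled LangGraph graphs also expose an async `ainvoke` method, which could replace the thread hand-off.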

Dashboarding

Push scores to:
- LangSmith
- Datadog
- Grafana
- Prometheus
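
Whichever dashboard is used, the per-interaction states usually need to be rolled up first. The sketch below assumes a batch of evaluation results is already in memory and collapses it into a small JSON summary. The `summarize` helper and the sample batch are hypothetical, and no vendor-specific client is shown.

```python
import json
from collections import Counter
from statistics import mean
from typing import Dict, List


def summarize(states: List[Dict]) -> Dict:
    """Collapse a non-empty batch of evaluation results into dashboard-friendly numbers."""
    return {
        "count": len(states),
        "mean_retrieval_score": round(mean(s["retrieval_score"] for s in states), 3),
        "mean_generation_score": round(mean(s["generation_score"] for s in states), 3),
        # Keep only the label before the colon, e.g. "RETRIEVAL FAILURE".
        "diagnoses": dict(Counter(s["diagnosis"].split(":")[0] for s in states)),
    }


# Hypothetical evaluation results; real values would come from the evaluation graph.
batch = [
    {"retrieval_score": 0.1, "generation_score": 0.9,
     "diagnosis": "RETRIEVAL FAILURE: Context is irrelevant."},
    {"retrieval_score": 0.8, "generation_score": 0.9,
     "diagnosis": "SUCCESS: Pipeline performed well."},
    {"retrieval_score": 0.9, "generation_score": 0.3,
     "diagnosis": "GENERATION FAILURE: Hallucination detected."},
]
summary = summarize(batch)
print(json.dumps(summary, indent=2))
```

Tracking the two means separately preserves the decoupling: a drop in one while the other stays flat points at a single component.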

Alerting

Examples:

If retrieval_score < 0.6 for more than 50 queries per hour, alert the Data Engineering team.
If generation_score < 0.6, alert the ML team to investigate prompt drift or model degradation.
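
A rate-based rule such as "more than 50 low scores per hour" can be sketched as a sliding-window counter. The `LowScoreAlert` class below is a hypothetical illustration: it assumes timestamps arrive in non-decreasing order and takes them as explicit arguments so the logic is easy to test.

```python
from collections import deque
from typing import Deque


class LowScoreAlert:
    """Fires when more than `max_low` low scores fall within the last `window_s` seconds."""

    def __init__(self, threshold: float = 0.6, max_low: int = 50, window_s: float = 3600.0):
        self.threshold = threshold
        self.max_low = max_low
        self.window_s = window_s
        self._low_times: Deque[float] = deque()

    def record(self, score: float, now: float) -> bool:
        """Record one score at timestamp `now` (seconds); return True if the alert fires."""
        if score < self.threshold:
            self._low_times.append(now)
        # Drop low-score events that have aged out of the window.
        while self._low_times and now - self._low_times[0] > self.window_s:
            self._low_times.popleft()
        return len(self._low_times) > self.max_low


# Small numbers for readability: alert on more than 2 low scores per 10 seconds.
alert = LowScoreAlert(threshold=0.6, max_low=2, window_s=10.0)
events = [(0, 0.1), (1, 0.2), (2, 0.9), (3, 0.3), (20, 0.1)]
fired = [alert.record(score, now) for now, score in events]
print(fired)
```

One instance per metric (retrieval and generation) keeps the alerts routed to the right team, in line with the examples above.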

Conclusion

Evaluating RAG as a single black box creates unnecessary complexity and slows down troubleshooting. By separating retrieval evaluation from generation evaluation, teams gain precise visibility into where failures occur.

Using LangGraph, you can build stateful evaluation workflows that not only measure performance but also diagnose problems automatically. This approach makes RAG systems more reliable, maintainable, and production-ready.