Introduction
When evaluating a Retrieval-Augmented Generation (RAG) system, engineers often fall into the Monolithic Evaluation Trap. They look at the final answer and ask, "Is this answer correct?"
While this tells you if the system failed, it doesn't tell you why it failed. Did the retriever fetch the wrong documents? Or did the retriever fetch the perfect documents, but the LLM hallucinate the final answer?
To build production-grade RAG systems, you must decouple retrieval evaluation from generation evaluation. In this article, we build an independent evaluation pipeline with LangGraph, end to end, so you can pinpoint the exact bottleneck in a real-world e-commerce customer support use case.
The Real-World Use Case: "ElectroSupport"
Imagine you are the Lead AI Engineer for TechGadgets Inc., an e-commerce retailer. You have deployed a RAG-powered customer support bot called ElectroSupport.
Customers ask complex questions such as "What is the return policy for an opened MacBook Pro?"
The Problem
Customer satisfaction is dropping. When you look at the logs, some answers are completely wrong.
You need an automated, real-time evaluation pipeline that runs in the background, scoring every interaction to tell you exactly which component is degrading.
The Metrics: Decoupling the Pipeline
To evaluate independently, we divide the metrics into two distinct categories.
1. Retrieval Quality (Did We Get the Right Context?)
Context Precision
Of the documents retrieved, how many are actually relevant to the question?
Measures ranking quality and relevance.
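Context Recall
Of the relevant documents in the ground truth context, how many did the retriever actually return?
Measures coverage: whether anything needed to answer the question is missing.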
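To make the two retrieval metrics concrete, here is a minimal set-based sketch. It assumes relevance can be decided by exact match against labeled ground truth chunks, which is a simplification: the pipeline in this article uses an LLM judge instead. The function names and the short chunk IDs are illustrative only.

```python
from typing import List


def context_precision(retrieved: List[str], relevant: List[str]) -> float:
    """Fraction of retrieved chunks that are relevant."""
    if not retrieved:
        return 0.0
    relevant_set = set(relevant)
    hits = sum(1 for chunk in retrieved if chunk in relevant_set)
    return hits / len(retrieved)


def context_recall(retrieved: List[str], relevant: List[str]) -> float:
    """Fraction of relevant chunks that were actually retrieved."""
    if not relevant:
        return 0.0
    retrieved_set = set(retrieved)
    hits = sum(1 for chunk in relevant if chunk in retrieved_set)
    return hits / len(relevant)


# Hypothetical chunk IDs standing in for real document chunks.
retrieved = ["return-policy", "warranty", "applecare"]
relevant = ["return-policy", "return-accessories"]

precision = context_precision(retrieved, relevant)  # 1 of 3 retrieved is relevant
recall = context_recall(retrieved, relevant)        # 1 of 2 relevant was retrieved
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A retriever can score well on one metric and poorly on the other, which is why both are tracked.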
2. Generation Quality (Did the LLM Do a Good Job with the Context?)
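Faithfulness
Is every claim in the generated answer supported by the retrieved context?
Measures hallucination.
Answer Relevancy
Does the generated answer actually address the question that was asked?
Measures how directly the response serves the user's intent.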
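Faithfulness is commonly expressed as the share of claims in the answer that the retrieved context supports. The sketch below assumes the claims have already been extracted and judged (by an LLM judge or a human labeler); the `faithfulness` helper and the example verdicts are hypothetical.

```python
from typing import List


def faithfulness(claim_supported: List[bool]) -> float:
    """Share of answer claims that the retrieved context supports."""
    if not claim_supported:
        # Assumption: an answer with no checkable claims scores 0.0.
        return 0.0
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts for three claims in one answer:
# two are grounded in the context, one is not.
verdicts = [True, True, False]
score = faithfulness(verdicts)
print(f"faithfulness={score:.2f}")
```

Note that this score says nothing about whether the context itself was correct, which is exactly why it is evaluated separately from retrieval.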
The LangGraph Architecture
LangGraph fits this scenario well because evaluation is a stateful, multi-step workflow: each node reads the shared state, adds its own scores, and passes the result on.
We will build an Evaluation Graph that takes the user question, retrieved context, generated answer, and ground truth context, and routes them through independent LLM-as-a-Judge nodes.
Graph Flow
Input State → Receives question, retrieved_context, generated_answer, and ground_truth_context.
Evaluate Retrieval Node → Calculates Context Precision.
Evaluate Generation Node → Calculates Faithfulness.
Diagnostic Router → Categorizes failures such as Retrieval Failure, Hallucination, or Success.
End-to-End Implementation
Prerequisites
pip install langgraph langchain-openai langchain-core pydantic
Note: Ensure your OPENAI_API_KEY is configured in the environment.
Step 1: Define the State and Prompts
We use Pydantic for structured output so that LLM judges return clean, parseable scores.
import os
from typing import List, Literal
from typing_extensions import TypedDict
from pydantic import BaseModel, Field
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langgraph.graph import StateGraph, END
# Initialize LLM
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
# --- Pydantic Models for Structured Evaluation ---
class RetrievalScore(BaseModel):
    reasoning: str = Field(description="Brief reasoning for the score")
    precision_score: float = Field(description="Score from 0.0 to 1.0")


class GenerationScore(BaseModel):
    reasoning: str = Field(description="Brief reasoning for the score")
    faithfulness_score: float = Field(description="Score from 0.0 to 1.0")


# --- LangGraph State ---
class EvalState(TypedDict):
    # Inputs
    question: str
    ground_truth_context: List[str]
    retrieved_context: List[str]
    generated_answer: str
    # Outputs written by the evaluation nodes
    retrieval_score: float
    retrieval_reasoning: str
    generation_score: float
    generation_reasoning: str
    diagnosis: str
Step 2: Build the Evaluation Nodes
The retrieval evaluator only examines the question and contexts, while the generation evaluator only examines the retrieved context and generated answer.
Node 1: Evaluate Retrieval (Context Precision)
def evaluate_retrieval(state: EvalState) -> dict:
    # Judge the retrieved context against the ground truth; the answer is not shown.
    prompt = ChatPromptTemplate.from_template(
        """You are an expert evaluator. Evaluate the RETRIEVAL quality.
Question: {question}
Ground Truth Context: {ground_truth}
Retrieved Context: {retrieved}
Score the precision from 0.0 to 1.0.
1.0 means all retrieved docs are highly relevant.
0.0 means none are relevant.
"""
    )
    structured_llm = llm.with_structured_output(RetrievalScore)
    chain = prompt | structured_llm
    result = chain.invoke({
        "question": state["question"],
        "ground_truth": "\n".join(state["ground_truth_context"]),
        "retrieved": "\n".join(state["retrieved_context"]),
    })
    # Return only the keys this node owns; LangGraph merges them into the state.
    return {
        "retrieval_score": result.precision_score,
        "retrieval_reasoning": result.reasoning,
    }
Node 2: Evaluate Generation (Faithfulness)
def evaluate_generation(state: EvalState) -> dict:
    # Judge the answer against the retrieved context only; ground truth is not shown.
    prompt = ChatPromptTemplate.from_template(
        """You are an expert evaluator. Evaluate the GENERATION quality.
Retrieved Context: {retrieved}
Generated Answer: {answer}
Score the faithfulness from 0.0 to 1.0.
1.0 means the answer is 100% supported by the context.
0.0 means none of the answer's claims are supported by the context.
"""
    )
    structured_llm = llm.with_structured_output(GenerationScore)
    chain = prompt | structured_llm
    result = chain.invoke({
        "retrieved": "\n".join(state["retrieved_context"]),
        "answer": state["generated_answer"],
    })
    return {
        "generation_score": result.faithfulness_score,
        "generation_reasoning": result.reasoning,
    }
Node 3: Diagnostic Router
def diagnose_pipeline(state: EvalState) -> dict:
    ret_score = state["retrieval_score"]
    gen_score = state["generation_score"]
    # Thresholds for production alerting
    if ret_score < 0.6 and gen_score < 0.6:
        diagnosis = "CRITICAL: Both Retrieval and Generation failed."
    elif ret_score < 0.6:
        diagnosis = (
            "RETRIEVAL FAILURE: Context is irrelevant. "
            "Check chunking, embeddings, or vector DB."
        )
    elif gen_score < 0.6:
        diagnosis = (
            "GENERATION FAILURE: Hallucination detected. "
            "Check LLM prompt or temperature."
        )
    else:
        diagnosis = "SUCCESS: Pipeline performed well."
    return {"diagnosis": diagnosis}
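Because the router is plain Python with no LLM call, its branching can be checked with synthetic scores before it is wired into the graph. The sketch below is a hypothetical standalone variant: `diagnose_scores`, `FAILURE_THRESHOLD`, and the shortened labels are illustrative names, and only the 0.6 cut-off is taken from the router.

```python
FAILURE_THRESHOLD = 0.6  # same cut-off as the router; tune per system


def diagnose_scores(
    retrieval_score: float,
    generation_score: float,
    threshold: float = FAILURE_THRESHOLD,
) -> str:
    """Map a pair of scores to one of four diagnostic labels."""
    retrieval_ok = retrieval_score >= threshold
    generation_ok = generation_score >= threshold
    if not retrieval_ok and not generation_ok:
        return "CRITICAL"
    if not retrieval_ok:
        return "RETRIEVAL FAILURE"
    if not generation_ok:
        return "GENERATION FAILURE"
    return "SUCCESS"


# One synthetic case per branch, plus the boundary value.
cases = {
    (0.1, 0.2): "CRITICAL",
    (0.1, 0.9): "RETRIEVAL FAILURE",
    (0.9, 0.1): "GENERATION FAILURE",
    (0.9, 0.9): "SUCCESS",
    (0.6, 0.6): "SUCCESS",  # a score exactly at the threshold passes
}
for (ret, gen), expected in cases.items():
    assert diagnose_scores(ret, gen) == expected
print("all router cases passed")
```

Keeping the threshold in one named constant also makes it easier to adjust once real score distributions are available.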
Step 3: Compile the LangGraph
Now we connect all nodes into a Directed Acyclic Graph (DAG).
workflow = StateGraph(EvalState)
# Add Nodes
workflow.add_node("evaluate_retrieval", evaluate_retrieval)
workflow.add_node("evaluate_generation", evaluate_generation)
workflow.add_node("diagnose", diagnose_pipeline)
# Define Edges
workflow.set_entry_point("evaluate_retrieval")
workflow.add_edge("evaluate_retrieval", "evaluate_generation")
workflow.add_edge("evaluate_generation", "diagnose")
workflow.add_edge("diagnose", END)
# Compile
eval_graph = workflow.compile()
Step 4: Run the Real-Time Use Case
Let's simulate a real-time customer query entering the monitoring pipeline.
production_log = {
    "question": "What is the return policy for an opened MacBook Pro?",
    "ground_truth_context": [
        "Opened electronics like laptops can be returned within 14 days if in original condition.",
        "MacBook Pro returns require all original accessories and packaging.",
    ],
    # Simulating a bad retrieval
    "retrieved_context": [
        "The standard warranty for a MacBook Pro is 1 year from the date of purchase.",
        "AppleCare+ extends the warranty to 3 years and covers accidental damage.",
    ],
    # LLM answering based on incorrect retrieval
    "generated_answer": (
        "The return policy for an opened MacBook Pro is covered by the 1-year "
        "standard warranty, which can be extended with AppleCare+."
    ),
}
print("--- Running LangGraph Evaluation Pipeline ---")
final_state = eval_graph.invoke(production_log)
print(f"\n[Question]: {final_state['question']}")
print(
    f"\n[Retrieval Evaluation] "
    f"(Score: {final_state['retrieval_score']:.2f})"
)
print(f"Reasoning: {final_state['retrieval_reasoning']}")
print(
    f"\n[Generation Evaluation] "
    f"(Score: {final_state['generation_score']:.2f})"
)
print(f"Reasoning: {final_state['generation_reasoning']}")
print(f"\n[Final Diagnosis]: {final_state['diagnosis']}")
Analyzing the Output: The "Aha!" Moment
The output would look similar to:
--- Running LangGraph Evaluation Pipeline ---
[Question]: What is the return policy for an opened MacBook Pro?
[Retrieval Evaluation] (Score: 0.10)
Reasoning: The retrieved context discusses warranties and AppleCare+, which are unrelated to the return policy question.
[Generation Evaluation] (Score: 0.90)
Reasoning: The generated answer is fully grounded in the retrieved context and does not hallucinate.
[Final Diagnosis]: RETRIEVAL FAILURE: Context is irrelevant. Check chunking, embeddings, or vector DB.
Why This Is a Game-Changer
A monolithic evaluator would simply report:
Answer Incorrect (Score: 0.2)
That tells you the answer is wrong but provides no insight into the root cause.
Decoupled evaluation, by contrast, points at the component that failed.
What We Learned
The retrieval score is 0.10, so the retriever returned irrelevant context.
The generation score is 0.90, so the LLM stayed faithful to the context it was given.
Do not modify the LLM prompt; the fault lies upstream in retrieval.
What Needs to Be Fixed
Improve vector search quality.
Review chunking strategy.
Revisit embedding models.
Implement hybrid search (semantic + keyword).
Add re-ranking before generation.
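As one illustration of the hybrid search item, reciprocal rank fusion (RRF) is a common way to merge a semantic ranking with a keyword ranking. This is a sketch of that general technique, not something ElectroSupport is stated to use; the document IDs are hypothetical, and `k = 60` is a conventional default rather than a tuned value.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Merge several ranked lists; each document earns 1 / (k + rank) per list."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for the MacBook Pro return-policy question.
semantic_ranking = ["warranty", "applecare", "return-policy"]
keyword_ranking = ["return-policy", "return-accessories"]

fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(fused)
```

In this toy example the return-policy chunk appears in both lists, so it rises to the top even though the semantic ranking alone placed it last.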
Taking It to Production (Real-Time Monitoring)
In production, this evaluation graph should operate asynchronously.
Async Evaluation
The primary RAG application streams answers to users and, at the same time, publishes (question, context, answer) tuples to a queue.
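LangGraph Worker
A background worker consumes tuples from the queue and runs the evaluation graph on each one, so scoring never blocks the user-facing response.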
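The queue-and-worker pattern can be sketched with the standard library alone. This is a simplified in-process version under stated assumptions: a production setup would more likely use an external message broker, and `evaluate_stub` is a hypothetical placeholder for a call such as eval_graph.invoke(record). It requires Python 3.9+ for asyncio.to_thread.

```python
import asyncio
from typing import Dict, List


def evaluate_stub(record: Dict[str, str]) -> Dict[str, str]:
    # Placeholder for the real (blocking) evaluation call.
    return {"question": record["question"], "diagnosis": "SUCCESS"}


async def evaluation_worker(queue: asyncio.Queue, results: List[dict]) -> None:
    """Consume records forever, evaluating each one off the event loop."""
    while True:
        record = await queue.get()
        try:
            results.append(await asyncio.to_thread(evaluate_stub, record))
        finally:
            queue.task_done()


async def main() -> List[dict]:
    queue: asyncio.Queue = asyncio.Queue()
    results: List[dict] = []
    worker = asyncio.create_task(evaluation_worker(queue, results))

    # The RAG application would publish one record per answered query.
    for question in ["Q1", "Q2"]:
        await queue.put(
            {"question": question, "context": "retrieved text", "answer": "generated text"}
        )

    await queue.join()  # wait until every queued record has been evaluated
    worker.cancel()     # then shut the worker down
    await asyncio.gather(worker, return_exceptions=True)
    return results


results = asyncio.run(main())
print(results)
```

The user-facing path only pays the cost of a queue put; the evaluation latency stays entirely on the worker side.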
Dashboarding
Push scores to:
LangSmith
Datadog
Grafana
Prometheus
Alerting
Examples:
If retrieval_score < 0.6 for more than 50 queries per hour, alert the Data Engineering team.
If generation_score < 0.6, alert the ML team to investigate prompt drift or model degradation.
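A rate-based rule like the first example needs a sliding window over recent low scores. The sketch below is one possible in-memory approach; the `LowScoreAlert` class is hypothetical, and it assumes timestamps arrive in non-decreasing order. A real deployment would more likely express this rule in the monitoring tool itself.

```python
from collections import deque
from typing import Deque


class LowScoreAlert:
    """Fires when more than max_low scores fall below threshold within the window."""

    def __init__(
        self,
        threshold: float = 0.6,
        max_low: int = 50,
        window_seconds: float = 3600.0,
    ) -> None:
        self.threshold = threshold
        self.max_low = max_low
        self.window_seconds = window_seconds
        self._low_timestamps: Deque[float] = deque()

    def record(self, score: float, timestamp: float) -> bool:
        """Record one score and return True if the alert should fire."""
        # Forget low-score events that have aged out of the window.
        cutoff = timestamp - self.window_seconds
        while self._low_timestamps and self._low_timestamps[0] <= cutoff:
            self._low_timestamps.popleft()
        if score < self.threshold:
            self._low_timestamps.append(timestamp)
        return len(self._low_timestamps) > self.max_low


# Small numbers for demonstration: alert on more than 2 low scores per 60 seconds.
alert = LowScoreAlert(max_low=2, window_seconds=60.0)
fired = [alert.record(0.1, t) for t in (0.0, 10.0, 20.0)]
recovered = alert.record(0.9, 100.0)  # old low scores have expired by now
print(fired, recovered)
```

The same class can track generation_score with its own threshold and window, keeping the two alert streams separate.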
Conclusion
Evaluating RAG as a single black box hides the root cause of failures and slows down troubleshooting. By separating retrieval evaluation from generation evaluation, teams gain precise visibility into where failures occur.
Using LangGraph, you can build stateful evaluation workflows that not only measure performance but also diagnose problems automatically. This approach makes RAG systems more reliable, maintainable, and production-ready.