For years, LLM evaluation has been treated as a post-hoc activity: generate a batch of outputs, run them through a scoring script (BLEU, ROUGE, or an LLM-as-a-Judge), and calculate an aggregate metric. This static paradigm is useful for benchmarking, but it does nothing to improve an output while it is being produced.
What if evaluation wasn't just a measurement, but a control signal?
Enter Stateful Evaluation Loops. By embedding the evaluator directly into the generation pipeline, we can create systems that generate, critique, reflect, and refine autonomously. LangGraph, with its native support for cyclic graphs, shared state, and conditional routing, is a natural architectural foundation for this pattern.
This article provides an end-to-end guide to building a stateful evaluation loop in LangGraph, complete with production-ready code and deep architectural insights.
The Architecture of a Stateful Eval Loop
A stateful evaluation loop transforms a linear pipeline into a cyclic graph. The core components are:
The State Schema: A shared, mutable memory that tracks the input, current output, evaluation feedback, and attempt count.
The Generator Node: Produces the initial response or refines the response based on historical feedback.
The Evaluator Node: Acts as an "LLM-as-a-Judge," analyzing the current output against predefined criteria and returning structured feedback (Pass/Fail + Critique).
The Conditional Router: The brain of the loop. It decides whether to accept the output (exit the loop) or send it back to the Generator with the evaluator's feedback (continue the loop), while enforcing a max_attempts guardrail.
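Before bringing in LangGraph, the control flow of these four components can be sketched in plain Python. This is a minimal illustration, not the implementation used later: run_eval_loop, stub_generate, and stub_evaluate are hypothetical names, and the stubs stand in for real LLM calls.

```python
from typing import Callable, List, Tuple


def run_eval_loop(
    user_input: str,
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str, str], Tuple[bool, str]],
    max_attempts: int = 3,
) -> dict:
    """Generate, evaluate, and revise until the evaluator passes or attempts run out."""
    feedback_history: List[str] = []
    response = ""
    passed = False
    attempts = 0
    while not passed and attempts < max_attempts:
        response = generate(user_input, feedback_history)  # generator step
        attempts += 1
        passed, feedback = evaluate(user_input, response)  # evaluator step
        if not passed:
            feedback_history.append(feedback)  # state carried into the next pass
    return {
        "response": response,
        "passed": passed,
        "attempts": attempts,
        "feedback_history": feedback_history,
    }


def stub_generate(user_input: str, feedback_history: List[str]) -> str:
    # Stand-in for an LLM: rambles first, tightens up once it has feedback
    if feedback_history:
        return "short answer"
    return "a very long and rambling answer " * 5


def stub_evaluate(user_input: str, response: str) -> Tuple[bool, str]:
    # Stand-in for an LLM judge: fails anything longer than 40 characters
    if len(response) > 40:
        return False, "Too verbose; tighten the answer."
    return True, ""


result = run_eval_loop("Explain X", stub_generate, stub_evaluate)
print(result["passed"], result["attempts"])
```

The while condition plays the role of the conditional router: it exits on a pass or when the attempt budget is spent, whichever comes first.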
Step-by-Step Implementation
We will build a system that generates a technical explanation, evaluates it for clarity and accuracy, and iteratively refines it until it passes the evaluation or hits a maximum attempt limit.
1. Setup and Dependencies
Ensure you have the latest versions of LangGraph and LangChain. We will use pydantic for strict structured output from our evaluator, which is critical for preventing loop-breaking parsing errors.
pip install langgraph langchain-openai pydantic
2. Defining the State Schema
The state is the single source of truth. Notice how we accumulate feedback_history. This lets the generator see its past mistakes on every revision, which makes it less likely to repeat the same error.
from typing import List, TypedDict


class EvaluationState(TypedDict):
    user_input: str
    current_response: str
    # Accumulates feedback. No reducer is attached: nodes return the full,
    # updated list and it overwrites the previous value. (add_messages is
    # meant for chat message lists and would coerce these strings into messages.)
    feedback_history: List[str]
    is_passed: bool
    attempts: int
    max_attempts: int
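Whether a state key carries a reducer determines what a node should return for it. The sketch below is a simplified model of that merge step, not LangGraph's actual implementation; apply_update is a hypothetical helper, and operator.add stands in for a list-appending reducer.

```python
import operator
from typing import Any, Callable, Dict


def apply_update(
    state: Dict[str, Any],
    update: Dict[str, Any],
    reducers: Dict[str, Callable[[Any, Any], Any]],
) -> Dict[str, Any]:
    """Return a new state: keys with a reducer are combined, all others are overwritten."""
    merged = dict(state)
    for key, value in update.items():
        if key in reducers:
            merged[key] = reducers[key](state[key], value)
        else:
            merged[key] = value
    return merged


state = {"feedback_history": ["too long"], "attempts": 1}
full_list = state["feedback_history"] + ["cite sources"]

# Overwrite semantics: the node returns the full list
overwritten = apply_update(state, {"feedback_history": full_list}, reducers={})

# Reducer semantics: the node returns only the new item
reduced = apply_update(
    state,
    {"feedback_history": ["cite sources"]},
    reducers={"feedback_history": operator.add},
)

# Mixing the two styles duplicates earlier entries
duplicated = apply_update(
    state,
    {"feedback_history": full_list},
    reducers={"feedback_history": operator.add},
)

print(overwritten["feedback_history"])
print(reduced["feedback_history"])
print(duplicated["feedback_history"])
```

Either style works as long as the schema and the nodes agree. Returning the whole list from a node whose key also has an appending reducer is the combination to avoid.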
3. The Generator Node
The generator is state-aware. On the first attempt, it uses only the user_input. On subsequent attempts, it injects the feedback_history into the system prompt, alongside the previous response, to tell the LLM what to improve.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

# Use a capable model for generation
generator_llm = ChatOpenAI(model="gpt-4o", temperature=0.7)


def generator_node(state: EvaluationState):
    attempts = state["attempts"]
    if attempts == 0:
        # First attempt: just answer the prompt
        prompt = ChatPromptTemplate.from_messages([
            ("system", "You are an expert technical writer. Provide a clear, accurate response."),
            ("human", "{input}")
        ])
        messages = prompt.format_messages(input=state["user_input"])
    else:
        # Subsequent attempts: incorporate feedback
        feedback_context = "\n".join([f"- Attempt {i+1} Feedback: {fb}" for i, fb in enumerate(state["feedback_history"])])
        prompt = ChatPromptTemplate.from_messages([
            ("system", "You are an expert technical writer. Improve your previous response based on the following feedback.\n\nFeedback History:\n{feedback}"),
            ("human", "Original Request: {input}\n\nYour Previous Response: {prev_response}\n\nProvide the improved response.")
        ])
        messages = prompt.format_messages(
            feedback=feedback_context,
            input=state["user_input"],
            prev_response=state["current_response"]
        )
    response = generator_llm.invoke(messages)
    return {
        "current_response": response.content,
        "attempts": attempts + 1,
        "is_passed": False  # Reset to False, evaluator will update this
    }
4. The Evaluator Node (LLM-as-a-Judge)
Insight: Avoid raw text parsing for evaluators in a loop. A single malformed JSON response can crash your graph mid-iteration. Use Pydantic structured output so the evaluator's reply is validated against a schema before the router ever sees it.
from pydantic import BaseModel, Field


class EvaluationResult(BaseModel):
    passed: bool = Field(description="True if the response meets all quality criteria, False otherwise.")
    feedback: str = Field(description="Specific, actionable feedback for improvement if failed. Empty if passed.")


# Use a fast, reliable model for evaluation to save cost/latency
evaluator_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0.0).with_structured_output(EvaluationResult)


def evaluator_node(state: EvaluationState):
    prompt = ChatPromptTemplate.from_messages([
        ("system", """You are a strict technical evaluator. Evaluate the response based on:
1. Accuracy: Are the technical details correct?
2. Clarity: Is it easy to understand for a mid-level developer?
3. Conciseness: Is it free of unnecessary fluff?
Return a structured evaluation."""),
        ("human", "Request: {input}\n\nResponse to evaluate: {response}")
    ])
    messages = prompt.format_messages(
        input=state["user_input"],
        response=state["current_response"]
    )
    result: EvaluationResult = evaluator_llm.invoke(messages)
    # Update state with evaluation results; append feedback only on failure
    feedback_history = state["feedback_history"]
    if not result.passed:
        feedback_history = feedback_history + [result.feedback]
    return {
        "is_passed": result.passed,
        "feedback_history": feedback_history
    }
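Structured output narrows the failure surface but does not remove it: the judge call can still raise, for example on a validation or network error. One possible policy, sketched here with hypothetical names (safe_evaluate and a judge callable standing in for the structured LLM call), is to convert any exception into a failed verdict so the attempt guardrail ends the loop instead of an unhandled error.

```python
from typing import Callable, Tuple

Verdict = Tuple[bool, str]  # (passed, feedback)


def safe_evaluate(judge: Callable[[str], Verdict], response: str) -> Verdict:
    """Run the judge and turn any exception into a failed verdict."""
    try:
        return judge(response)
    except Exception as exc:  # assumption: any judge failure counts as "not verified"
        return False, f"Evaluator error ({type(exc).__name__}); response could not be verified."


def broken_judge(response: str) -> Verdict:
    # Stand-in for a judge whose reply failed schema validation
    raise ValueError("malformed evaluator reply")


def healthy_judge(response: str) -> Verdict:
    return True, ""


print(safe_evaluate(broken_judge, "some response"))
print(safe_evaluate(healthy_judge, "some response"))
```

The trade-off is that an evaluator outage consumes attempts. Retrying the judge before falling back to a failed verdict is a reasonable alternative.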
5. The Conditional Router
This is where the "loop" magic happens. We define the edges that dictate the flow of the graph.
from langgraph.graph import StateGraph, END


def router(state: EvaluationState):
    if state["is_passed"]:
        return "accept"
    elif state["attempts"] >= state["max_attempts"]:
        return "max_attempts_reached"
    else:
        return "revise"


# Build the graph
workflow = StateGraph(EvaluationState)

# Add nodes
workflow.add_node("generator", generator_node)
workflow.add_node("evaluator", evaluator_node)

# Define edges
workflow.set_entry_point("generator")
workflow.add_edge("generator", "evaluator")

# Conditional edges from evaluator
workflow.add_conditional_edges(
    "evaluator",
    router,
    {
        "accept": END,
        "max_attempts_reached": END,  # Or route to a human-in-the-loop node
        "revise": "generator"
    }
)

# Compile the graph
app = workflow.compile()
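Because the router is a pure function of the state, its decision table can be checked with plain dictionaries, without calling an LLM or building a graph. The sketch below repeats the router logic so that it runs on its own; the cases list is illustrative.

```python
def router(state: dict) -> str:
    # Same decision order as the graph's router: pass first, then the guardrail
    if state["is_passed"]:
        return "accept"
    elif state["attempts"] >= state["max_attempts"]:
        return "max_attempts_reached"
    else:
        return "revise"


cases = [
    ({"is_passed": True, "attempts": 1, "max_attempts": 3}, "accept"),
    ({"is_passed": False, "attempts": 1, "max_attempts": 3}, "revise"),
    ({"is_passed": False, "attempts": 3, "max_attempts": 3}, "max_attempts_reached"),
    # A pass on the final attempt is still accepted
    ({"is_passed": True, "attempts": 3, "max_attempts": 3}, "accept"),
]

for state, expected in cases:
    decision = router(state)
    print(state, "->", decision)
```

Checking is_passed before the attempt count matters: reversing the order would reject a response that passes on its last allowed attempt.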
6. Running the Loop
Let's test it with a prompt that typically induces hallucinations or overly verbose answers.
initial_state = {
    "user_input": "Explain how quantum entanglement can be used to build a faster-than-light communication device.",
    "current_response": "",
    "feedback_history": [],
    "is_passed": False,
    "attempts": 0,
    "max_attempts": 3
}

# Run the graph and stream per-node updates for observability.
# stream_mode="updates" yields {node_name: returned_update} after each node,
# so the evaluator's verdict is printed only after the evaluator has run.
for step in app.stream(initial_state, stream_mode="updates"):
    for node_name, update in step.items():
        if node_name == "generator":
            print(f"\n--- Attempt {update['attempts']} ---")
            print(update["current_response"][:200] + "...")  # Truncated for brevity
        elif node_name == "evaluator":
            print(f"\n[Evaluator] Passed: {update['is_passed']}")
            if not update["is_passed"] and update["feedback_history"]:
                print(f"[Evaluator] Feedback: {update['feedback_history'][-1]}")
Deep Insights & Advanced Patterns
Building the loop is only 20% of the battle. Making it robust for production requires addressing several nuanced challenges.
1. The "Hallucination Amplification" Trap
Insight: If your evaluator is weak or overly lenient, the generator might confidently double down on a hallucination, and the loop will accept it.
Solution: Implement Multi-Dimensional Evaluation. Instead of a single "Pass/Fail", have the evaluator check specific, verifiable claims. You can even use a deterministic tool (like a Python code executor or a vector database retrieval) as a co-evaluator alongside the LLM judge to ground the evaluation in facts.
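A minimal sketch of the co-evaluator idea follows, under stated assumptions: the deterministic checks here are deliberately simple string and length tests, and every name (Check, max_words, forbids_phrase, combine_verdicts) is hypothetical. A real deployment might swap in a code executor or a retrieval lookup behind the same interface.

```python
from typing import Callable, List, Optional, Tuple

# A check returns a failure message, or None when the response passes
Check = Callable[[str], Optional[str]]


def max_words(limit: int) -> Check:
    def check(response: str) -> Optional[str]:
        count = len(response.split())
        if count > limit:
            return f"Response has {count} words; the limit is {limit}."
        return None
    return check


def forbids_phrase(phrase: str) -> Check:
    def check(response: str) -> Optional[str]:
        if phrase.lower() in response.lower():
            return f"Response contains the disallowed claim: {phrase!r}."
        return None
    return check


def combine_verdicts(
    judge_passed: bool,
    judge_feedback: str,
    response: str,
    checks: List[Check],
) -> Tuple[bool, str]:
    """Pass only if the LLM judge and every deterministic check agree."""
    failures: List[str] = []
    if not judge_passed:
        failures.append(judge_feedback)
    for check in checks:
        message = check(response)
        if message is not None:
            failures.append(message)
    return (not failures, "\n".join(failures))


checks = [max_words(150), forbids_phrase("information travels faster than light")]
passed, feedback = combine_verdicts(
    judge_passed=True,  # a lenient judge waved the response through
    judge_feedback="",
    response="With entanglement, information travels faster than light between the two labs.",
    checks=checks,
)
print(passed)
print(feedback)
```

The deterministic checks act as a veto: a lenient judge cannot pass a response on its own, and the combined feedback gives the generator something concrete to fix.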
2. Evaluator-as-a-Service and Caching
Running an LLM evaluator on every iteration is expensive and adds latency.
Solution: Implement semantic caching for the evaluator node. If the current_response is semantically identical to a previously evaluated response, return the cached evaluation. Frameworks like GPTCache or custom Redis-backed semantic caches can reduce evaluator costs by up to 40% in iterative loops, though the actual savings depend on how often revisions repeat.
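The caching pattern can be sketched with the standard library alone. Note the simplification: this version matches on normalized text (case and whitespace), which is only a stand-in for true semantic matching with embeddings. EvaluationCache and the judge callable are hypothetical names.

```python
import hashlib
from typing import Callable, Dict, Tuple

Verdict = Tuple[bool, str]  # (passed, feedback)


class EvaluationCache:
    """Wraps a judge and reuses verdicts for responses that normalize to the same text."""

    def __init__(self, judge: Callable[[str], Verdict]) -> None:
        self._judge = judge
        self._store: Dict[str, Verdict] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(response: str) -> str:
        # Collapse whitespace and case so trivial variations share one entry
        normalized = " ".join(response.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def evaluate(self, response: str) -> Verdict:
        key = self._key(response)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        verdict = self._judge(response)
        self._store[key] = verdict
        return verdict


judge_calls = []


def counting_judge(response: str) -> Verdict:
    # Stand-in for the LLM judge; records each real call
    judge_calls.append(response)
    return False, "Too vague."


cache = EvaluationCache(counting_judge)
first = cache.evaluate("Hello   World")
second = cache.evaluate("hello world")
print(cache.hits, cache.misses, len(judge_calls))
```

In a real loop the cache key should also cover the original request and the evaluation criteria, since the same response text can deserve different verdicts for different prompts.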
3. Human-in-the-Loop (HITL) as a Fallback
What happens when max_attempts is reached? Silently failing is bad UX.
Solution: Instead of routing max_attempts_reached to END, route it to a human_review_node. LangGraph’s interrupt function is perfect here:
from langgraph.types import interrupt


def human_review_node(state: EvaluationState):
    # Note: when the graph resumes, this node re-runs from the top,
    # so these messages are printed again.
    print("\n[SYSTEM] Max attempts reached. Awaiting human review.")
    print(f"Final Response: {state['current_response']}")
    print(f"Feedback History: {state['feedback_history']}")
    # Pause execution and wait for human input via the API
    human_feedback = interrupt("Please provide final feedback or type 'ACCEPT' to approve.")
    if human_feedback.strip().upper() == "ACCEPT":
        return {"is_passed": True}
    else:
        # Add human feedback and allow one more generator attempt.
        # Resetting attempts to 0 would send the generator down its
        # first-attempt branch, which ignores feedback_history.
        return {
            "feedback_history": state["feedback_history"] + [f"HUMAN REVIEW: {human_feedback}"],
            "attempts": max(1, state["max_attempts"] - 1)
        }
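To use this, register the node, map "max_attempts_reached" to it in the evaluator's conditional edges, and route its output to END on approval or back to "generator" otherwise. interrupt requires a checkpointer, so compile with app = workflow.compile(checkpointer=memory), where memory is a checkpointer such as MemorySaver from langgraph.checkpoint.memory. Pass a thread_id in the run config, and resume a paused thread by invoking the graph with Command(resume=...) from langgraph.types.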
4. Dynamic Routing Based on Error Type
Not all feedback is equal. A "clarity" issue might be fixed by the LLM in one step, but an "accuracy" issue might require a tool call (e.g., web search).
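Insight: Structure your EvaluationResult Pydantic model to include an error_category (e.g., CLARITY, ACCURACY, FORMAT). Your router can then dynamically route to specialized nodes, for example sending ACCURACY failures to a retrieval step and CLARITY failures straight back to the generator.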
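A sketch of category-based routing, under stated assumptions: a dataclass stands in for the Pydantic model so the example runs with the standard library alone, and the "research" and "formatter" node names are hypothetical. Only "generator" exists in the graph built earlier.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorCategory(str, Enum):
    CLARITY = "clarity"
    ACCURACY = "accuracy"
    FORMAT = "format"


@dataclass
class CategorizedResult:
    # Stand-in for EvaluationResult extended with an error_category field
    passed: bool
    feedback: str
    error_category: Optional[ErrorCategory] = None


# Hypothetical mapping from failure type to the node best placed to fix it
ROUTES = {
    ErrorCategory.CLARITY: "generator",
    ErrorCategory.ACCURACY: "research",   # e.g. a web-search or retrieval node
    ErrorCategory.FORMAT: "formatter",
}


def to_state_update(result: CategorizedResult) -> dict:
    """What the evaluator node would write into the state."""
    return {"is_passed": result.passed, "error_category": result.error_category}


def category_router(state: dict) -> str:
    if state["is_passed"]:
        return "accept"
    if state["attempts"] >= state["max_attempts"]:
        return "max_attempts_reached"
    # Unknown or missing categories fall back to a plain revision
    return ROUTES.get(state.get("error_category"), "generator")


state = {"attempts": 1, "max_attempts": 3}
state.update(to_state_update(
    CategorizedResult(passed=False, feedback="Claim is wrong.", error_category=ErrorCategory.ACCURACY)
))
print(category_router(state))
```

Each returned label would need a matching entry in the conditional edge map, and the state schema would need an error_category key.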
Stateful evaluation loops represent a paradigm shift from measuring AI to steering AI. By treating evaluation not as a final exam, but as a continuous feedback mechanism embedded within LangGraph's cyclic architecture, we can build agents that are self-correcting, highly reliable, and capable of producing production-grade outputs on the first try (or the second, or the third).