You’ve built a RAG system. You’ve evaluated it, and your retrieval metrics are stellar: Context Precision is 0.92, Context Recall is 0.88. The vector database is successfully fetching the exact right documents (strong top-k). Yet, your end-users are still complaining. The final answers are hallucinated, overly verbose, or fail to answer the specific question.
Welcome to the Generation Bottleneck.
When retrieval is strong but generation is weak, the problem isn't your vector database; it’s how the LLM processes, reasons over, and synthesizes the retrieved context. In this end-to-end guide, we will diagnose this exact paradox and solve it using LangGraph. We will build a self-correcting RAG pipeline using a real-world enterprise use case.
The Real-World Use Case: Live CloudOps Incident Resolution
The Scenario: You are building an AI assistant for a DevOps team.
The User Query: "The payment-gateway pod in the us-east-1 cluster is throwing OOM (Out of Memory) kills. Based on our post-mortems, what was the root cause of the similar incident last month, and what is the exact kubectl command to apply the memory limit fix?"
The Basic RAG Failure:
Your retriever perfectly fetches the 3 relevant Jira tickets and Confluence post-mortems from last month. However, the basic RAG pipeline just dumps these documents into a prompt and asks the LLM to "answer the question."
The Fix: We need the LLM to not just generate an answer, but to evaluate its own answer against the user's specific constraints (e.g., "Did I actually provide the exact kubectl command?"). If it failed, it needs to try again. This requires cyclic state management, which is exactly what LangGraph is built for.
The Architecture: A Self-Correcting LangGraph RAG
Instead of a linear pipeline (Retrieve -> Generate -> Output), we will build a graph with a Reflection Loop.
Retrieve: Fetch the top-k documents (we assume this is already working well).
Generate: The LLM drafts an initial answer.
Reflect (The Critic): A separate LLM call evaluates the draft. Is it grounded in the docs? Did it answer all parts of the prompt?
Route: If the answer is perfect, output it. If it's flawed, pass the critique back to the Generator.
Revise: The Generator runs again with the critique added to its prompt and rewrites the answer.
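The control flow of the steps above can be sketched in plain Python before any LangGraph is involved. This is a minimal illustration, not the implementation built later in this guide: `reflection_loop`, `fake_generate`, and `fake_reflect` are hypothetical names, and the two "fake" functions are stand-ins for the real LLM calls.

```python
from typing import Callable, Tuple


def reflection_loop(
    generate: Callable[[str, str], str],
    reflect: Callable[[str, str], Tuple[bool, str]],
    query: str,
    max_iterations: int = 2,
) -> Tuple[str, int]:
    """Run generate -> reflect until the critic accepts or the cap is hit."""
    critique = ""
    draft = ""
    for iteration in range(1, max_iterations + 1):
        draft = generate(query, critique)             # Generate (or Revise)
        acceptable, critique = reflect(query, draft)  # Reflect
        if acceptable:                                # Route
            return draft, iteration
    # Cap reached: return the last draft even though it was not accepted.
    return draft, max_iterations


# Hypothetical stand-ins for the two LLM calls.
def fake_generate(query: str, critique: str) -> str:
    if "kubectl" in critique:
        return "Root cause: heap misconfiguration. Run: kubectl set resources <args>"
    return "Root cause: heap misconfiguration."


def fake_reflect(query: str, draft: str) -> Tuple[bool, str]:
    if "kubectl" in draft:
        return True, ""
    return False, "Include the exact kubectl command."


answer, iterations = reflection_loop(
    fake_generate, fake_reflect, "Which command fixes the OOM?"
)
```

The first draft omits the command, the critic rejects it, and the second pass uses the feedback. The graph version replaces the `for` loop with a conditional edge.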
Step-by-Step Implementation
Prerequisites
pip install langgraph langchain-openai langchain-core pydantic
Step 1: Define the State and Pydantic Models
In LangGraph, the State is the single source of truth that flows through your nodes. We will also use Pydantic to force the "Critic" LLM to output structured data.
from typing import TypedDict, List, Annotated
from langchain_core.documents import Document
from langchain_core.messages import HumanMessage, AIMessage, BaseMessage
import operator
# 1. Define the Graph State
class AgentState(TypedDict):
    query: str
    documents: List[Document]
    messages: Annotated[List[BaseMessage], operator.add]
    critique: str
    iterations: int
# 2. Pydantic model for the Critic's structured output
from pydantic import BaseModel, Field
class Critique(BaseModel):
    is_grounding_correct: bool = Field(description="Is the answer strictly based on the provided documents?")
    is_complete: bool = Field(description="Did the answer address ALL parts of the user's query?")
    feedback: str = Field(description="Specific, actionable feedback on what is missing or wrong.")
    is_acceptable: bool = Field(description="Overall pass/fail for the answer.")
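The `operator.add` annotation on `messages` is what makes node outputs append to the message list instead of replacing it, while plain keys such as `critique` are overwritten. The sketch below emulates that merge rule using only the standard library. It is a simplified illustration, not LangGraph's actual implementation, and `DemoState` and `apply_update` are hypothetical names.

```python
import operator
from typing import Annotated, List, TypedDict, get_type_hints


class DemoState(TypedDict):
    messages: Annotated[List[str], operator.add]  # reducer: concatenate lists
    critique: str                                 # no reducer: overwrite
    iterations: int                               # no reducer: overwrite


def apply_update(state: dict, update: dict) -> dict:
    """Merge a node's partial update into the state, honouring reducers."""
    hints = get_type_hints(DemoState, include_extras=True)
    merged = dict(state)
    for key, value in update.items():
        metadata = getattr(hints[key], "__metadata__", ())
        if metadata:
            reducer = metadata[0]
            merged[key] = reducer(state.get(key, []), value)
        else:
            merged[key] = value
    return merged


state = {"messages": ["draft 1"], "critique": "", "iterations": 1}
update = {
    "messages": ["[Internal Critique]: Completeness: Fail"],
    "critique": "Completeness: Fail",
}
merged = apply_update(state, update)
```

This is why each node later returns only the keys it changes: a one-element `messages` list is enough to append to the history.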
Step 2: Build the Nodes
Now we define the functions that will act as nodes in our graph.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
llm = ChatOpenAI(model="gpt-4o", temperature=0)
# --- NODE 1: GENERATE ---
def generate_answer(state: AgentState):
    query = state["query"]
    docs = state["documents"]
    context = "\n\n".join([doc.page_content for doc in docs])
    prompt = ChatPromptTemplate.from_template(
        """You are an expert DevOps engineer. Answer the user's query using ONLY the provided context.
Context:
{context}
Query: {query}
{critique_section}
"""
    )
    # If there is a critique from a previous iteration, include it
    critique_section = ""
    if state.get("critique"):
        critique_section = f"CRITIQUE OF PREVIOUS ATTEMPT:\n{state['critique']}\nPlease fix these issues in your new answer."
    chain = prompt | llm
    response = chain.invoke({
        "context": context,
        "query": query,
        "critique_section": critique_section
    })
    # Update iterations
    iterations = state.get("iterations", 0) + 1
    return {
        "messages": [response],
        "iterations": iterations
    }
# --- NODE 2: REFLECT (CRITIC) ---
def reflect_answer(state: AgentState):
    query = state["query"]
    docs = state["documents"]
    context = "\n\n".join([doc.page_content for doc in docs])
    last_message = state["messages"][-1].content
    prompt = ChatPromptTemplate.from_template(
        """You are a strict QA engineer reviewing a DevOps assistant's answer.
Context provided to the assistant:
{context}
User Query:
{query}
Assistant's Draft Answer:
{answer}
Evaluate the draft. Check if it is grounded in the context and if it fully answers the query.
"""
    )
    # Use structured output to force the LLM to use the Pydantic model
    structured_llm = llm.with_structured_output(Critique)
    chain = prompt | structured_llm
    critique_obj = chain.invoke({
        "context": context,
        "query": query,
        "answer": last_message
    })
    # Format the critique for the next generation step
    critique_text = f"Grounding: {'Pass' if critique_obj.is_grounding_correct else 'Fail'}\n"
    critique_text += f"Completeness: {'Pass' if critique_obj.is_complete else 'Fail'}\n"
    critique_text += f"Feedback: {critique_obj.feedback}"
    return {
        "critique": critique_text,
        "messages": [AIMessage(content=f"[Internal Critique]: {critique_text}")]
    }
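To make the effect of `critique_section` concrete, here is a dependency-free sketch of how the generator prompt differs between the first attempt and a retry. `build_prompt` and `PROMPT_TEMPLATE` are hypothetical names that mirror the template above using `str.format` rather than `ChatPromptTemplate`. The `previous_draft` parameter is an optional extension that is not part of the code above, where only the critique is fed back.

```python
from typing import Optional

PROMPT_TEMPLATE = (
    "You are an expert DevOps engineer. "
    "Answer the user's query using ONLY the provided context.\n"
    "Context:\n{context}\n"
    "Query: {query}\n"
    "{critique_section}"
)


def build_prompt(context: str, query: str, critique: str = "",
                 previous_draft: Optional[str] = None) -> str:
    """Assemble the generator prompt; the critique block is empty on attempt 1."""
    critique_section = ""
    if critique:
        critique_section = (
            "CRITIQUE OF PREVIOUS ATTEMPT:\n"
            f"{critique}\n"
            "Please fix these issues in your new answer."
        )
        # Hypothetical extension: also show the rejected draft to the model.
        if previous_draft is not None:
            critique_section = (
                f"PREVIOUS DRAFT:\n{previous_draft}\n" + critique_section
            )
    return PROMPT_TEMPLATE.format(
        context=context, query=query, critique_section=critique_section
    )


first_prompt = build_prompt("Post-mortem text", "What was the root cause?")
retry_prompt = build_prompt(
    "Post-mortem text",
    "What was the root cause?",
    critique="Completeness: Fail\nFeedback: Include the exact command.",
    previous_draft="The root cause was a heap misconfiguration.",
)
```

Passing the rejected draft back is a design choice: it lets the model edit rather than regenerate from scratch, at the cost of a longer prompt.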
Step 3: Define the Routing Logic and Graph
This is where LangGraph shines. We define the conditional edge that decides whether to loop back or finish.
from langgraph.graph import StateGraph, END
# --- ROUTING LOGIC ---
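def should_continue(state: AgentState):
    # Safety valve: prevent infinite loops
    if state.get("iterations", 0) >= 2:
        return "end"
    # Inspect only the two verdict lines written by reflect_answer
    # ("Grounding: ..." and "Completeness: ..."). Searching the whole message
    # for "Fail" would also match free-text feedback such as "Fails to mention ...".
    verdict_lines = state.get("critique", "").splitlines()[:2]
    if any(line.endswith("Fail") for line in verdict_lines):
        return "revise"
    return "end"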
# --- BUILD THE GRAPH ---
workflow = StateGraph(AgentState)
# Add nodes
workflow.add_node("generate", generate_answer)
workflow.add_node("reflect", reflect_answer)
# Define edges
workflow.set_entry_point("generate")
workflow.add_edge("generate", "reflect")
# Add conditional edge: loops back to generate if critique failed, otherwise ends
workflow.add_conditional_edges(
    "reflect",
    should_continue,
    {
        "revise": "generate",
        "end": END
    }
)
# Compile the graph
app = workflow.compile()
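An alternative to reading the verdict out of the critique text is to keep it as a boolean in the state and route on that. The sketch below assumes a hypothetical `is_acceptable` key that the reflect node would have to write. That key is not part of the `AgentState` defined in Step 1, and `VerdictState` and `route_on_verdict` are illustrative names.

```python
from typing import Literal, TypedDict

MAX_ITERATIONS = 2  # same cap as the safety valve in should_continue


class VerdictState(TypedDict, total=False):
    iterations: int
    is_acceptable: bool  # hypothetical key, not in the article's AgentState


def route_on_verdict(state: VerdictState) -> Literal["revise", "end"]:
    """Return the edge label for the conditional edge after the critic."""
    if state.get("iterations", 0) >= MAX_ITERATIONS:
        return "end"  # cap reached: stop even if the critic is unhappy
    # A missing verdict is treated as "not accepted".
    return "end" if state.get("is_acceptable", False) else "revise"


decisions = [
    route_on_verdict({"iterations": 1, "is_acceptable": False}),
    route_on_verdict({"iterations": 1, "is_acceptable": True}),
    route_on_verdict({"iterations": 2, "is_acceptable": False}),
]
```

Because the function only returns the labels `"revise"` and `"end"`, it could be used with the same edge mapping as above.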
Step 4: Execute the Real-Time Use Case
Let's simulate the CloudOps scenario. We will mock the retrieved documents to represent our "strong top-k retrieval".
# Mocking the "Strong Top-K Retrieval"
mock_documents = [
    Document(page_content="Incident INC-992: Payment gateway OOM. Root cause: Java heap space misconfigured in deployment.yaml. Fix applied by DevOps lead.", metadata={"date": "2026-05-12"}),
    Document(page_content="Post-mortem for INC-992: To prevent OOM, we updated the memory limits. Command used: `kubectl set resources deployment/payment-gateway -n prod --limits=memory=2Gi --requests=memory=1Gi`", metadata={"date": "2026-05-14"}),
    Document(page_content="General K8s guidelines: Always verify namespace before applying changes. Use `kubectl get pods -n <namespace>` to check status.", metadata={"date": "2026-01-01"})
]
# The complex user query
user_query = "The payment-gateway pod in us-east-1 is throwing OOM kills. Based on last month's post-mortems, what was the root cause, and what is the exact kubectl command to apply the memory limit fix?"
# Initial state
initial_state = {
    "query": user_query,
    "documents": mock_documents,
    "messages": [],
    "critique": "",
    "iterations": 0
}
# Run the graph
print("--- Running Self-Correcting RAG ---")
final_state = app.invoke(initial_state)
# Extract and print the final answer
final_answer = final_state["messages"][-2].content # -2 is the actual answer, -1 is the internal critique
print("\n--- FINAL ANSWER ---")
print(final_answer)
print(f"\nTotal Iterations: {final_state['iterations']}")
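Indexing with `[-2]` works because the critic always writes the last message. A slightly more defensive option is to walk the history backwards and skip anything tagged as an internal critique. The sketch below is a standalone illustration: `Msg` is a hypothetical stand-in for any message object with a `.content` attribute, and `latest_answer` is an illustrative helper name.

```python
from dataclasses import dataclass
from typing import List, Optional

CRITIQUE_PREFIX = "[Internal Critique]:"  # tag written by the reflect node


@dataclass
class Msg:
    content: str  # stand-in for message objects that expose .content


def latest_answer(messages: List[Msg]) -> Optional[str]:
    """Return the newest message that is not an internal critique."""
    for message in reversed(messages):
        if not message.content.startswith(CRITIQUE_PREFIX):
            return message.content
    return None


history = [
    Msg("Draft 1: the root cause was a heap misconfiguration."),
    Msg("[Internal Critique]: Completeness: Fail"),
    Msg("Draft 2: root cause plus the exact kubectl command."),
    Msg("[Internal Critique]: Completeness: Pass"),
]
final = latest_answer(history)
```

This keeps working if the message order changes, for example if a node is later added after the critic.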
Why This Fixes the "High Retrieval / Poor Generation" Problem
If you run a basic RAG pipeline on the query above, the LLM will often output:
"The root cause was a Java heap space misconfiguration. You can fix it by updating the memory limits in your deployment file using kubectl." (Fails to give the exact command).
By running it through our LangGraph Reflection Loop, the execution looks like this:
Iteration 1 (Generate): The LLM generates a vague answer, missing the exact kubectl set resources... command.
Iteration 1 (Reflect): The Critic LLM reads the draft and the context. It realizes the context contains the exact command, but the draft omitted it. It outputs: Completeness: Fail. Feedback: You failed to provide the exact kubectl command requested by the user.
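Iteration 2 (Generate): The Generator receives the query, the context, and the Critic's specific feedback (the previous draft itself is not passed back in this implementation). It writes a new answer that includes the command.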
Iteration 2 (Reflect): The Critic checks the new draft. Completeness: Pass.
Route: The graph routes to END. With the cap of 2 iterations, it would also stop here if the second critique had failed.
Final Output:
"Based on the post-mortem for incident INC-992, the root cause of the OOM kills was a misconfigured Java heap space in the deployment.yaml. To apply the memory limit fix, use the following exact command:
kubectl set resources deployment/payment-gateway -n prod --limits=memory=2Gi --requests=memory=1Gi"
Best Practices for Productionizing this LangGraph RAG
Cap the Iterations: Always include a safety valve (like iterations >= 2 in our code) to prevent infinite loops and runaway API costs.
Use Smaller Models for the Critic: The reflect node doesn't need GPT-4o. You can often use a faster, cheaper model (like GPT-4o-mini or Claude Haiku) for the reflection step to save latency and cost.
Combine with Reranking: While this article focuses on fixing the generation bottleneck, adding a Cross-Encoder Reranker (like Cohere Rerank or BGE) on the retrieval step can further improve answer quality by putting the most relevant documents first in the context.
Stream the Output: LangGraph supports streaming. You can stream the intermediate steps (the critiques) to your UI so the user sees the AI checking and correcting itself, which can help build user trust.
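The cost side of the iteration cap is simple arithmetic: every pass through the loop makes one generate call and one reflect call. A minimal sketch, with `worst_case_llm_calls` as a hypothetical helper and the assumption that each node makes exactly one LLM call:

```python
def worst_case_llm_calls(max_iterations: int, calls_per_iteration: int = 2) -> int:
    """Upper bound on LLM calls for one query: each pass runs generate + reflect."""
    if max_iterations < 1:
        raise ValueError("max_iterations must be at least 1")
    return max_iterations * calls_per_iteration


calls_with_cap_of_two = worst_case_llm_calls(2)   # the cap used in this guide
calls_with_cap_of_five = worst_case_llm_calls(5)  # a looser, costlier cap
```

With the cap of 2 used here, a single user query triggers at most 4 LLM calls, which is a useful number to have when budgeting latency and API spend.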
Conclusion
High retrieval scores only mean you've successfully found the needle in the haystack. They don't mean the LLM knows how to use the needle to sew a solution. By moving away from linear, single-pass RAG and embracing cyclic, self-correcting architectures with LangGraph, you empower your LLM to reason, evaluate, and refine its own outputs. In complex, real-world scenarios like CloudOps, this shift from "retrieval-first" to "generation-aware" is the difference between a cool demo and a production-ready enterprise tool.