Abstract / Overview
GraphQA is an open-source project from Catio Tech designed to bridge the gap between graph databases and large language models (LLMs). It enables graph-grounded question answering (QA) — where an AI system retrieves factual context from structured graph data and integrates it with LLM reasoning to deliver accurate, explainable answers.
This developer-focused article explores GraphQA’s architecture, codebase, data flow, and practical implementation using Neo4j, LangChain, and OpenAI’s GPT models. You’ll learn how to build, run, and extend GraphQA for production-ready knowledge-driven AI applications.
Conceptual Background
![graphqa-knowledge-graph-llm-hero]()
The Need for Graph-Based QA
Most Retrieval-Augmented Generation (RAG) systems rely on vector stores such as Pinecone or FAISS. These perform semantic similarity search but do not explicitly model relationships between entities, which are the backbone of complex reasoning.
Graphs, on the other hand, explicitly represent:
Entities (nodes) such as people, companies, or locations.
Relationships (edges) like “works_at”, “founded_by”, or “located_in”.
Integrating this structure with an LLM allows the model to perform multi-hop reasoning — answering questions like:
“Which startups founded by ex-Google employees raised Series B in 2023?”
This kind of query demands logical traversal across entities, not just keyword matching.
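To make the contrast with similarity search concrete, here is a minimal sketch of a multi-hop traversal over a tiny in-memory edge list. The data, the `EDGES` list, and the helper names are illustrative assumptions and not part of GraphQA; in practice the graph database would run the equivalent traversal.

```python
from collections import defaultdict

# Hypothetical toy graph: (source, relationship, target) triples.
EDGES = [
    ("alice", "worked_at", "Google"),
    ("bob", "worked_at", "Acme"),
    ("alice", "founded", "StartupA"),
    ("bob", "founded", "StartupB"),
    ("StartupA", "raised", "Series B 2023"),
    ("StartupB", "raised", "Seed 2022"),
]

def build_index(edges):
    """Index targets by (source, relationship) so each hop is a dict lookup."""
    index = defaultdict(set)
    for source, relation, target in edges:
        index[(source, relation)].add(target)
    return index

def startups_by_alumni(edges, employer, round_name):
    """Hop 1: people who worked at `employer`.
    Hop 2: startups those people founded.
    Hop 3: keep startups that raised `round_name`."""
    index = build_index(edges)
    people = {s for s, r, t in edges if r == "worked_at" and t == employer}
    result = set()
    for person in people:
        for startup in index[(person, "founded")]:
            if round_name in index[(startup, "raised")]:
                result.add(startup)
    return sorted(result)

print(startups_by_alumni(EDGES, "Google", "Series B 2023"))  # ['StartupA']
```

Each hop follows an explicit edge, so the answer can be traced back to the relationships that produced it. A pure similarity search has no equivalent notion of following an edge.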
What GraphQA Does
GraphQA provides:
A unified query interface for graph + text search.
Modular RAG pipelines that merge graph traversals with LLM reasoning.
Tools to embed, index, and retrieve graph data for semantic augmentation.
Architecture Overview
![graphqa-architecture-neo4j-llm]()
Key Features
Multi-hop graph reasoning with Neo4j or Memgraph.
Natural language query translation to Cypher or GQL.
Hybrid retrieval: combines graph-based and vector-based context.
LLM integration via LangChain and OpenAI API.
Explainability: returns both a textual answer and a graph traversal path.
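The hybrid retrieval feature listed above can be pictured as a merge step that combines graph facts with vector-search passages. The sketch below is one possible way to do that; the function name `merge_context`, the `[graph]`/`[text]` tags, and the `top_k` parameter are assumptions for illustration, not GraphQA's documented behavior.

```python
def merge_context(graph_facts, vector_passages, top_k=5):
    """Combine graph facts and vector-search passages into one context string.

    graph_facts: list of strings derived from graph query rows.
    vector_passages: list of (score, text) pairs; a higher score means more similar.
    """
    seen = set()
    lines = []
    # Graph facts go first: they come from an exact traversal, not a similarity guess.
    for fact in graph_facts[:top_k]:
        if fact not in seen:
            seen.add(fact)
            lines.append(f"[graph] {fact}")
    # Then the best-scoring passages, skipping exact duplicates.
    ranked = sorted(vector_passages, key=lambda pair: pair[0], reverse=True)
    for _score, text in ranked[:top_k]:
        if text not in seen:
            seen.add(text)
            lines.append(f"[text] {text}")
    return "\n".join(lines)

facts = ["Demis Hassabis FOUNDED DeepMind"]
passages = [(0.4, "An unrelated passage."), (0.9, "DeepMind is an AI lab.")]
print(merge_context(facts, passages, top_k=1))
```

Tagging each line with its source keeps the provenance visible to the LLM and to anyone inspecting the prompt.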
Step-by-Step Developer Guide
1. Prerequisites
Install dependencies:
git clone https://github.com/catio-tech/graphqa.git
cd graphqa
pip install -r requirements.txt
You’ll need a running Neo4j instance and an OpenAI API key; both are configured in the next step.
2. Setup Environment
Create a .env file in the project root:
OPENAI_API_KEY=YOUR_API_KEY
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=YOUR_PASSWORD
Load environment variables:
from dotenv import load_dotenv
import os
load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")
neo4j_uri = os.getenv("NEO4J_URI")
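Because `os.getenv` silently returns `None` for a missing variable, it can help to fail early with a clear message. The following is a minimal sketch, assuming the four variable names from the .env file above; the helper `load_settings` is hypothetical and would be called after `load_dotenv()`.

```python
import os

REQUIRED_VARS = ("OPENAI_API_KEY", "NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")

def load_settings(environ=None):
    """Return the required settings as a dict, or raise listing what is missing."""
    environ = os.environ if environ is None else environ
    # Treat unset and empty values the same way.
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

Passing a plain dict as `environ` makes the check easy to exercise in tests without touching the real environment.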
3. Graph Data Model
Example: startup ecosystem.
CREATE (g:Company {name:'Google'})
CREATE (a:Person {name:'Sundar Pichai'})-[:WORKS_AT]->(g)
CREATE (b:Startup {name:'DeepMind'})-[:ACQUIRED_BY]->(g)
CREATE (c:Person {name:'Demis Hassabis'})-[:FOUNDED]->(b)
4. Graph Retriever
graph_retriever.py: retrieves entities via Cypher queries.
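from neo4j import GraphDatabase

class GraphRetriever:
    def __init__(self, uri, user, password):
        # One driver per application; it manages its own connection pool.
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def query(self, cypher_query):
        # Run the query in a short-lived session and return rows as a list of dicts.
        with self.driver.session() as session:
            return session.run(cypher_query).data()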
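The retriever above sends the query text as-is and never closes the driver. A possible variant, sketched below, passes values as Cypher parameters (`$name`) and supports `with` blocks for cleanup. The class name `ParameterizedRetriever` is an assumption; the `driver` argument is assumed to be an object created with `GraphDatabase.driver(...)` as shown earlier.

```python
class ParameterizedRetriever:
    """Wraps a Neo4j driver; parameter values travel separately from the query text."""

    def __init__(self, driver):
        # `driver` is assumed to come from neo4j.GraphDatabase.driver(uri, auth=(user, password)).
        self.driver = driver

    def query(self, cypher_query, **params):
        # Keyword arguments become Cypher parameters, e.g. name="Google" fills $name.
        with self.driver.session() as session:
            return session.run(cypher_query, params).data()

    def close(self):
        self.driver.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        # Always release the driver's connections, even if the block raised.
        self.close()

# Usage sketch (requires a running Neo4j instance):
# with ParameterizedRetriever(GraphDatabase.driver(uri, auth=(user, password))) as r:
#     rows = r.query("MATCH (c:Company {name: $name}) RETURN c.name AS name", name="Google")
```

Parameters avoid string concatenation for user-supplied values, which reduces the risk of Cypher injection.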
5. Query Translator (Natural Language → Cypher)
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["question"],
    template="Translate the question into a Cypher query for Neo4j:\nQuestion: {question}\nCypher:"
)

def to_cypher(question):
    llm = ChatOpenAI(temperature=0, model="gpt-4")
    return llm.predict(prompt.format(question=question))
Example Input:
to_cypher("Which startups were founded by people who worked at Google?")
Example Output:
MATCH (p:Person)-[:WORKS_AT]->(:Company {name:'Google'}), (p)-[:FOUNDED]->(s:Startup)
RETURN s.name
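Since the Cypher comes from a model, it is worth checking before it reaches the database. The sketch below strips Markdown code fences from the model output and rejects queries that contain write clauses. The helper names and the clause list are assumptions. The keyword check is deliberately coarse, so it may reject a valid query that merely mentions one of these words in a string literal, and it is no substitute for connecting with a read-only database user.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker

# Clauses that modify the graph; a QA pipeline only needs to read.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|LOAD\s+CSV)\b", re.IGNORECASE
)

def clean_cypher(llm_output):
    """Strip surrounding whitespace and an optional Markdown code fence."""
    text = llm_output.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, which may carry a language tag.
        text = text.split("\n", 1)[1] if "\n" in text else ""
    if text.endswith(FENCE):
        text = text[: -len(FENCE)]
    return text.strip()

def ensure_read_only(cypher):
    """Return the query unchanged, or raise ValueError if it looks unsafe."""
    if WRITE_CLAUSES.search(cypher):
        raise ValueError("Generated Cypher contains a write clause")
    if not re.search(r"\bRETURN\b", cypher, re.IGNORECASE):
        raise ValueError("Generated Cypher has no RETURN clause")
    return cypher
```

A typical call chain would be `ensure_read_only(clean_cypher(to_cypher(question)))`, with the `ValueError` handled by asking the user to rephrase.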
6. Context Combination & Answer Generation
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

# Sample context; in the full pipeline this comes from the graph query results.
context = """
DeepMind was founded by Demis Hassabis, who previously worked with Google.
"""

qa_prompt = PromptTemplate(
    input_variables=["question", "context"],
    template="Answer the question based on the context.\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
)

def answer_question(question, context):
    llm = ChatOpenAI(model="gpt-4", temperature=0.3)
    chain = LLMChain(prompt=qa_prompt, llm=llm)
    return chain.run({"question": question, "context": context})
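The context handed to `answer_question` is a plain string, so query rows need to be rendered as text first. Below is a minimal sketch of one way to do that, assuming rows arrive as a list of dicts (the shape `.data()` returns). The helper `rows_to_context` and its `max_rows` cap are illustrative assumptions.

```python
def rows_to_context(rows, max_rows=20):
    """Render query rows (a list of dicts) as one 'key: value' line per row."""
    lines = []
    for row in rows[:max_rows]:
        parts = [f"{key}: {value}" for key, value in row.items()]
        lines.append("; ".join(parts))
    if len(rows) > max_rows:
        # Tell the model that the context was truncated rather than hiding it.
        lines.append(f"... {len(rows) - max_rows} more rows omitted")
    return "\n".join(lines)

print(rows_to_context([{"founder": "Demis Hassabis", "startup": "DeepMind"}]))
# founder: Demis Hassabis; startup: DeepMind
```

Capping the number of rows keeps the prompt size predictable when a query returns many matches.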
7. Running the Full GraphQA Pipeline
import os
from dotenv import load_dotenv
from graph_retriever import GraphRetriever
from query_translator import to_cypher
from answer_generator import answer_question

load_dotenv()

def graphqa_pipeline(question):
    cypher = to_cypher(question)
    # Read connection settings from .env instead of hardcoding credentials.
    retriever = GraphRetriever(
        os.getenv("NEO4J_URI"), os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD")
    )
    results = retriever.query(cypher)
    context = "\n".join(str(r) for r in results)
    return answer_question(question, context)
Example Run:
print(graphqa_pipeline("Which startups were founded by people who worked at Google?"))
Output:
DeepMind was founded by Demis Hassabis, a former Google researcher.
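The pipeline above returns only the answer text. To support the explainability goal described earlier, a variant could also return the Cypher and the rows it was based on. The sketch below assumes the three steps are passed in as functions (for example `to_cypher`, `retriever.query`, and `answer_question`); the `GraphAnswer` class and `explainable_pipeline` are hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class GraphAnswer:
    answer: str
    cypher: str
    rows: list = field(default_factory=list)

def explainable_pipeline(question, translate, run_query, generate):
    """Run translate -> query -> generate and keep the intermediate evidence.

    translate(question) -> Cypher string
    run_query(cypher) -> list of row dicts
    generate(question, context) -> answer string
    """
    cypher = translate(question)
    rows = run_query(cypher)
    if not rows:
        # Avoid asking the model to answer from an empty context.
        return GraphAnswer("No matching data found in the graph.", cypher, [])
    context = "\n".join(str(row) for row in rows)
    return GraphAnswer(generate(question, context), cypher, rows)

# Stubbed usage, so the flow can be tried without Neo4j or an API key.
result = explainable_pipeline(
    "Who founded DeepMind?",
    translate=lambda q: "MATCH (p:Person)-[:FOUNDED]->(s:Startup {name:'DeepMind'}) RETURN p.name AS founder",
    run_query=lambda cypher: [{"founder": "Demis Hassabis"}],
    generate=lambda q, ctx: "Based on the graph: " + ctx,
)
print(result.cypher)
print(result.answer)
```

Injecting the steps as functions also makes the pipeline testable with stubs, and the early return on empty results avoids a wasted LLM call.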
Use Cases / Scenarios
Enterprise Knowledge Management: Combine structured business graphs with document embeddings for internal Q&A.
Biomedical Research: Connect gene, protein, and drug graphs for biomedical discovery.
Supply Chain Analytics: Trace dependencies and partner relationships across global supply chains.
Financial Intelligence: Analyze relationships between investors, companies, and funding events.
Education: Build explainable tutoring systems grounded in knowledge graphs.
Performance and Scaling
Component | Optimization |
---|---|
Neo4j Queries | Index frequently queried properties (e.g., name on Person and Company nodes). |
LLM Calls | Cache Cypher translations using SQLite or Redis. |
Hybrid Retrieval | Combine the top-5 graph results with vector-search results. |
Parallelization | Use async LLM calls for multi-hop traversal chains. |
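As an illustration of the caching row in the table, here is a minimal sketch of a SQLite-backed cache for Cypher translations, using only the standard library. The class name `CypherCache`, the table layout, and the normalization rule are assumptions. A production cache would likely also need expiry and invalidation when the graph schema changes.

```python
import hashlib
import sqlite3

class CypherCache:
    """Stores question -> Cypher translations in a SQLite table."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS translations (key TEXT PRIMARY KEY, cypher TEXT)"
        )

    @staticmethod
    def _key(question):
        # Normalize case and whitespace so trivially different phrasings share an entry.
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_translate(self, question, translate):
        """Return the cached Cypher, calling `translate(question)` only on a miss."""
        key = self._key(question)
        row = self.conn.execute(
            "SELECT cypher FROM translations WHERE key = ?", (key,)
        ).fetchone()
        if row is not None:
            return row[0]
        cypher = translate(question)
        with self.conn:  # commits on success, rolls back on error
            self.conn.execute(
                "INSERT OR REPLACE INTO translations (key, cypher) VALUES (?, ?)",
                (key, cypher),
            )
        return cypher
```

With the translator from step 5, a call would look like `cache.get_or_translate(question, to_cypher)`; repeated questions then skip the LLM round trip.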
Limitations / Considerations
LLM Query Drift: Generated Cypher may include irrelevant or invalid patterns; mitigate with strict prompt templates.
Data Privacy: Sensitive graph data must be anonymized before LLM calls.
Latency: Graph traversals + LLM API requests may exceed 5s on large datasets.
Cost: Hybrid retrieval pipelines can generate multiple API calls per query.
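For the data privacy point, one common approach is to replace sensitive values with placeholders before the LLM call and restore them in the answer afterwards. The sketch below shows that idea. The `Pseudonymizer` class is hypothetical, and this is pseudonymization rather than full anonymization, since relationships between placeholders can still reveal information.

```python
class Pseudonymizer:
    """Replaces sensitive values with stable placeholders and can reverse the mapping."""

    def __init__(self, prefix="ENTITY"):
        self.prefix = prefix
        self.forward = {}   # real value -> placeholder
        self.backward = {}  # placeholder -> real value

    def mask_value(self, value):
        # The same real value always maps to the same placeholder.
        if value not in self.forward:
            placeholder = f"{self.prefix}_{len(self.forward) + 1}"
            self.forward[value] = placeholder
            self.backward[placeholder] = value
        return self.forward[value]

    def mask_rows(self, rows, sensitive_keys):
        """Return copies of the rows with values under `sensitive_keys` masked."""
        return [
            {k: self.mask_value(v) if k in sensitive_keys else v for k, v in row.items()}
            for row in rows
        ]

    def unmask_text(self, text):
        # Replace longer placeholders first so ENTITY_10 is not clobbered by ENTITY_1.
        for placeholder in sorted(self.backward, key=len, reverse=True):
            text = text.replace(placeholder, self.backward[placeholder])
        return text
```

The masked rows go into the prompt, and `unmask_text` is applied to the model's answer before it is shown to the user.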
Fixes / Troubleshooting
Issue | Cause | Fix |
---|---|---|
Incorrect Cypher generation | Ambiguous natural language | Add few-shot examples or example-driven fine-tuning |
Missing context | Poor graph connectivity | Add relationship edges or use multi-hop expansion |
API key error | Missing .env variable | Set the variable in .env and call load_dotenv() before reading it |
Slow performance | Overly large graph traversal | Bound traversal depth (e.g., [*1..3]) and cap rows with LIMIT |
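For the slow-traversal row, note that `LIMIT` caps the number of returned rows, while a variable-length pattern such as `[*1..2]` is what bounds traversal depth. The sketch below builds a query with both. The function name, the upper bound of 5 hops, and the assumption that nodes carry a `name` property are illustrative choices.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")

def bounded_neighbors_query(label, max_hops=2, limit=25):
    """Build a Cypher query exploring at most `max_hops` relationships from a start node.

    Labels cannot be passed as Cypher parameters, so `label` is validated
    before interpolation; the start node's name stays a parameter ($name).
    """
    if not _IDENTIFIER.match(label):
        raise ValueError(f"Invalid label: {label!r}")
    if not (1 <= int(max_hops) <= 5):
        raise ValueError("max_hops must be between 1 and 5")
    if int(limit) < 1:
        raise ValueError("limit must be positive")
    return (
        f"MATCH path = (start:{label} {{name: $name}})-[*1..{int(max_hops)}]-(other) "
        f"RETURN other, length(path) AS hops "
        f"ORDER BY hops LIMIT {int(limit)}"
    )

print(bounded_neighbors_query("Company", max_hops=2, limit=10))
```

The resulting string is meant to be executed with the name supplied as a query parameter, for example `name="Google"`.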
FAQs
Q1: Can GraphQA work without Neo4j? Yes. You can plug in Memgraph or ArangoDB with minimal code changes by implementing a compatible GraphRetriever.
Q2: Does it support multimodal data? Planned. Future releases will integrate image and document embeddings alongside graph data.
Q3: How does GraphQA differ from LangChain’s Graph Index? GraphQA focuses on end-to-end orchestration — including translation, retrieval, and reasoning — while LangChain Graph Index handles retrieval only.
Q4: Is GraphQA production-ready? Yes. The modular design allows Dockerized deployment, caching, and integration with FastAPI microservices.
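Regarding Q1, a "compatible GraphRetriever" can be read as any object exposing the same `query` method. The sketch below expresses that contract with `typing.Protocol` and includes an in-memory stand-in. The names `Retriever`, `InMemoryRetriever`, and `fetch_context` are assumptions, and a backend with a different query language would also need its own query translation step.

```python
from typing import Any, Dict, List, Protocol

class Retriever(Protocol):
    """Structural interface: anything with this `query` method qualifies."""

    def query(self, cypher_query: str) -> List[Dict[str, Any]]: ...

class InMemoryRetriever:
    """Stand-in backend that returns canned rows; useful for tests and demos."""

    def __init__(self, canned: Dict[str, List[Dict[str, Any]]]):
        self.canned = canned

    def query(self, cypher_query: str) -> List[Dict[str, Any]]:
        return self.canned.get(cypher_query, [])

def fetch_context(retriever: Retriever, cypher_query: str) -> str:
    """Works with any backend that satisfies the Retriever protocol."""
    return "\n".join(str(row) for row in retriever.query(cypher_query))

demo = InMemoryRetriever({"MATCH (s:Startup) RETURN s.name AS name": [{"name": "DeepMind"}]})
print(fetch_context(demo, "MATCH (s:Startup) RETURN s.name AS name"))
```

Because the protocol is structural, the existing `GraphRetriever` class already satisfies it without inheriting from anything.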
Conclusion
GraphQA exemplifies the next evolution of Retrieval-Augmented Generation — from text-based to graph-grounded reasoning. By connecting LLMs to structured knowledge graphs, developers can build factually accurate, explainable, and scalable AI systems capable of complex multi-hop inference.
For developers aiming to build trustworthy AI, GraphQA represents a blueprint: blending graph databases, natural language understanding, and generative reasoning into a unified, open framework.