GraphQA: Building Graph-Aware Question Answering Systems with LLMs

Rohit Gupta
Oct 12

Abstract / Overview

GraphQA is an open-source project from Catio Tech designed to bridge the gap between graph databases and large language models (LLMs). It enables graph-grounded question answering (QA) — where an AI system retrieves factual context from structured graph data and integrates it with LLM reasoning to deliver accurate, explainable answers.

This developer-focused article explores GraphQA’s architecture, codebase, data flow, and practical implementation using Neo4j, LangChain, and OpenAI’s GPT models. You’ll learn how to build, run, and extend GraphQA for production-ready knowledge-driven AI applications.

Conceptual Background

The Need for Graph-Based QA

Most Retrieval-Augmented Generation (RAG) systems rely on vector stores or similarity-search libraries such as Pinecone or FAISS. These perform semantic similarity search well, but they do not explicitly model the relationships between entities that complex reasoning depends on.

Graphs, on the other hand, explicitly represent:

Entities (nodes) such as people, companies, or locations.
Relationships (edges) like “works_at”, “founded_by”, or “located_in”.

Integrating this structure with an LLM allows the model to perform multi-hop reasoning — answering questions like:

“Which startups founded by ex-Google employees raised Series B in 2023?”

This kind of query requires traversing several relationships between entities (employment, founding, funding), which similarity matching alone cannot do.
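
To make the multi-hop idea concrete, here is a minimal sketch in plain Python that answers a question of the same shape over a tiny in-memory edge list instead of a graph database. The people, companies, relationship names, and the `startups_by_ex_employees` helper are all made up for illustration; none of them come from GraphQA.

```python
# Toy edge list: (source, relationship, target). All data is invented for illustration.
EDGES = [
    ("alice", "worked_at", "Google"),
    ("bob", "worked_at", "Initech"),
    ("alice", "founded", "AcmeAI"),
    ("bob", "founded", "WidgetCo"),
    ("AcmeAI", "raised", "Series B 2023"),
    ("WidgetCo", "raised", "Series A 2022"),
]

def targets(source, relation):
    """Nodes reachable from `source` over one outgoing `relation` edge."""
    return {t for s, r, t in EDGES if s == source and r == relation}

def sources(relation, target):
    """Nodes that have a `relation` edge pointing at `target`."""
    return {s for s, r, t in EDGES if r == relation and t == target}

def startups_by_ex_employees(company, funding_round):
    """Three hops: company <- person -> startup -> funding round."""
    result = set()
    for person in sources("worked_at", company):             # hop 1
        for startup in targets(person, "founded"):           # hop 2
            if funding_round in targets(startup, "raised"):  # hop 3
                result.add(startup)
    return sorted(result)

print(startups_by_ex_employees("Google", "Series B 2023"))  # ['AcmeAI']
```

Each hop narrows the candidate set by following one relationship. A graph database performs the same traversal over indexed edges, which is the step a nearest-neighbour lookup over embeddings does not perform.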

What GraphQA Does

GraphQA provides:

A unified query interface for graph + text search.
Modular RAG pipelines that merge graph traversals with LLM reasoning.
Tools to embed, index, and retrieve graph data for semantic augmentation.

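Architecture Overview

At a high level, a question flows through three stages: translation from natural language to Cypher, retrieval from the graph database, and answer generation by the LLM using the retrieved context.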

Key Features

Multi-hop graph reasoning with Neo4j or Memgraph.
Natural language query translation to Cypher or GQL.
Hybrid retrieval: combines graph-based and vector-based context.
LLM integration via LangChain and OpenAI API.
Explainability: returns both a textual answer and a graph traversal path.
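
As a rough illustration of the hybrid retrieval feature, the sketch below merges rows from a graph query with passages from a vector search into a single prompt context. The function name `build_hybrid_context`, the `(score, text)` shape of the vector hits, and the section labels are assumptions made for this example, not GraphQA's actual interface.

```python
def build_hybrid_context(graph_rows, vector_hits, top_k=3):
    """Merge graph facts and vector-search passages into one prompt context.

    graph_rows:  list of dicts, e.g. rows returned by a Cypher query.
    vector_hits: list of (score, text) pairs from any vector store,
                 where a higher score means a closer match (assumed).
    """
    graph_lines = [
        ", ".join(f"{key}={value}" for key, value in row.items())
        for row in graph_rows
    ]
    # Keep the best-scoring passages, skipping exact duplicates.
    seen, passages = set(), []
    for _score, text in sorted(vector_hits, key=lambda hit: hit[0], reverse=True):
        if len(passages) >= top_k:
            break
        if text not in seen:
            seen.add(text)
            passages.append(text)
    return (
        "Graph facts:\n" + "\n".join(graph_lines)
        + "\n\nRelated passages:\n" + "\n".join(passages)
    )
```

Keeping the two sources under separate labels lets the answer prompt tell structured facts apart from free-text passages.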

Step-by-Step Developer Guide

1. Prerequisites

Clone the repository and install its dependencies:

git clone https://github.com/catio-tech/graphqa.git
cd graphqa
pip install -r requirements.txt

You’ll need:

Python ≥ 3.10
OpenAI API key
Running Neo4j database (local or cloud)

2. Setup Environment

Create a .env file in the project root:

OPENAI_API_KEY=YOUR_API_KEY
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=YOUR_PASSWORD

Load environment variables:

from dotenv import load_dotenv
import os

load_dotenv()
api_key = os.getenv("OPENAI_API_KEY")
neo4j_uri = os.getenv("NEO4J_URI")
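
Because `os.getenv` silently returns `None` for a missing variable, it can help to fail fast at startup. The following is a minimal sketch; the `load_settings` helper and the `REQUIRED_VARS` tuple are hypothetical names introduced here, with the variable names taken from the .env file above.

```python
import os

# Variable names match the .env example; the helper itself is hypothetical.
REQUIRED_VARS = ("OPENAI_API_KEY", "NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")

def load_settings(environ=os.environ):
    """Return the required settings as a dict, or raise naming every missing variable."""
    missing = [name for name in REQUIRED_VARS if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in REQUIRED_VARS}
```

Call `load_settings()` once after `load_dotenv()`; passing a plain dict instead of `os.environ` makes the check easy to unit test.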

3. Graph Data Model

Example: startup ecosystem.

CREATE (g:Company {name:'Google'})
CREATE (a:Person {name:'Sundar Pichai'})-[:WORKS_AT]->(g)
CREATE (b:Startup {name:'DeepMind'})-[:ACQUIRED_BY]->(g)
CREATE (c:Person {name:'Demis Hassabis'})-[:FOUNDED]->(b)

4. Graph Retriever

graph_retriever.py: retrieves entities via Cypher queries.

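from neo4j import GraphDatabase

class GraphRetriever:
    def __init__(self, uri, user, password):
        # One driver per application; it manages its own connection pool.
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def query(self, cypher_query, parameters=None):
        # Run the query in a short-lived session and return the rows as dicts.
        with self.driver.session() as session:
            return session.run(cypher_query, parameters).data()

    def close(self):
        # Release pooled connections when the application shuts down.
        self.driver.close()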
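
When a value comes from user input, it is generally safer to pass it as a query parameter than to build it into the Cypher string. The sketch below assumes `driver` is a Neo4j driver object like the one created in the class above; the `find_founders` function is a hypothetical helper, and the labels follow the example data model used in this article.

```python
def find_founders(driver, startup_name):
    """Look up the founders of one startup using a query parameter ($name)."""
    cypher = (
        "MATCH (p:Person)-[:FOUNDED]->(s:Startup {name: $name}) "
        "RETURN p.name AS founder"
    )
    with driver.session() as session:
        # The value travels separately from the query text, so it cannot
        # change the structure of the query.
        result = session.run(cypher, {"name": startup_name})
        return [record["founder"] for record in result]
```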

5. Query Translator (Natural Language → Cypher)

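# query_translator.py
# Note: newer LangChain releases move these imports to langchain_openai and
# langchain_core.prompts; use whichever matches your installed version.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

prompt = PromptTemplate(
    input_variables=["question"],
    template="Translate the question into a Cypher query for Neo4j:\nQuestion: {question}\nCypher:"
)

def to_cypher(question):
    # temperature=0 keeps the translation as repeatable as possible.
    llm = ChatOpenAI(temperature=0, model="gpt-4")
    return llm.predict(prompt.format(question=question)).strip()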

Example Input:

to_cypher("Which startups were founded by people who worked at Google?")

Example Output:

MATCH (p:Person)-[:WORKS_AT]->(:Company {name:'Google'})
MATCH (p)-[:FOUNDED]->(s:Startup)
RETURN s.name
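
LLM output is not guaranteed to be a bare, read-only query, so it is worth checking before it reaches the database. The following is a minimal sketch of such a guard; `clean_cypher` and `WRITE_CLAUSES` are hypothetical names, and the keyword check is deliberately conservative (it can reject a harmless query whose text happens to contain a word like SET).

```python
import re

FENCE = "`" * 3  # Markdown code fence that chat models often wrap code in

# Clauses that modify the graph; a question-answering query should not need them.
WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP|LOAD\s+CSV)\b", re.IGNORECASE
)
READ_START = re.compile(r"(MATCH|OPTIONAL\s+MATCH|WITH|UNWIND|RETURN)\b", re.IGNORECASE)

def clean_cypher(llm_output):
    """Strip a surrounding Markdown fence and reject anything that is not read-only."""
    text = llm_output.strip()
    fenced = re.search(
        re.escape(FENCE) + r"(?:[A-Za-z]+[ \t]*\n)?(.*?)" + re.escape(FENCE),
        text,
        re.DOTALL,
    )
    if fenced:
        text = fenced.group(1).strip()
    if WRITE_CLAUSES.search(text):
        raise ValueError("Generated Cypher contains a write clause")
    if not READ_START.match(text):
        raise ValueError("Generated Cypher does not start with a read clause")
    return text
```

A guard like this would sit between the translator and the retriever. Connecting with a read-only database user is a stronger safeguard and can be combined with it.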

6. Context Combination & Answer Generation

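# answer_generator.py
from langchain.chains import LLMChain
from langchain.chat_models import ChatOpenAI
from langchain.prompts import PromptTemplate

# Example context; in the full pipeline this comes from the graph retriever.
context = """
DeepMind was founded by Demis Hassabis, who previously worked with Google.
"""
qa_prompt = PromptTemplate(
    input_variables=["question", "context"],
    template="Answer the question based on the context.\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"
)

def answer_question(question, context):
    # A low temperature keeps the answer close to the supplied context.
    llm = ChatOpenAI(model="gpt-4", temperature=0.3)
    chain = LLMChain(prompt=qa_prompt, llm=llm)
    return chain.run({"question": question, "context": context})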

7. Running the Full GraphQA Pipeline

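import os
from dotenv import load_dotenv
from graph_retriever import GraphRetriever
from query_translator import to_cypher
from answer_generator import answer_question

load_dotenv()  # read the Neo4j settings from .env

def graphqa_pipeline(question):
    # 1. Translate the question into Cypher.
    cypher = to_cypher(question)
    # 2. Run it against Neo4j with credentials from the environment.
    retriever = GraphRetriever(
        os.getenv("NEO4J_URI"), os.getenv("NEO4J_USERNAME"), os.getenv("NEO4J_PASSWORD")
    )
    results = retriever.query(cypher)
    # 3. Flatten the rows into text context and generate the answer.
    context = "\n".join(str(r) for r in results)
    return answer_question(question, context)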

Example Run:

print(graphqa_pipeline("Which startups were founded by people who worked at Google?"))

Output (illustrative; the exact wording depends on the graph contents and the model):

DeepMind was founded by Demis Hassabis, a former Google researcher.
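
If the generated Cypher returns no rows, the pipeline above still calls the LLM with an empty context, which invites a made-up answer. One way to handle this is sketched below; `rows_to_context` and `answer_or_fallback` are hypothetical helpers, and `answer_fn` stands in for any function with the same signature as `answer_question(question, context)`.

```python
def rows_to_context(rows, max_rows=20):
    """Turn Cypher result rows (a list of dicts) into compact text for the prompt."""
    lines = []
    for row in rows[:max_rows]:
        lines.append("; ".join(f"{key}: {value}" for key, value in row.items()))
    return "\n".join(lines)

def answer_or_fallback(question, rows, answer_fn):
    """Call answer_fn(question, context) only when the graph returned something."""
    if not rows:
        # Skip the LLM call entirely rather than answer from no evidence.
        return "No matching data was found in the graph for this question."
    return answer_fn(question, rows_to_context(rows))
```

Capping the number of rows also keeps the prompt size predictable when a query matches more of the graph than expected.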

Use Cases / Scenarios

Enterprise Knowledge Management: Combine structured business graphs with document embeddings for internal Q&A.
Biomedical Research: Connect gene, protein, and drug graphs for biomedical discovery.
Supply Chain Analytics: Trace dependencies and partner relationships across global supply chains.
Financial Intelligence: Analyze relationships between investors, companies, and funding events.
Education: Build explainable tutoring systems grounded in knowledge graphs.

Performance and Scaling

Component	Optimization
Neo4j Queries	Index frequently queried properties (e.g., Person.name, Company.name).
LLM Calls	Cache Cypher translations using SQLite or Redis.
Hybrid Retrieval	Combine the top-5 graph results with vector-search results.
Parallelization	Use async LLM calls for multi-hop traversal chains.
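
The caching row in the table can be sketched with Python's built-in `sqlite3` module. This is a minimal example under stated assumptions: the `CypherCache` class, its table name, and the whitespace and case normalisation of the cache key are choices made here for illustration, and `translate` stands for any function such as `to_cypher`.

```python
import sqlite3

class CypherCache:
    """Cache question -> Cypher translations in a SQLite table."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS cypher_cache "
            "(question TEXT PRIMARY KEY, cypher TEXT)"
        )

    def get_or_translate(self, question, translate):
        """Return the cached Cypher, calling translate(question) only on a miss."""
        key = " ".join(question.lower().split())  # normalise case and whitespace
        row = self.conn.execute(
            "SELECT cypher FROM cypher_cache WHERE question = ?", (key,)
        ).fetchone()
        if row:
            return row[0]
        cypher = translate(question)
        self.conn.execute("INSERT INTO cypher_cache VALUES (?, ?)", (key, cypher))
        self.conn.commit()
        return cypher
```

Usage would look like `cache.get_or_translate(question, to_cypher)`. Pass a file path instead of `":memory:"` to keep the cache across restarts, and clear it whenever the graph schema or the translation prompt changes.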

Limitations / Considerations

LLM Query Drift: Generated Cypher may include irrelevant patterns, or labels and relationship types that do not exist in the graph. Mitigate with strict prompt templates that state the schema.
Data Privacy: Sensitive graph data must be anonymized before LLM calls.
Latency: Graph traversals + LLM API requests may exceed 5s on large datasets.
Cost: Hybrid retrieval pipelines can generate multiple API calls per query.
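
One lightweight way to catch query drift is to compare the labels and relationship types in the generated Cypher against the known schema before running it. The sketch below assumes the schema of the example startup graph in this article; `schema_violations` is a hypothetical helper built on simple regular expressions, not a Cypher parser, so it only covers basic patterns such as `(p:Person)` and `[:FOUNDED]`.

```python
import re

# Assumed schema, taken from the example data model in this article.
ALLOWED_LABELS = {"Person", "Company", "Startup"}
ALLOWED_RELATIONSHIPS = {"WORKS_AT", "FOUNDED", "ACQUIRED_BY"}

def schema_violations(cypher):
    """Return labels and relationship types in `cypher` that are not in the schema."""
    # Relationship types look like [:TYPE] or [r:TYPE]; labels like (n:Label) or (:Label).
    rel_types = set(re.findall(r"\[\s*\w*\s*:\s*(\w+)", cypher))
    labels = set(re.findall(r"\(\s*\w*\s*:\s*(\w+)", cypher))
    return sorted((rel_types - ALLOWED_RELATIONSHIPS) | (labels - ALLOWED_LABELS))
```

A non-empty result, for example a `WORKED_AT` relationship when the graph only defines `WORKS_AT`, can trigger a retry with the schema repeated in the prompt instead of an empty answer.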

Fixes / Troubleshooting

Issue	Cause	Fix
Incorrect Cypher generation	Ambiguous natural language	Add few-shot question/Cypher examples and the graph schema to the prompt
Missing context	Poor graph connectivity	Add relationship edges or use multi-hop expansion
API key error	Missing .env variable	Add the variable to .env and call load_dotenv() before reading it
Slow performance	Overly large graph traversal	Bound path length (e.g., [*1..3]) and cap rows with LIMIT
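
For the slow-performance row, note that LIMIT caps the number of returned rows, while traversal depth is bounded by a variable-length pattern such as `[*1..3]`. Here is a minimal sketch of a query builder that applies both; `neighbourhood_query` is a hypothetical helper, and the upper bound of 5 hops is an arbitrary choice for this example.

```python
def neighbourhood_query(max_depth=2, limit=25):
    """Build a Cypher query exploring at most `max_depth` hops from a named node."""
    # The bounds are validated as integers before being formatted into the
    # query text; the node name itself stays a query parameter ($name).
    if not (isinstance(max_depth, int) and 1 <= max_depth <= 5):
        raise ValueError("max_depth must be an integer between 1 and 5")
    if not (isinstance(limit, int) and limit > 0):
        raise ValueError("limit must be a positive integer")
    return (
        f"MATCH path = (start {{name: $name}})-[*1..{max_depth}]-(other) "
        f"RETURN DISTINCT other.name AS name, length(path) AS hops "
        f"ORDER BY hops LIMIT {limit}"
    )

print(neighbourhood_query(max_depth=3, limit=5))
```

The resulting string can be run with a parameter dictionary such as `{"name": "Google"}`.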

FAQs

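Q1: Can GraphQA work without Neo4j? Yes. You can plug in another graph database by implementing a compatible GraphRetriever. Memgraph also speaks Cypher, so the change is mostly in the connection settings; ArangoDB uses its own query language (AQL), so the query translation prompt has to change as well.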

Q2: Does it support multimodal data? Planned. Future releases will integrate image and document embeddings alongside graph data.

Q3: How does GraphQA differ from LangChain’s Graph Index? GraphQA focuses on end-to-end orchestration — including translation, retrieval, and reasoning — while LangChain Graph Index handles retrieval only.

Q4: Is GraphQA production-ready? Yes. The modular design allows Dockerized deployment, caching, and integration with FastAPI microservices.
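
To illustrate what a "compatible GraphRetriever" from Q1 could look like, the sketch below defines the expected interface with `typing.Protocol` and provides a stand-in backend that needs no database. The names `Retriever`, `StaticRetriever`, and `fetch_context` are hypothetical; the only thing taken from this article is the `query(cypher_query)` method that returns a list of row dicts.

```python
from typing import Any, Protocol

class Retriever(Protocol):
    """Anything with a query(cypher_query) method returning a list of row dicts."""
    def query(self, cypher_query: str) -> list[dict[str, Any]]: ...

class StaticRetriever:
    """Stand-in backend that returns canned rows; useful for tests without a database."""
    def __init__(self, rows: list[dict[str, Any]]):
        self.rows = rows

    def query(self, cypher_query: str) -> list[dict[str, Any]]:
        return list(self.rows)

def fetch_context(retriever: Retriever, cypher_query: str) -> str:
    """Works with any object that satisfies the Retriever protocol."""
    return "\n".join(str(row) for row in retriever.query(cypher_query))

print(fetch_context(StaticRetriever([{"name": "DeepMind"}]), "MATCH (s:Startup) RETURN s.name"))
```

A Memgraph or ArangoDB backend would implement the same `query` method, so the rest of the pipeline would not need to know which database is behind it.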

Conclusion

GraphQA exemplifies the next evolution of Retrieval-Augmented Generation — from text-based to graph-grounded reasoning. By connecting LLMs to structured knowledge graphs, developers can build factually accurate, explainable, and scalable AI systems capable of complex multi-hop inference.

For developers aiming to build trustworthy AI, GraphQA represents a blueprint: blending graph databases, natural language understanding, and generative reasoning into a unified, open framework.