AI-powered search is exploding right now, and Retrieval-Augmented Generation (RAG) is at the heart of it. From chatbots that understand your company policies to internal knowledge assistants, RAG lets you query private documents in natural language and get answers grounded in the text of those documents.
In this article, I'll walk you through how I built a document RAG application that runs on your own machine and indexes local PDFs, using:
Node.js (Express)
Supabase (Postgres + pgvector)
OpenAI Embeddings + Chat Models
pdf-parse (PDF text extraction)
Let's break it down.
What We're Building
Imagine placing a PDF like Policies.pdf inside a folder.
Now you want to ask:
"What is the leave policy?" "How many casual leaves are allowed?" "What is the work-from-home guideline?"
This RAG system:
Reads the PDF
Extracts text
Breaks text into overlapping chunks
Converts chunks into embeddings
Stores them in Supabase pgvector
Uses semantic search to find relevant chunks
Generates accurate answers using OpenAI
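All of this is driven from a server on your local machine. Text chunks are sent to OpenAI for embedding and answering, and they are stored in your Supabase project.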
Architecture Overview
Here's how the flow works end-to-end:
MyDocs/ PDFs
↓
pdf-parse extracts text
↓
Chunking (1000 chars + 200 overlap)
↓
OpenAI Embeddings (1536-d vectors)
↓
Supabase (pgvector)
↓
Semantic Search (match_documents RPC)
↓
LLM (GPT-4.1-mini or any model)
↓
Answer to User
Simple, modular, and scalable.
Folder Structure
Document-RAG-App/
│
├── MyDocs/
│   └── Policies.pdf     # Local documents
│
├── index.js             # Main RAG backend
├── package.json
├── .env                 # API keys
└── README.md
Just drop your PDFs into MyDocs/ and hit the indexing API.
Setting Up the Project
Step 1: Install Dependencies
npm install express cors dotenv openai @supabase/supabase-js pdf-parse
Step 2: Environment Variables
Create .env:
OPENAI_API_KEY=your_openai_key
SUPABASE_URL=https://your-project-url.supabase.co
SUPABASE_ANON_KEY=your_anon_key
PORT=3000
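A missing key usually only shows up later as a confusing API error, so it can help to check the variables at startup. This is a minimal sketch and not code from the repository: `missingEnv` and `REQUIRED_ENV` are hypothetical names, and it assumes `dotenv` has already loaded `.env` into `process.env`.

```javascript
// Hypothetical startup check; assumes require("dotenv").config() has already run.
const REQUIRED_ENV = ["OPENAI_API_KEY", "SUPABASE_URL", "SUPABASE_ANON_KEY"];

// Returns the names of required variables that are unset or blank.
function missingEnv(env, names = REQUIRED_ENV) {
  return names.filter((name) => !env[name] || env[name].trim() === "");
}

const missing = missingEnv(process.env);
if (missing.length > 0) {
  console.warn(`Missing environment variables: ${missing.join(", ")}`);
}

// PORT is optional; fall back to 3000, the value used in the .env example.
const port = Number(process.env.PORT) || 3000;
```

Whether to warn or exit when something is missing is a design choice; warning keeps the sketch harmless to run.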
Supabase Setup (Vector DB)
Supabase provides Postgres + pgvector, a perfect fit for RAG apps.
1. Enable the vector extension
create extension if not exists vector;
2. Create the document table
create table if not exists "MyPDFDocuments" (
  id bigserial primary key,
  content text,
  embedding vector(1536),
  title text,
  source text,
  path text,
  created_at timestamptz default now()
);
3. Create a vector index (recommended)
create index if not exists mypdfdocuments_embedding_idx
on "MyPDFDocuments"
using ivfflat (embedding vector_cosine_ops)
with (lists = 100);
4. Add the semantic search RPC function
create or replace function match_documents(
  query_embedding vector(1536),
  match_threshold float,
  match_count int
)
returns table (
  id bigint,
  content text,
  similarity float
)
language plpgsql
as $$
begin
  return query
  select
    d.id,
    d.content,
    1 - (d.embedding <=> query_embedding) as similarity
  from "MyPDFDocuments" d
  where 1 - (d.embedding <=> query_embedding) > match_threshold
  -- order by distance ascending (same ranking as similarity descending)
  -- so the ivfflat index can be used
  order by d.embedding <=> query_embedding
  limit match_count;
end;
$$;
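From Node.js, a function like this is typically called through the `rpc` method of supabase-js. The sketch below shows one way to wrap that call. `searchChunks` is a hypothetical helper, the default threshold and count are assumptions, and the client is passed in as a parameter so the function can be exercised without a real project.

```javascript
// Sketch: `supabase` is a client created with createClient(SUPABASE_URL, SUPABASE_ANON_KEY)
// from @supabase/supabase-js. Defaults (0.5, 5) are assumptions for illustration.
async function searchChunks(supabase, queryEmbedding, { threshold = 0.5, count = 5 } = {}) {
  const { data, error } = await supabase.rpc("match_documents", {
    query_embedding: queryEmbedding, // array of 1536 numbers
    match_threshold: threshold,      // minimum cosine similarity to keep
    match_count: count,              // maximum number of rows returned
  });
  if (error) throw new Error(`match_documents failed: ${error.message}`);
  return data ?? []; // rows shaped like { id, content, similarity }
}
```

The argument names must match the SQL parameter names exactly, since supabase-js passes them through as named arguments.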
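Indexing Local PDF Files
The app exposes one endpoint for indexing:
POST /index-docs
This reads each PDF in MyDocs/, extracts its text, splits it into overlapping chunks, embeds each chunk, and stores the results in Supabase.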
Example
curl -X POST http://localhost:3000/index-docs
You'll see logs like:
Indexing PDF: MyDocs/Policies.pdf
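To make the indexing flow concrete, here is a rough sketch of the glue such an endpoint could run. It is not the repository's implementation: `indexFolder` and the four injected steps are hypothetical names, and the `source: "local"` value is a placeholder I chose for illustration.

```javascript
const fs = require("node:fs");
const path = require("node:path");

// Hypothetical glue for POST /index-docs. The four steps are injected so the
// sketch does not assume how they are implemented:
//   extractText(filePath) -> Promise<string>
//   chunk(text)           -> string[]
//   embed(chunks)         -> Promise<number[][]>
//   store(rows)           -> Promise<void>
async function indexFolder(dir, { extractText, chunk, embed, store }) {
  const pdfs = fs.readdirSync(dir).filter((f) => f.toLowerCase().endsWith(".pdf"));
  let total = 0;
  for (const file of pdfs) {
    const filePath = path.join(dir, file);
    console.log(`Indexing PDF: ${filePath}`);
    const chunks = chunk(await extractText(filePath));
    if (chunks.length === 0) continue; // e.g. a scanned PDF with no text layer
    const embeddings = await embed(chunks);
    const rows = chunks.map((content, i) => ({
      content,
      embedding: embeddings[i],
      title: file,
      source: "local", // placeholder value, an assumption of this sketch
      path: filePath,
    }));
    await store(rows);
    total += rows.length;
  }
  return { files: pdfs.length, chunks: total };
}
```

An Express route could call `indexFolder("MyDocs", steps)` and return the counts as JSON.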
Ask Anything With Natural Language
POST /query
Request body:
{"query": "What is the leave policy?"}
Example query:
curl -X POST http://localhost:3000/query \
-H "Content-Type: application/json" \
-d '{"query": "What is the leave policy?"}'
The app:
Embeds your question
Finds the most relevant chunks from Supabase
Sends those chunks as context to OpenAI
Returns a clean, helpful answer
Response
{"answer": "According to the company leave policy..."}
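The request and response shapes above map naturally onto a small Express-style handler. The sketch below is one possible shape, not the repository's code: `makeQueryHandler` and the injected `embed`, `search`, and `generate` functions are hypothetical, and the error messages are my own wording.

```javascript
// Hypothetical handler for POST /query. The three steps are injected:
//   embed(question)            -> Promise<number[]>
//   search(embedding)          -> Promise<Array<{ content: string }>>
//   generate(question, chunks) -> Promise<string>
function makeQueryHandler({ embed, search, generate }) {
  return async function handleQuery(req, res) {
    const query = req.body?.query;
    if (typeof query !== "string" || query.trim() === "") {
      return res.status(400).json({ error: 'Body must include a non-empty "query" string' });
    }
    try {
      const embedding = await embed(query);
      const chunks = await search(embedding);
      if (chunks.length === 0) {
        return res.json({ answer: "No relevant content was found in the indexed documents." });
      }
      const answer = await generate(query, chunks);
      return res.json({ answer });
    } catch (err) {
      console.error(err);
      return res.status(500).json({ error: "Failed to answer the query" });
    }
  };
}
```

With Express this would be registered as `app.post("/query", makeQueryHandler(steps))`, assuming the `express.json()` middleware is enabled so that `req.body` is populated.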
Understanding the RAG Pipeline
1. PDF Extraction (pdf-parse)
Extracts raw text from each page of the PDF.
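As a rough illustration of this step, the sketch below reads a file and hands the buffer to a parser. It assumes the pdf-parse 1.x API, where the module exports a function that takes a Buffer and resolves to an object with a `text` field. `extractPdfText` is a hypothetical name, and both the whitespace cleanup and passing the parser in as a parameter are choices of this sketch.

```javascript
const { readFile } = require("node:fs/promises");

// Sketch: `parsePdf` is expected to behave like the function exported by
// pdf-parse 1.x (const parsePdf = require("pdf-parse")).
async function extractPdfText(filePath, parsePdf) {
  const buffer = await readFile(filePath);
  const { text } = await parsePdf(buffer);
  // Collapse runs of spaces/tabs and excess blank lines left by the PDF layout.
  const cleaned = text.replace(/[ \t]+/g, " ").replace(/\n{3,}/g, "\n\n").trim();
  if (cleaned === "") {
    throw new Error(`No text extracted from ${filePath}; it may be a scanned PDF`);
  }
  return cleaned;
}
```

Failing loudly on empty text makes the scanned-PDF case visible at indexing time instead of surfacing later as empty search results.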
2. Chunking
Chunks of:
1000 characters
200 character overlap
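The overlap keeps text near a chunk boundary together with some of its surrounding context in at least one chunk, so a passage split across two chunks is less likely to be missed during vector search.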
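A minimal sketch of this kind of chunking, assuming plain character offsets with no attempt to respect sentence boundaries. `chunkText` is a hypothetical name, and the defaults mirror the 1000/200 values above.

```javascript
// Fixed-size chunking with overlap, slicing on raw character offsets.
function chunkText(text, size = 1000, overlap = 200) {
  if (overlap >= size) throw new RangeError("overlap must be smaller than size");
  const step = size - overlap; // with the defaults, each chunk starts 800 characters after the previous one
  const chunks = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // this chunk already reached the end
  }
  return chunks;
}
```

For a 2000-character document this gives three chunks starting at offsets 0, 800, and 1600, and the last 200 characters of one chunk are the first 200 of the next.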
3. Embeddings
Uses:
text-embedding-3-small
1536 dimensions
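Here is a hedged sketch of how chunks could be embedded in one batch with the official `openai` package. `embedTexts` is a hypothetical helper and the client is passed in as a parameter. The dimension check is an extra safeguard I added so a mismatch is caught before it reaches the `vector(1536)` column.

```javascript
// Sketch: `openai` is a client from the official `openai` package
// (new OpenAI({ apiKey: process.env.OPENAI_API_KEY })), passed in as a parameter.
const EMBEDDING_MODEL = "text-embedding-3-small";
const EMBEDDING_DIM = 1536;

async function embedTexts(openai, texts) {
  const response = await openai.embeddings.create({
    model: EMBEDDING_MODEL,
    input: texts, // a batch of strings; one vector comes back per string
  });
  // Sort by `index` so vectors line up with the input order.
  const vectors = [...response.data]
    .sort((a, b) => a.index - b.index)
    .map((item) => item.embedding);
  for (const v of vectors) {
    if (v.length !== EMBEDDING_DIM) {
      throw new Error(`Expected ${EMBEDDING_DIM} dimensions, got ${v.length}`);
    }
  }
  return vectors;
}
```

Batching many chunks into a single request reduces round trips. Very large documents may still need to be split across several requests.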
4. Vector Storage
Each chunk becomes a vector row in Supabase.
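One way to write those rows with supabase-js is sketched below. `storeChunks` is a hypothetical helper, the batch size of 100 is an arbitrary assumption, and each row is expected to carry the columns of the `MyPDFDocuments` table (`content`, `embedding`, `title`, `source`, `path`).

```javascript
// Sketch: `supabase` is a @supabase/supabase-js client passed in as a parameter.
async function storeChunks(supabase, rows, batchSize = 100) {
  for (let i = 0; i < rows.length; i += batchSize) {
    const batch = rows.slice(i, i + batchSize);
    // The embedding is sent as a plain array of numbers.
    const { error } = await supabase.from("MyPDFDocuments").insert(batch);
    if (error) throw new Error(`Insert failed: ${error.message}`);
  }
  return rows.length; // number of rows written
}
```

Inserting in batches keeps each request body small, since every row includes a 1536-number vector.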
5. Semantic Search
Cosine similarity picks the most relevant chunks.
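To see what the database is computing, here is the same ranking written out in plain JavaScript. In pgvector, `<=>` is cosine distance, so `1 - distance` is the cosine similarity used by `match_documents`. The helper names and default values below are hypothetical, and this in-memory version is only for illustration; the real search runs inside Postgres.

```javascript
// Cosine similarity: dot product divided by the product of the vector lengths.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) throw new Error("Vectors must have the same length");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Mirrors the SQL: keep chunks above the threshold, sort by similarity
// (highest first), and return at most `count` of them.
function topMatches(queryVec, chunks, threshold = 0.5, count = 5) {
  return chunks
    .map((c) => ({ id: c.id, content: c.content, similarity: cosineSimilarity(queryVec, c.embedding) }))
    .filter((c) => c.similarity > threshold)
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, count);
}
```

Because cosine similarity ignores vector length and compares direction only, a short question can still match a much longer chunk on the same topic.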
6. Final Answer
The LLM generates an answer grounded in those chunks.
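The last step mostly comes down to how the retrieved chunks are packed into the prompt. The sketch below shows one plausible layout using the chat completions API of the official `openai` package. The helper names, the system prompt wording, and the temperature are all assumptions of mine, not the repository's settings.

```javascript
// Builds chat messages from the retrieved chunks. Prompt wording is an assumption.
function buildMessages(question, chunks) {
  const context = chunks.map((c, i) => `[${i + 1}] ${c.content}`).join("\n\n");
  return [
    {
      role: "system",
      content: "Answer using only the provided context. If the context does not contain the answer, say so.",
    },
    { role: "user", content: `Context:\n${context}\n\nQuestion: ${question}` },
  ];
}

// Sketch: `openai` is a client from the official `openai` package, passed in.
async function generateAnswer(openai, question, chunks, model = "gpt-4.1-mini") {
  const completion = await openai.chat.completions.create({
    model,
    messages: buildMessages(question, chunks),
    temperature: 0.2, // assumption: a low temperature for factual answers
  });
  return completion.choices[0].message.content;
}
```

Telling the model to admit when the context lacks the answer is a common way to reduce made-up replies when retrieval comes back with weak matches.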
Troubleshooting
| Problem | Reason | Solution |
|---|---|---|
| No text extracted | PDF is a scanned image | Use OCR (Tesseract) |
| No results | Threshold too strict | Use 0.2 instead of 0.5 |
| Embedding dimension error | Wrong vector size | Ensure vector(1536) |
| Slow search | Missing index | Create IVFFLAT index |
| OpenAI API errors | Wrong key | Fix .env |
You can find this complete code repository here on my GitHub:
https://github.com/NitinPandit/RAG-Document-Nodejs