AI Automation & Agents  

🚀 Building a Local Document RAG System Using Node.js, Supabase, and OpenAI

AI-powered search is exploding right now, and Retrieval-Augmented Generation (RAG) is at the heart of it. From chatbots that understand your company policies to internal knowledge assistants, RAG lets you query private documents using natural language, securely and accurately.

In this article, I'll walk you through how I built a fully local document RAG application using:

  • Node.js (Express)

  • Supabase (Postgres + pgvector)

  • OpenAI Embeddings + Chat Models

  • pdf-parse (PDF text extraction)

Let's break it down.

💡 What We're Building

Imagine placing a PDF like Policies.pdf inside a folder.
Now you want to ask:

"What is the leave policy?"
"How many casual leaves are allowed?"
"What is the work-from-home guideline?"

This RAG system:

  1. Reads the PDF

  2. Extracts text

  3. Breaks text into overlapping chunks

  4. Converts chunks into embeddings

  5. Stores them in Supabase pgvector

  6. Uses semantic search to find relevant chunks

  7. Generates accurate answers using OpenAI

All from your local machine.
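
Tied together, the query side of that pipeline fits in a few lines. This is a minimal sketch rather than the app's actual code: `embed`, `search`, and `generate` are hypothetical stand-ins for the OpenAI embedding call, the Supabase match_documents RPC, and the chat-completion call.

```javascript
// Minimal sketch of the query-time RAG pipeline.
// The three dependencies are injected so the flow is easy to test
// and to swap (e.g., a different embedding model or vector store).
async function answerQuestion(question, { embed, search, generate }) {
  const queryEmbedding = await embed(question);   // question -> 1536-d vector
  const chunks = await search(queryEmbedding);    // semantic search in pgvector
  const context = chunks.map((c) => c.content).join('\n\n');
  return generate(question, context);             // LLM answers from the context
}
```

Keeping each stage as a separate function makes it straightforward to swap pieces later without touching the rest of the flow.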

πŸ—οΈ Architecture Overview

Here's how the flow works end-to-end:

MyDocs/ PDFs
       ↓
pdf-parse extracts text
       ↓
Chunking (1000 chars + 200 overlap)
       ↓
OpenAI Embeddings (1536-d vectors)
       ↓
Supabase (pgvector)
       ↓
Semantic Search (match_documents RPC)
       ↓
LLM (GPT-4.1-mini or any model)
       ↓
Answer to User

Simple, modular, and scalable.

📂 Folder Structure

Document-RAG-App/
│
├── MyDocs/
│   ├── Policies.pdf     # Local documents
│
├── index.js             # Main RAG backend
├── package.json
├── .env                 # API keys
└── README.md

Just drop your PDFs into MyDocs/ and hit the indexing API.

🔧 Setting Up the Project

Step 1: Install Dependencies

npm install express cors dotenv openai @supabase/supabase-js pdf-parse

Step 2: Environment Variables

Create .env:

OPENAI_API_KEY=your_openai_key
SUPABASE_URL=https://your-project-url.supabase.co
SUPABASE_ANON_KEY=your_anon_key
PORT=3000

πŸ—„οΈ Supabase Setup (Vector DB)

Supabase provides Postgres + pgvector, a perfect fit for RAG apps.

1. Enable the vector extension

create extension if not exists vector;

2. Create the document table

create table if not exists "MyPDFDocuments" (
  id bigserial primary key,
  content text,
  embedding vector(1536),
  title text,
  source text,
  path text,
  created_at timestamptz default now()
);

3. Create a vector index (recommended)

create index if not exists mypdfdocuments_embedding_idx
on "MyPDFDocuments"
using ivfflat (embedding vector_cosine_ops)
with (lists = 100);

4. Add the semantic search RPC function

create or replace function match_documents(
  query_embedding vector(1536),
  match_threshold float,
  match_count int
)
returns table (
  id bigint,
  content text,
  similarity float
)
language plpgsql
as $$
begin
  return query
  select
    d.id,
    d.content,
    1 - (d.embedding <=> query_embedding) as similarity
  from "MyPDFDocuments" d
  where 1 - (d.embedding <=> query_embedding) > match_threshold
  order by d.embedding <=> query_embedding
  limit match_count;
end;
$$;
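
From Node.js, this function is invoked through the Supabase client's rpc() method. Here is a minimal sketch with a hypothetical searchChunks helper (not necessarily the repo's name); the client is passed in as an argument so it can be stubbed in tests:

```javascript
// Hypothetical wrapper around the match_documents RPC.
// `supabase` is a @supabase/supabase-js client instance.
async function searchChunks(supabase, queryEmbedding, { threshold = 0.2, count = 5 } = {}) {
  const { data, error } = await supabase.rpc('match_documents', {
    query_embedding: queryEmbedding,
    match_threshold: threshold,
    match_count: count,
  });
  if (error) throw new Error(`match_documents failed: ${error.message}`);
  return data; // [{ id, content, similarity }, ...]
}
```

The defaults here (0.2 threshold, 5 chunks) are illustrative starting points, not values from the original code.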

📥 Indexing Local PDF Files

The app exposes an endpoint for indexing:

▶️ POST /index-docs

This:

  • Reads all PDFs in MyDocs/

  • Extracts text using pdf-parse

  • Generates embeddings

  • Pushes them to Supabase

Example

curl -X POST http://localhost:3000/index-docs

You'll see logs like:

Indexing PDF: MyDocs/Policies.pdf

❓ Ask Anything With Natural Language

▶️ POST /query

Request body:

{"query": "What is the leave policy?"}

Example query:

curl -X POST http://localhost:3000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What is the leave policy?"}'

The app:

  1. Embeds your question

  2. Finds the most relevant chunks from Supabase

  3. Sends those chunks as context to OpenAI

  4. Returns a clean, helpful answer
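
Step 3, sending the retrieved chunks as context, is mostly prompt assembly. A minimal sketch with a hypothetical buildPrompt helper (the exact system instructions are my assumption, not the repo's wording):

```javascript
// Hypothetical helper: turn retrieved chunks + the user's question
// into a chat-completion message array.
function buildPrompt(question, chunks) {
  const context = chunks.map((c, i) => `[${i + 1}] ${c.content}`).join('\n\n');
  return [
    {
      role: 'system',
      content: 'Answer using only the provided context. If the answer is not in the context, say so.',
    },
    { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` },
  ];
}
```

Numbering the chunks makes it easy to ask the model to cite which passage an answer came from.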

Response

{"answer": "According to the company leave policy..."}

🧠 Understanding the RAG Pipeline

1. PDF Extraction (pdf-parse)

Extracts raw text from each page of the PDF.

2. Chunking

Chunks of:

  • 1000 characters

  • 200 character overlap

This overlap ensures continuity of context during vector search.
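
A character-based chunker with overlap is only a few lines. Here is a minimal sketch (chunkText is a hypothetical name, not necessarily the repo's implementation):

```javascript
// Split text into fixed-size chunks where each chunk repeats the
// last `overlap` characters of the previous one.
function chunkText(text, size = 1000, overlap = 200) {
  const chunks = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk reached
    start += size - overlap;                // step forward, keeping overlap
  }
  return chunks;
}
```
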

3. Embeddings

Uses:

  • Model: text-embedding-3-small
  • Dimensions: 1536

4. Vector Storage

Each chunk becomes a vector row in Supabase.

5. Semantic Search

Cosine similarity picks the most relevant chunks.
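
For reference, pgvector's <=> operator computes cosine distance, which is 1 minus the cosine similarity. A plain-JavaScript version, useful for unit tests or tiny in-memory searches (not something the Supabase path actually needs):

```javascript
// Cosine similarity between two equal-length numeric vectors.
// Returns 1 for identical directions, 0 for orthogonal vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```
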

6. Final Answer

LLM generates a precise answer using those chunks.

πŸ› Troubleshooting

Common problems and fixes:

  • No text extracted: the PDF is a scanned image. Fix: run OCR first (e.g., Tesseract).
  • No results: the similarity threshold is too strict. Fix: lower it (e.g., 0.2 instead of 0.5).
  • Embedding dimension error: wrong vector size. Fix: ensure the column is vector(1536).
  • Slow search: missing vector index. Fix: create the IVFFLAT index.
  • OpenAI API errors: wrong or missing key. Fix: check OPENAI_API_KEY in .env.
You can find the complete code in this repository on my GitHub:

https://github.com/NitinPandit/RAG-Document-Nodejs