This article shows a clear, safe, and maintainable pattern for implementing “ask your PDF” in ASP.NET Core: extract text from the PDF → chunk + embed → store vectors in a vector store → run semantic search → use an LLM to generate answers (RAG). It covers the architecture, an ER diagram, the request sequence, sample C# code (PDF extraction, embedding calls, Pinecone integration), production concerns, and monitoring.
Overview (what you will build)
PDF ingestion: accept PDF upload, extract text and metadata.
Chunking: split large text into overlapping passages (roughly 500–1,500 tokens).
Embeddings: call embedding model to get vector for each chunk.
Vector store: persist vectors + metadata (doc id, chunk id, character offsets); Pinecone (or Weaviate) is recommended.
Query: embed user question, nearest-neighbour search to get top-k chunks.
Answering: pass retrieved chunks + user question to LLM (prompt template) to produce final answer and citations.
UI: simple endpoint that returns answer + source chunks.
Key choices you can swap: embedding provider (OpenAI, Azure OpenAI), vector DB (Pinecone, Weaviate, local SQL-based store), chunk size and overlap, and prompt template.
Two important facts up front:
OpenAI-style embeddings are the usual choice and are documented in official guides.
Pinecone provides an official .NET SDK and production features (indexing, vector metadata, Pinecone Local for CI). Use it for production vector storage.
Architecture (compact)
[Upload PDF] --> [ASP.NET Core API] --> {PDF Extract + Chunk}
                         |                        |
                         v                        v
[User Query] <-- [LLM answer (RAG)] <-- [Vector DB (Pinecone)]
ER-style view (very small)
Document (DocumentId, Filename, UploadedAt)
1 → * Chunks (ChunkId, DocumentId, Text, StartPos, EndPos, EmbeddingId)
EmbeddingStore (EmbeddingId, Vector[], Metadata(json))
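As a concrete starting point, here is a minimal sketch of the Document and Chunk entities as C# POCOs. Property names follow the ER view above (the in-memory chunker sample later uses the shorter Start/End names); in practice the EmbeddingStore usually lives inside the vector DB itself, so the chunk only keeps the vector's id.
// Minimal entity sketch matching the ER view (names are illustrative)
public class Document
{
    public Guid DocumentId { get; set; }
    public string Filename { get; set; } = "";
    public DateTime UploadedAt { get; set; }
    public List<Chunk> Chunks { get; set; } = new();
}

public class Chunk
{
    public Guid ChunkId { get; set; }
    public Guid DocumentId { get; set; }
    public string Text { get; set; } = "";
    public int StartPos { get; set; }
    public int EndPos { get; set; }
    public string? EmbeddingId { get; set; }   // id of the vector upserted into the vector DB
}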
Sequence (short)
User uploads PDF → API stores file, starts ingestion job.
Ingestion: extract plain text, split into chunks, compute embeddings, upsert vectors into Pinecone with chunk metadata.
User asks a question → API embeds query → vector DB nearest-neighbour search → fetch top-k chunks → compose prompt with chunks → call LLM for answer → return answer + chunk citations.
Components & responsibilities
API (ASP.NET Core) — upload endpoints, query endpoints, job orchestration.
Ingest worker — background worker or queued job to process large files asynchronously.
PDF extractor — iText7, PdfPig, or commercial libraries; handle scanned PDFs with OCR (Tesseract) if needed.
Embedding service — wrapper to call OpenAI/Azure embeddings API.
Vector DB — Pinecone (official .NET SDK) or Weaviate; store vectors and metadata and do nearest-neighbour queries.
LLM service — OpenAI/GPT or Azure OpenAI for final answer generation (prompt + retrieved chunks).
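These components map naturally onto DI registrations. A minimal sketch of the wiring in Program.cs; the interface and class names are illustrative placeholders, not types from any specific library.
// Program.cs (sketch; interface/class names are illustrative)
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddControllers();
builder.Services.AddHttpClient<IEmbeddingService, OpenAiEmbeddingService>();  // wraps the embeddings API
builder.Services.AddSingleton<IPdfTextExtractor, ITextPdfExtractor>();        // iText7/PdfPig wrapper
builder.Services.AddSingleton<IChunker, SlidingWindowChunker>();
builder.Services.AddSingleton<IVectorStore, PineconeVectorStore>();           // Pinecone SDK wrapper
builder.Services.AddScoped<IAnswerService, RagAnswerService>();               // prompt + LLM call
builder.Services.AddHostedService<IngestionWorker>();                         // background ingestion

var app = builder.Build();
app.MapControllers();
app.Run();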
PDF extraction (C# sample)
Use iText7 (or PdfPig). For scanned PDFs use OCR (Tesseract) before text extraction.
// Using iText7 (NuGet: itext7); also needs `using System.Text;` for StringBuilder
public string ExtractTextFromPdf(string filePath)
{
    var sb = new StringBuilder();
    using (var reader = new iText.Kernel.Pdf.PdfReader(filePath))
    using (var pdfDoc = new iText.Kernel.Pdf.PdfDocument(reader))
    {
        var pageCount = pdfDoc.GetNumberOfPages();
        for (int i = 1; i <= pageCount; i++)   // iText page numbers are 1-based
        {
            var page = pdfDoc.GetPage(i);
            var strategy = new iText.Kernel.Pdf.Canvas.Parser.Listener.SimpleTextExtractionStrategy();
            var text = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(page, strategy);
            sb.AppendLine(text);
        }
    }
    return sb.ToString();
}
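For scanned PDFs the text layer is empty, so OCR is needed before chunking. A minimal sketch using the Tesseract NuGet wrapper; it assumes the PDF pages have already been rendered to image files (a separate step) and that a tessdata folder with eng.traineddata is available. Paths and language are illustrative.
// NuGet: Tesseract (wrapper around the Tesseract OCR engine)
using Tesseract;

public string OcrPageImage(string imagePath)
{
    // "./tessdata" and "eng" are placeholders; point them at your language data.
    using var engine = new TesseractEngine("./tessdata", "eng", EngineMode.Default);
    using var img = Pix.LoadFromFile(imagePath);
    using var page = engine.Process(img);
    return page.GetText();
}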
Chunking strategy
Chunk length: ~500–1,500 tokens (roughly 2–6 KB of text per chunk); overlap: 10–20% to preserve context.
Keep metadata: document id, chunk index, original character offsets, page number.
Simple chunker (C#)
List<Chunk> ChunkText(string text, int chunkSize = 2000, int overlap = 200)
{
    var chunks = new List<Chunk>();
    int start = 0;
    while (start < text.Length)
    {
        var end = Math.Min(start + chunkSize, text.Length);
        var slice = text.Substring(start, end - start);
        chunks.Add(new Chunk { Text = slice, Start = start, End = end });
        if (end == text.Length) break;   // last chunk reached
        start = end - overlap;           // step forward, keeping the overlap
    }
    return chunks;
}
Embeddings: call OpenAI (C#)
Official OpenAI docs show how to call the embeddings endpoint; use the recommended embedding model and validate token limits and chunk sizes. (Temperature applies to the answer-generation call, not to embeddings; keep it at 0 there to reduce hallucination.)
Example using HTTP client (simplified)
// Assumes _http already has the "Authorization: Bearer <OPENAI_API_KEY>" header set
// and that OpenAiEmbeddingResponse is a DTO matching { "data": [ { "embedding": [...] } ] }.
public async Task<float[]> GetEmbeddingAsync(string text)
{
    var payload = new
    {
        model = "text-embedding-3-small",
        input = text
    };
    var resp = await _http.PostAsJsonAsync("https://api.openai.com/v1/embeddings", payload);
    resp.EnsureSuccessStatusCode();
    var doc = await resp.Content.ReadFromJsonAsync<OpenAiEmbeddingResponse>();
    return doc!.data[0].embedding;
}
Keep batching in mind: embed many chunks per request (the API supports batch inputs).
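A minimal batching sketch, assuming the same _http client and OpenAiEmbeddingResponse DTO as above; the embeddings endpoint accepts an array of inputs and returns results in the same order.
public async Task<List<float[]>> GetEmbeddingsAsync(IReadOnlyList<string> texts)
{
    var payload = new
    {
        model = "text-embedding-3-small",
        input = texts   // array input = one round trip for many chunks
    };
    var resp = await _http.PostAsJsonAsync("https://api.openai.com/v1/embeddings", payload);
    resp.EnsureSuccessStatusCode();
    var doc = await resp.Content.ReadFromJsonAsync<OpenAiEmbeddingResponse>();
    // data[i].embedding corresponds to texts[i]
    return doc!.data.Select(d => d.embedding).ToList();
}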
Vector DB: Pinecone (C#)
Pinecone provides an official .NET SDK. Use index upsert and query features. Set metadata for each vector (documentId, chunkId, text snippet or pointer).
Basic pattern
// Create Pinecone client (from the official SDK)
// Note: the method and property names below illustrate the upsert/query pattern;
// check the current Pinecone .NET SDK docs for the exact types and signatures.
var pinecone = new PineconeClient(apiKey);

// Upsert vectors
await pinecone.Indexes.UpsertAsync(indexName, new UpsertRequest
{
    vectors = chunks.Select(c => new Vector
    {
        id = $"{docId}_{c.Index}",
        values = c.Embedding,   // float[]
        metadata = new { documentId = docId, chunkIndex = c.Index, text = c.TextSnippet }
    }).ToList()
});

// Query
var q = await pinecone.Indexes.QueryAsync(indexName, new QueryRequest
{
    vector = queryEmbedding,
    topK = 5,
    includeMetadata = true
});
Retrieval + Answer generation (RAG)
When a user asks a question:
Embed the question.
Query vector DB (top-k, e.g. k=5–10).
Retrieve metadata and text of top chunks.
Build an LLM prompt: system + few-shot + include retrieved chunks as context + user question + instruction to cite chunk ids/pages.
Call LLM (GPT) to generate final answer and ask it to produce a short list of sources (chunk ids and page numbers).
Prompt outline
System: You are a helpful assistant that answers using only the provided context. If context does not contain the answer, say "I don't know".
Context:
[CHUNK 1: doc=abc.pdf page=10] ...text...
[CHUNK 2: ...]
User question: "<user question here>"
Instruction: Provide a concise answer and at the end list the sources used as bullet points with document filename and page/chunk.
Always include a verification step so the answer does not hallucinate: instruct the model to reply "I don't know" (matching the system message above) when the retrieved context does not support an answer.
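A minimal prompt-builder sketch following the outline above; RetrievedChunk and its Filename/Page/Text properties are illustrative shapes for the metadata returned from the vector DB.
// RetrievedChunk is an illustrative shape for metadata returned from the vector DB.
public record RetrievedChunk(string Filename, int Page, string Text);

public string BuildWithContexts(string question, IEnumerable<RetrievedChunk> chunks)
{
    var sb = new StringBuilder();
    sb.AppendLine("System: You are a helpful assistant that answers using only the provided context.");
    sb.AppendLine("If the context does not contain the answer, say \"I don't know\".");
    sb.AppendLine();
    sb.AppendLine("Context:");
    int i = 1;
    foreach (var c in chunks)
        sb.AppendLine($"[CHUNK {i++}: doc={c.Filename} page={c.Page}] {c.Text}");
    sb.AppendLine();
    sb.AppendLine($"User question: \"{question}\"");
    sb.AppendLine("Instruction: Provide a concise answer, then list the sources used (document filename and page/chunk).");
    return sb.ToString();
}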
Example ASP.NET Core endpoints
POST /api/documents/upload — stores file, returns documentId and starts ingestion job.
GET /api/documents/{id}/status — ingestion progress.
POST /api/query — body { documentIds?:[], query: string } → returns { answer, sources[] }.
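A sketch of the upload endpoint matching the first route; _storageRoot and _ingestionQueue are illustrative placeholders for wherever files are stored and however ingestion jobs are queued.
[HttpPost("upload")]
public async Task<IActionResult> Upload(IFormFile file)
{
    if (file == null || file.Length == 0)
        return BadRequest("No file uploaded.");

    var documentId = Guid.NewGuid();
    var path = Path.Combine(_storageRoot, $"{documentId}.pdf");

    // Persist the raw PDF, then hand off to the background ingestion job.
    await using (var stream = System.IO.File.Create(path))
        await file.CopyToAsync(stream);

    await _ingestionQueue.EnqueueAsync(documentId, path);

    return Accepted(new { DocumentId = documentId });
}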
Query handler (simplified)
[HttpPost("query")]
public async Task<IActionResult> Query([FromBody] QueryRequest req)
{
var qEmbedding = await _embedService.GetEmbeddingAsync(req.Query);
var results = await _pinecone.QueryAsync(indexName, qEmbedding, topK: req.TopK ?? 5);
var contexts = results.Matches.Select(m => m.Metadata);
var prompt = _promptBuilder.BuildWithContexts(req.Query, contexts);
var answer = await _llmService.GenerateAsync(prompt);
return Ok(new { Answer = answer.Text, Sources = contexts.Select(c => new { c.documentId, c.chunkIndex }) });
}
Practical tips & production concerns
Batch embeddings: embed chunks in batches to reduce latency and cost.
Metadata: store filename, page number, chunk index, char offsets and a short snippet. This lets UI show highlighted context.
Pinecone config: choose appropriate index metric (cosine) and dimension matching chosen embedding model. Use namespaces to separate tenants or projects.
OCR: if PDFs are scanned images, run OCR (Tesseract or commercial) before embeddings.
Chunk sizing: too large → expensive embeddings; too small → context loss. 500–1500 tokens typical.
Cache query embeddings for repeated questions against the same document (see the caching sketch after this list).
Cost control: monitor token usage and embedding calls; batch requests and cache embeddings.
Security: protect API keys (OpenAI, Pinecone) in Key Vault / environment variables; do not log raw PDF contents to logs in production.
Data retention & privacy: provide deletion workflow for uploaded docs; redact PII before sending to external APIs if required by policy.
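A minimal query-embedding cache sketch using IMemoryCache; IEmbeddingService is the illustrative wrapper interface from earlier, and the key scheme and expiry are assumptions to tune.
// Requires Microsoft.Extensions.Caching.Memory
public class CachedEmbeddingService
{
    private readonly IEmbeddingService _inner;
    private readonly IMemoryCache _cache;

    public CachedEmbeddingService(IEmbeddingService inner, IMemoryCache cache)
    {
        _inner = inner;
        _cache = cache;
    }

    public async Task<float[]> GetEmbeddingAsync(string text)
    {
        // Repeated questions reuse the cached vector instead of making a new API call.
        var embedding = await _cache.GetOrCreateAsync("emb:" + text, entry =>
        {
            entry.SlidingExpiration = TimeSpan.FromHours(1);   // tune to your traffic
            return _inner.GetEmbeddingAsync(text);
        });
        return embedding!;
    }
}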
Citations for major tool claims: OpenAI embeddings docs and Pinecone .NET SDK & docs are authoritative.
For PDF extraction via iText see iText documentation.
For local Pinecone emulator (useful for CI) see Pinecone Local docs.
Monitoring, testing and observability
Track ingestion metrics: files processed/day, average chunks per doc, embedding calls.
Log vector upsert failures and embedding failures separately.
Expose trace id for each query so you can connect answer back to the exact chunks used.
Add unit tests for chunker, integration tests for embedding+vector DB (use Pinecone Local), and end-to-end tests for RAG pipeline.
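For example, a small xUnit test for the chunker; it assumes ChunkText has been moved onto a Chunker helper class, and the sizes are arbitrary.
using System.Linq;
using Xunit;

public class ChunkerTests
{
    [Fact]
    public void Chunks_Cover_Whole_Text_And_Overlap()
    {
        var text = new string('a', 5000);
        var chunks = new Chunker().ChunkText(text, chunkSize: 2000, overlap: 200);

        Assert.Equal(0, chunks.First().Start);
        Assert.Equal(text.Length, chunks.Last().End);
        // Every chunk after the first starts before the previous one ends (overlap preserved).
        for (int i = 1; i < chunks.Count; i++)
            Assert.True(chunks[i].Start < chunks[i - 1].End);
    }
}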
Deployment checklist
Store keys in Key Vault / secrets manager.
Use a background worker (IHostedService) or a job queue such as Hangfire for ingestion.
Use retries with exponential backoff for network calls to OpenAI and Pinecone (a sketch follows this checklist).
Limit concurrent ingestion jobs to avoid hitting rate limits.
Use containerized workers; Pinecone Local can be used in CI for tests.
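A minimal, framework-free retry helper with exponential backoff; the attempt count and delays are assumptions to tune against your rate limits, and a library such as Polly can serve the same role.
public static async Task<T> WithRetryAsync<T>(Func<Task<T>> action, int maxAttempts = 4)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await action();
        }
        catch (HttpRequestException) when (attempt < maxAttempts)
        {
            // Back off 2s, 4s, 8s ... before the next attempt.
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt)));
        }
    }
}

// Usage: var embedding = await WithRetryAsync(() => _embedService.GetEmbeddingAsync(text));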