ASP.NET Core  

Implementing AI Document Query (“ask your PDF”) in ASP.NET Core

This article shows a clear, safe, and maintainable pattern for implementing “ask your PDF” in ASP.NET Core: extract text from the PDF → chunk + embed → store vectors in a vector store → run semantic search → use an LLM to generate answers (RAG). I cover the architecture, an ER diagram, the request sequence, sample C# code (PDF extraction, embedding calls, Pinecone integration), production concerns and monitoring.

Overview (what you will build)

  1. PDF ingestion: accept PDF upload, extract text and metadata.

  2. Chunking: split large text into overlapping passages (roughly 500–1,500 tokens each).

  3. Embeddings: call embedding model to get vector for each chunk.

  4. Vector store: persist vectors + metadata (document id, chunk id, character offsets); Pinecone (or Weaviate) is recommended.

  5. Query: embed user question, nearest-neighbour search to get top-k chunks.

  6. Answering: pass retrieved chunks + user question to LLM (prompt template) to produce final answer and citations.

  7. UI: simple endpoint that returns answer + source chunks.

Key choices you can swap: embedding provider (OpenAI, Azure OpenAI), vector DB (Pinecone, Weaviate, local SQL-based store), chunk size and overlap, and prompt template.

Two important facts up front:

  • OpenAI-style embeddings are the usual choice and are documented in official guides.

  • Pinecone provides an official .NET SDK and production features (indexing, vector metadata, Pinecone Local for CI). Use it for production vector storage.

Architecture (compact)

[Upload PDF] --> [ASP.NET Core API] --> {PDF Extract + Chunk}
                                                  |
                                                  v
[User Query] --> [ASP.NET Core API] --> [Vector DB (Pinecone)] --> [LLM answer (RAG)]

ER-style view (very small)

Document (DocumentId, Filename, UploadedAt)
  1 → * Chunks (ChunkId, DocumentId, Text, StartPos, EndPos, EmbeddingId)
EmbeddingStore (EmbeddingId, Vector[], Metadata(json))
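
A rough translation of this view into C# persistence entities might look like the sketch below. The shapes are illustrative only, not a fixed schema; the slimmer in-memory Chunk type used by the later code samples is defined in the chunking section.

// Illustrative persistence entities matching the ER view above
public class Document
{
    public Guid DocumentId { get; set; }
    public string Filename { get; set; } = "";
    public DateTimeOffset UploadedAt { get; set; }
    public List<DocumentChunk> Chunks { get; set; } = new();   // 1 -> * relationship
}

public class DocumentChunk
{
    public Guid ChunkId { get; set; }
    public Guid DocumentId { get; set; }
    public string Text { get; set; } = "";
    public int StartPos { get; set; }
    public int EndPos { get; set; }
    public string? EmbeddingId { get; set; }   // id of the vector stored in the vector DB
}

// In Pinecone the "EmbeddingStore" row is simply the vector plus its metadata;
// a relational mirror (if you keep one) could look like this:
public class EmbeddingRecord
{
    public string EmbeddingId { get; set; } = "";
    public float[] Vector { get; set; } = Array.Empty<float>();
    public string MetadataJson { get; set; } = "{}";
}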

Sequence (short)

  1. User uploads PDF → API stores file, starts ingestion job.

  2. Ingestion: extract plain text, split into chunks, compute embeddings, upsert vectors into Pinecone with chunk metadata.

  3. User asks a question → API embeds query → vector DB nearest-neighbour search → fetch top-k chunks → compose prompt with chunks → call LLM for answer → return answer + chunk citations.

Components & responsibilities

  • API (ASP.NET Core) — upload endpoints, query endpoints, job orchestration.

  • Ingest worker — background worker or queued job to process large files asynchronously.

  • PDF extractor — iText7, PdfPig, or commercial libraries; handle scanned PDFs with OCR (Tesseract) if needed.

  • Embedding service — wrapper to call OpenAI/Azure embeddings API.

  • Vector DB — Pinecone (official .NET SDK) or Weaviate; store vectors and metadata and do nearest-neighbour queries.

  • LLM service — OpenAI/GPT or Azure OpenAI for final answer generation (prompt + retrieved chunks).
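
One possible wiring of the components above in Program.cs is sketched below. The interface and class names (IEmbeddingService, OpenAiEmbeddingService, IVectorStore, PineconeVectorStore, IPdfTextExtractor, PdfExtractor, IngestionWorker) are hypothetical placeholders for your own implementations.

// Program.cs (ASP.NET Core) — hypothetical service names, shown only to make the boundaries concrete
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddControllers();

// Typed HttpClient for the embedding calls; the API key comes from configuration, never source code
builder.Services.AddHttpClient<IEmbeddingService, OpenAiEmbeddingService>(c =>
{
    c.BaseAddress = new Uri("https://api.openai.com/");
    c.DefaultRequestHeaders.Authorization =
        new System.Net.Http.Headers.AuthenticationHeaderValue("Bearer", builder.Configuration["OpenAI:ApiKey"]);
});

builder.Services.AddSingleton<IVectorStore, PineconeVectorStore>();    // wraps the Pinecone SDK
builder.Services.AddSingleton<IPdfTextExtractor, PdfExtractor>();      // wraps iText7 (+ OCR fallback)
builder.Services.AddHostedService<IngestionWorker>();                  // background ingestion queue

var app = builder.Build();
app.MapControllers();
app.Run();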

PDF extraction (C# sample)

Use iText7 (or PdfPig). For scanned PDFs use OCR (Tesseract) before text extraction.

// Using iText7 (NuGet: itext7)
using System.Text;
using iText.Kernel.Pdf;
using iText.Kernel.Pdf.Canvas.Parser;
using iText.Kernel.Pdf.Canvas.Parser.Listener;

public string ExtractTextFromPdf(string filePath)
{
    var sb = new StringBuilder();
    using (var reader = new PdfReader(filePath))
    using (var pdfDoc = new PdfDocument(reader))
    {
        var pageCount = pdfDoc.GetNumberOfPages();
        for (int i = 1; i <= pageCount; i++)
        {
            // A fresh strategy per page keeps each page's text separate
            var strategy = new SimpleTextExtractionStrategy();
            sb.AppendLine(PdfTextExtractor.GetTextFromPage(pdfDoc.GetPage(i), strategy));
        }
    }
    return sb.ToString();
}

Notes

  • iText7 is robust for layout-aware extraction; for scanned images run OCR before or use commercial OCR.
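
For scanned pages, one option is the Tesseract NuGet wrapper. The sketch below assumes the PDF page has already been rasterized to an image file (for example with a PDF-rendering library), which iText alone does not do, and that ./tessdata contains the language data.

// NuGet: Tesseract — assumes pageImagePath points at a rasterized PDF page (PNG/TIFF)
using Tesseract;

public string OcrPageImage(string pageImagePath)
{
    using var engine = new TesseractEngine("./tessdata", "eng", EngineMode.Default);
    using var image = Pix.LoadFromFile(pageImagePath);
    using var page = engine.Process(image);
    // GetText returns the recognized text; GetMeanConfidence can flag low-quality scans for review
    return page.GetText();
}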

Chunking strategy

  • Chunk length: ~500–1,500 tokens (roughly 2–6 KB of text per chunk); overlap: 10–20% to preserve context.

  • Keep metadata: document id, chunk index, original character offsets, page number.

Simple chunker (pseudocode)

List<Chunk> ChunkText(string text, int chunkSize = 2000, int overlap = 200)
{
    var chunks = new List<Chunk>();
    int start = 0;
    while (start < text.Length)
    {
        var end = Math.Min(start + chunkSize, text.Length);
        var slice = text.Substring(start, end - start);
        chunks.Add(new Chunk { Text = slice, Start = start, End = end });
        if (end == text.Length) break;              // last chunk reached
        start = Math.Max(end - overlap, start + 1); // step back by the overlap, but always make progress
    }
    return chunks;
}
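
The Chunk type used above is an in-memory working record (an assumption of this article, not a library type); the Index, PageNumber, Embedding and TextSnippet members carry the metadata recommended earlier and are reused by the Pinecone upsert below.

public record Chunk
{
    public int Index { get; init; }            // position of the chunk within the document
    public string Text { get; init; } = "";
    public int Start { get; init; }            // character offset in the extracted text
    public int End { get; init; }
    public int? PageNumber { get; init; }      // optional, if the extractor tracks pages
    public float[]? Embedding { get; init; }   // filled in after the embedding call

    // Short snippet stored as vector metadata instead of the full text
    public string TextSnippet => Text.Length <= 200 ? Text : Text[..200];
}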

Embeddings: call OpenAI (C#)

The official OpenAI docs show how to call the embeddings endpoint; use a current embedding model (for example text-embedding-3-small, as below) and validate token limits and chunk sizes before sending. (Keep temperature = 0 on the later answer-generation call; the embeddings endpoint has no temperature parameter.)

Example using HTTP client (simplified)

// _http is an HttpClient that already carries the Authorization: Bearer <OPENAI_API_KEY> header
public async Task<float[]> GetEmbeddingAsync(string text)
{
    var payload = new {
        model = "text-embedding-3-small",
        input = text
    };

    var resp = await _http.PostAsJsonAsync("https://api.openai.com/v1/embeddings", payload);
    resp.EnsureSuccessStatusCode();
    var doc = await resp.Content.ReadFromJsonAsync<OpenAiEmbeddingResponse>();
    return doc!.Data[0].Embedding;
}

Keep batching in mind: embed many chunks per request (the API supports batch inputs).
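
A batched variant of the call above might look like the sketch below; OpenAiEmbeddingResponse and EmbeddingItem are this article's own minimal DTOs (ReadFromJsonAsync matches the lowercase JSON fields case-insensitively by default), not SDK types.

// Sends many chunks per request; the endpoint returns one embedding per input
public async Task<List<float[]>> GetEmbeddingsAsync(IReadOnlyList<string> texts)
{
    var payload = new {
        model = "text-embedding-3-small",
        input = texts
    };

    var resp = await _http.PostAsJsonAsync("https://api.openai.com/v1/embeddings", payload);
    resp.EnsureSuccessStatusCode();
    var doc = await resp.Content.ReadFromJsonAsync<OpenAiEmbeddingResponse>();

    // Each item carries an index field, so sort to be safe before returning
    return doc!.Data.OrderBy(d => d.Index).Select(d => d.Embedding).ToList();
}

// Minimal DTOs for the response shape (JSON fields: data[].embedding, data[].index)
public class OpenAiEmbeddingResponse
{
    public List<EmbeddingItem> Data { get; set; } = new();
}

public class EmbeddingItem
{
    public float[] Embedding { get; set; } = Array.Empty<float>();
    public int Index { get; set; }
}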

Vector DB: Pinecone (C#)

Pinecone provides an official .NET SDK. Use index upsert and query features. Set metadata for each vector (documentId, chunkId, text snippet or pointer).

Basic pattern

// Illustrative pattern only; exact class and method names vary between SDK versions,
// so check the Pinecone .NET SDK docs for the release you install.
var pinecone = new PineconeClient(apiKey);
var index = pinecone.Index(indexName);

// Upsert vectors
await index.UpsertAsync(new UpsertRequest
{
    Vectors = chunks.Select(c => new Vector
    {
        Id = $"{docId}_{c.Index}",
        Values = c.Embedding, // float[] from the embedding call
        Metadata = new Metadata
        {
            ["documentId"] = docId,
            ["chunkIndex"] = c.Index,
            ["text"] = c.TextSnippet
        }
    }).ToList()
});

// Query
var results = await index.QueryAsync(new QueryRequest
{
    Vector = queryEmbedding,
    TopK = 5,
    IncludeMetadata = true
});

Tips

  • Use Pinecone Local for CI/dev to avoid cloud costs.

  • Store full text only as necessary — store short snippet + pointer to document for retrieval.

Retrieval + Answer generation (RAG)

When a user asks a question:

  1. Embed the question.

  2. Query vector DB (top-k, e.g. k=5–10).

  3. Retrieve metadata and text of top chunks.

  4. Build an LLM prompt: system + few-shot + include retrieved chunks as context + user question + instruction to cite chunk ids/pages.

  5. Call LLM (GPT) to generate final answer and ask it to produce a short list of sources (chunk ids and page numbers).

Prompt outline

System: You are a helpful assistant that answers using only the provided context. If context does not contain the answer, say "I don't know".
Context:
[CHUNK 1: doc=abc.pdf page=10] ...text...
[CHUNK 2: ...]
User question: "<user question here>"
Instruction: Provide a concise answer and at the end list the sources used as bullet points with document filename and page/chunk.

Always include a verification step so the answer does not hallucinate: instruct the model to reply "I don't know" (as in the system prompt above) whenever the retrieved context does not support an answer.
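
A minimal prompt builder following the outline above might look like the sketch below, assuming the vector-store metadata has been mapped into a small RetrievedChunk record (both names are this article's own, not library types).

public class PromptBuilder
{
    public string BuildWithContexts(string question, IEnumerable<RetrievedChunk> chunks)
    {
        var sb = new StringBuilder();
        sb.AppendLine("You are a helpful assistant that answers using only the provided context.");
        sb.AppendLine("If the context does not contain the answer, say \"I don't know\".");
        sb.AppendLine();
        sb.AppendLine("Context:");
        int i = 1;
        foreach (var c in chunks)
        {
            sb.AppendLine($"[CHUNK {i++}: doc={c.Filename} page={c.PageNumber} chunk={c.ChunkIndex}]");
            sb.AppendLine(c.Text);
            sb.AppendLine();
        }
        sb.AppendLine($"User question: \"{question}\"");
        sb.AppendLine("Instruction: Provide a concise answer, then list the sources used " +
                      "as bullet points with document filename and page/chunk.");
        return sb.ToString();
    }
}

// Illustrative shape of a retrieved chunk, filled from vector-store metadata
public record RetrievedChunk(string Filename, int? PageNumber, int ChunkIndex, string Text);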

Example ASP.NET Core endpoints

  • POST /api/documents/upload — stores file, returns documentId and starts ingestion job.

  • GET /api/documents/{id}/status — ingestion progress.

  • POST /api/query — body { documentIds?:[], query: string } → returns { answer, sources[] }.
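
The request/response bodies for POST /api/query can stay very small; the records below are illustrative DTOs matching the simplified handler that follows.

public record QueryRequest
{
    public List<string>? DocumentIds { get; init; }  // optional filter: restrict search to these documents
    public string Query { get; init; } = "";
    public int? TopK { get; init; }                  // defaults to 5 in the handler below
}

public record QueryResponse(string Answer, List<SourceRef> Sources);
public record SourceRef(string DocumentId, int ChunkIndex);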

Query handler (simplified)

[HttpPost("query")]
public async Task<IActionResult> Query([FromBody] QueryRequest req)
{
    var qEmbedding = await _embedService.GetEmbeddingAsync(req.Query);
    var results = await _pinecone.QueryAsync(indexName, qEmbedding, topK: req.TopK ?? 5);
    var contexts = results.Matches.Select(m => m.Metadata);
    var prompt = _promptBuilder.BuildWithContexts(req.Query, contexts);
    var answer = await _llmService.GenerateAsync(prompt);
    return Ok(new { Answer = answer.Text, Sources = contexts.Select(c => new { c.documentId, c.chunkIndex }) });
}
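
The _llmService used above is just a thin wrapper over a chat completion call. Below is a minimal sketch against the OpenAI chat completions endpoint; the model name is interchangeable and the response DTOs are this article's own minimal types.

public class LlmService
{
    private readonly HttpClient _http; // carries the Authorization: Bearer <key> header

    public LlmService(HttpClient http) => _http = http;

    public async Task<LlmAnswer> GenerateAsync(string prompt)
    {
        var payload = new
        {
            model = "gpt-4o-mini",           // any chat-capable model works here
            temperature = 0,                 // keep answers deterministic for RAG
            messages = new[]
            {
                new { role = "system", content = "Answer only from the provided context." },
                new { role = "user", content = prompt }
            }
        };

        var resp = await _http.PostAsJsonAsync("https://api.openai.com/v1/chat/completions", payload);
        resp.EnsureSuccessStatusCode();
        var doc = await resp.Content.ReadFromJsonAsync<ChatCompletionResponse>();
        return new LlmAnswer(doc!.Choices[0].Message.Content);
    }
}

public record LlmAnswer(string Text);

// Minimal response DTOs (JSON fields: choices[].message.content)
public class ChatCompletionResponse { public List<Choice> Choices { get; set; } = new(); }
public class Choice { public ChatMessage Message { get; set; } = new(); }
public class ChatMessage { public string Content { get; set; } = ""; }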

Practical tips & production concerns

  • Batch embeddings: embed chunks in batches to reduce latency and cost.

  • Metadata: store filename, page number, chunk index, char offsets and a short snippet. This lets UI show highlighted context.

  • Pinecone config: choose the appropriate index metric (cosine for OpenAI embeddings) and a dimension matching the chosen embedding model (e.g. 1536 for text-embedding-3-small). Use namespaces to separate tenants or projects.

  • OCR: if PDFs are scanned images, run OCR (Tesseract or commercial) before embeddings.

  • Chunk sizing: too large → expensive embeddings; too small → context loss. 500–1500 tokens typical.

  • Cache query embeddings for repeated questions against the same document (see the caching sketch after this list).

  • Cost control: monitor token usage and embedding calls; batch requests and cache embeddings.

  • Security: protect API keys (OpenAI, Pinecone) in Key Vault / environment variables; do not log raw PDF contents to logs in production.

  • Data retention & privacy: provide deletion workflow for uploaded docs; redact PII before sending to external APIs if required by policy.

Citations for the major tool claims: the OpenAI embeddings docs and the Pinecone .NET SDK docs are authoritative.
For PDF extraction via iText, see the iText documentation.
For the local Pinecone emulator (useful for CI), see the Pinecone Local docs.

Monitoring, testing and observability

  • Track ingestion metrics: files processed/day, average chunks per doc, embedding calls.

  • Log vector upsert failures and embedding failures separately.

  • Expose a trace id for each query so you can connect an answer back to the exact chunks used (see the sketch after this list).

  • Add unit tests for chunker, integration tests for embedding+vector DB (use Pinecone Local), and end-to-end tests for RAG pipeline.
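
A lightweight way to carry that trace id through the pipeline is a logging scope inside the query handler, as in the fragment below (property names are illustrative; req and contexts are the handler variables from earlier).

// Inside the query handler: one scope per question ties all downstream log lines together
var queryTraceId = Guid.NewGuid().ToString("N");
using (_logger.BeginScope(new Dictionary<string, object?> { ["QueryTraceId"] = queryTraceId }))
{
    _logger.LogInformation("RAG query received: {QueryLength} chars, topK={TopK}", req.Query.Length, req.TopK ?? 5);
    // ... embed the question, search the vector DB, build the prompt, call the LLM ...
    _logger.LogInformation("RAG query answered using {ChunkCount} chunks", contexts.Count);
}
// Return queryTraceId with the response so an answer can be traced back to the exact retrieval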

Deployment checklist

  • Store keys in Key Vault / secrets manager.

  • Use a background worker (IHostedService/BackgroundService) or a job queue such as Hangfire for ingestion.

  • Use retries with exponential backoff for network calls to OpenAI and Pinecone (a Polly-based sketch follows this checklist).

  • Limit concurrent ingestion jobs to avoid hitting rate limits.

  • Use containerized workers; Pinecone Local can be used in CI for tests.
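
One common way to get the exponential-backoff behaviour from the checklist is Polly via Microsoft.Extensions.Http.Polly; the sketch below reuses the hypothetical IEmbeddingService/OpenAiEmbeddingService names from earlier, and the retry counts and delays are illustrative.

// NuGet: Microsoft.Extensions.Http.Polly
using Polly;
using Polly.Extensions.Http;

builder.Services.AddHttpClient<IEmbeddingService, OpenAiEmbeddingService>()
    .AddPolicyHandler(GetRetryPolicy());

static IAsyncPolicy<HttpResponseMessage> GetRetryPolicy() =>
    HttpPolicyExtensions
        .HandleTransientHttpError()                              // 5xx and 408 responses
        .OrResult(r => (int)r.StatusCode == 429)                 // also retry rate-limit responses
        .WaitAndRetryAsync(
            retryCount: 5,
            sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt))); // 2s, 4s, 8s...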