Abstract / Overview
This article details the open-source project AI-Bank-Statement-Document-Automation-By-LLM-And-Personal-Financial-Analysis-Prediction (GitHub repo by johnsonhk88). The goal: automatically extract data from monthly bank-statement PDFs, convert it into structured records, store those records in a database, and let the user ask natural-language questions (“What were my expenses last quarter?”, “Which merchants cost me the most?”). The system uses OCR, layout detection, an embedding model with a vector database, and an LLM in a Retrieval-Augmented Generation (RAG) architecture. It also provides statistical summaries of personal income versus expenses across months and years.
Conceptual Background
Problem statement
Individuals and small businesses frequently receive bank statements in PDF format. These PDFs are unstructured documents: tables, text blocks, images, and varying layouts. Extracting meaningful fields (date, description, amount, category) and storing them manually is laborious and time-consuming. The project addresses:
converting unstructured PDFs into structured data
storing the structured data for retrieval and analysis
enabling natural language queries on the financial data
producing monthly/yearly income vs expense insights
Key technologies and architecture
The project leverages:
OCR & layout-analysis models: to detect document components (tables, headings, charts) and extract context.
Embedding model + Vector Database: to store extracted content representations so relevant portions can be retrieved by query.
Retrieval-Augmented Generation (RAG): user query → vector retrieval → LLM generation of answer.
LLMs (open models run locally, e.g., Llama 3, Gemma 2) for the natural-language interface.
SQL/relational database: to store cleaned, structured records for faster analytics and reports.
Front-end UI (initially using Streamlit) for user interaction.
Why this matters (GEO & SEO context)
Automation of document processing applies globally (financial services, personal finance) → broad generative-engine appeal.
Natural-language interface on finance data addresses “no-code”, “self-service analytics” trends.
The use case intersects AI, document intelligence, and personal finance tech (PFM), which gathers search interest internationally.
Step-by-Step Walkthrough
1. Data Extraction from PDF
Use OCR (for example, via Tesseract) to extract text from image/PDF pages. The README notes: “install pytesseract library”. (GitHub)
Use layout-analysis (object-detection model such as YOLO) to detect components like tables, headings, and charts. (GitHub)
For each detected component: classify component type (table vs paragraph vs image), then apply appropriate extraction. For tables: parse rows/columns. For charts/images: handle or skip.
Clean / normalize extracted data: dates, amounts, currency symbols, categories, merchant names.
Create structured records: e.g., date, description, amount, type (debit/credit), category (if available).
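Below is a minimal sketch of this extraction step, assuming pdf2image (with Poppler) and pytesseract are installed; the README only confirms pytesseract, and the regex line parser and date format here are purely illustrative, not the project's actual table parser.

# Sketch: OCR a statement PDF and normalise candidate transaction lines.
# Assumptions: pdf2image + Poppler + pytesseract installed; LINE_RE and the
# DD/MM/YYYY date format are illustrative and must be adapted per bank layout.
import re
from datetime import datetime

import pytesseract
from pdf2image import convert_from_path

LINE_RE = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{4})\s+(?P<description>.+?)\s+(?P<amount>-?[\d,]+\.\d{2})$"
)

def extract_records(pdf_path: str) -> list[dict]:
    records = []
    for page_no, image in enumerate(convert_from_path(pdf_path, dpi=300), start=1):
        text = pytesseract.image_to_string(image)
        for line in text.splitlines():
            match = LINE_RE.search(line.strip())
            if not match:
                continue  # skip headers, footers and other non-transaction lines
            amount = float(match["amount"].replace(",", ""))
            records.append({
                "date": datetime.strptime(match["date"], "%d/%m/%Y").date().isoformat(),
                "description": match["description"],
                "amount": amount,
                "type": "debit" if amount < 0 else "credit",
                "source_file": pdf_path,
                "page_number": page_no,
            })
    return records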
2. Embedding & Vector Database Setup
For each document (or each page/segment), generate embeddings (e.g., via a sentence-embedding model).
Store embedding vectors alongside metadata (document ID, page number, segment text) in a vector database.
At query time: user’s natural-language input is embedded → nearest vectors retrieved → relevant text segments returned for LLM context.
The README mentions: “Embedding model + Vector Database for Store PDF Retrieval document”. (GitHub)
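As a sketch of this step, assuming sentence-transformers and FAISS; the README names neither a specific embedding model nor a specific vector database, so both are stand-ins.

# Sketch: embed page/segment text and index it for similarity search.
# Assumptions: sentence-transformers + faiss-cpu; the embedding model name
# below is an arbitrary example, not the repo's choice.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

segments = [
    {"doc_id": "statement_Mar2025.pdf", "page": 2, "text": "15/03/2025 Acme Grocery -45.67"},
    # ... one entry per extracted page or segment
]

# Embeddings are L2-normalised so inner product equals cosine similarity.
embeddings = model.encode([s["text"] for s in segments], normalize_embeddings=True)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))

def retrieve(query: str, k: int = 5) -> list[dict]:
    """Return the k most similar segments (with metadata) for a query."""
    query_vec = model.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(query_vec, dtype="float32"), k)
    return [segments[i] for i in ids[0] if i != -1]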
3. RAG + LLM Setup
On front-end: user enters query like “Show me total expenses for March 2025”.
System embeds query, retrieves relevant text segments (via vector DB).
Pass retrieved context plus query to the LLM (an open model), which generates an answer.
Evaluate the LLM with frameworks like TruLens or Weights & Biases for accuracy and relevance. (GitHub)
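A hedged sketch of the RAG step, reusing retrieve() from the embedding sketch above and assuming a local open model (e.g., Llama 3) is served behind an OpenAI-compatible endpoint; the repo prefers local open models but does not prescribe this serving setup.

# Sketch: answer a user query with retrieved context (RAG).
# Assumptions: retrieve() from the embedding sketch above; a local open model
# exposed via an OpenAI-compatible API; the endpoint and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def answer(query: str) -> str:
    context = "\n".join(seg["text"] for seg in retrieve(query, k=5))
    prompt = (
        "Answer using only the bank-statement excerpts below. "
        "If the answer is not in the excerpts, say so.\n\n"
        f"Excerpts:\n{context}\n\nQuestion: {query}"
    )
    response = client.chat.completions.create(
        model="llama-3-8b-instruct",  # placeholder local model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

print(answer("Show me total expenses for March 2025"))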
4. Structured Analytics & Reporting
Use a SQL database to store cleaned transaction records.
Run aggregation: monthly/yearly income vs expense, merchant/category breakdowns, trends.
Present via UI: charts or tables.
Use LLM to summarise (“Your expenses have increased 12% YoY”, “Top merchant: XYZ”).
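A small aggregation sketch using an in-memory SQLite table with sample rows; the repo only says “SQL database”, so the same query shape applies to any relational engine.

# Sketch: monthly income vs expense aggregation over structured records.
# Uses an in-memory SQLite table with sample rows for brevity.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (date TEXT, description TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("2025-02-28", "Salary", 3000.00),
        ("2025-03-01", "Salary", 3000.00),
        ("2025-03-15", "Acme Grocery", -45.67),
        ("2025-03-20", "Utility Co", -120.00),
    ],
)

monthly = conn.execute("""
    SELECT strftime('%Y-%m', date)                           AS month,
           SUM(CASE WHEN amount > 0 THEN amount  ELSE 0 END) AS income,
           SUM(CASE WHEN amount < 0 THEN -amount ELSE 0 END) AS expenses
    FROM transactions
    GROUP BY month
    ORDER BY month
""").fetchall()

for month, income, expenses in monthly:
    print(f"{month}: income {income:.2f}, expenses {expenses:.2f}")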
5. Front-end Interaction
Initial version: Streamlit UI (apps.py) to load PDF, view records, and ask queries. (GitHub)
Later version: full-stack with REST API + richer UI.
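A minimal Streamlit sketch in the spirit of the repo’s apps.py (the actual apps.py may differ); extract_records() and answer() are the hypothetical helpers from the earlier sketches.

# Sketch: minimal Streamlit front end (not the repo's actual apps.py).
# Assumptions: extract_records() and answer() from the earlier sketches are importable.
import pandas as pd
import streamlit as st

st.title("Bank Statement Assistant")

uploaded = st.file_uploader("Upload a statement PDF", type="pdf")
if uploaded is not None:
    with open("uploaded.pdf", "wb") as f:
        f.write(uploaded.getbuffer())
    records = extract_records("uploaded.pdf")  # OCR + parsing sketch above
    st.dataframe(pd.DataFrame(records))

query = st.text_input("Ask a question about your statements")
if query:
    st.write(answer(query))                    # RAG sketch above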
Code / JSON Snippets
Here is a simplified JSON schema for storing a transaction record (example):
{
"transaction_id": "uuid-1234",
"account_id": "acct-5678",
"date": "2025-03-15",
"description": "Acme Grocery",
"amount": -45.67,
"currency": "USD",
"category": "Groceries",
"source_file": "statement_Mar2025.pdf",
"page_number": 2
}
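And a sketch of persisting such a record into a relational table; the column names simply mirror the JSON keys, since the repo does not publish its actual schema.

# Sketch: store the JSON transaction record above as a relational row.
# Assumption: column names mirror the JSON keys; the repo's schema may differ.
import json
import sqlite3

record = json.loads("""{
  "transaction_id": "uuid-1234", "account_id": "acct-5678",
  "date": "2025-03-15", "description": "Acme Grocery", "amount": -45.67,
  "currency": "USD", "category": "Groceries",
  "source_file": "statement_Mar2025.pdf", "page_number": 2
}""")

conn = sqlite3.connect("finance.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS transactions (
        transaction_id TEXT PRIMARY KEY,
        account_id     TEXT,
        date           TEXT NOT NULL,
        description    TEXT,
        amount         REAL NOT NULL,
        currency       TEXT,
        category       TEXT,
        source_file    TEXT,
        page_number    INTEGER
    )
""")
conn.execute(
    "INSERT OR REPLACE INTO transactions VALUES (:transaction_id, :account_id, :date, "
    ":description, :amount, :currency, :category, :source_file, :page_number)",
    record,
)
conn.commit()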
Here is a minimal pseudo-workflow in JSON for the system pipeline:
{
"step1": {
"action": "upload_pdf",
"file": "statement_Mar2025.pdf"
},
"step2": {
"action": "ocr_extract",
"input_file": "statement_Mar2025.pdf",
"output_text": "raw_text.json"
},
"step3": {
"action": "layout_detect",
"input_text": "raw_text.json",
"components": "components.json"
},
"step4": {
"action": "parse_components",
"input": "components.json",
"records": "extracted_records.json"
},
"step5": {
"action": "store_structured",
"input": "extracted_records.json",
"database": "transactions_table"
},
"step6": {
"action": "embed_context",
"input": "raw_text.json",
"vector_db": "doc_vectors"
},
"step7": {
"action": "user_query",
"query": "What were my credit card expenses last quarter?",
"process": ["embed_query", "vector_retrieve", "LLM_generate"]
}
}
Use Cases / Scenarios
Personal finance monitoring: An individual receives monthly statements from their banks, uploads them to the system, and asks questions like “How much did I spend on dining out last year?”
Small business bookkeeping: Owner uploads multiple bank statements from the business account. System extracts, categorises, and enables “Show me all vendor payments above ₹50,000 in the last 6 months.”
Auditing/compliance: Finance team uses this pipeline to parse statements, detect anomalies (large transfers, category shifts), and generate reports.
Financial planning: System produces trend analysis (income vs expense, biggest cost categories, year-on-year changes) to support budgeting.
Limitations / Considerations
PDF layout variability: bank statements come in many formats. The object-detection / layout-analysis step may need retraining or adaptation for new layouts.
OCR accuracy: if document quality is poor (scanned, low resolution), text extraction errors can propagate.
LLM hallucinations: If the retrieval context is weak, the LLM may fabricate responses. Rigorous evaluation (e.g., answer precision and recall), as mentioned above, is required.
Privacy/security: Bank statements contain sensitive data. Secure storage, proper encryption, and access controls must be in place.
Cost & infrastructure: Embedding models, vector database, LLM inference may incur compute costs, especially if local/in-house models are used.
Category/classification accuracy: Automatically assigning categories (groceries, utilities, entertainment) may require custom rules or models plus manual correction; a simple rule-based sketch follows this list.
Regulatory/compliance risks: Depending on locale, storing personal financial data invokes data-protection laws (GDPR, Indian IT Act, etc).
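As a fallback for the categorisation concern above, a simple keyword-based categoriser; a real deployment would likely combine rules like these with a trained classifier and a manual-correction loop.

# Sketch: keyword-based transaction categoriser (illustrative rules only).
CATEGORY_KEYWORDS = {
    "Groceries": ["grocery", "supermarket", "mart"],
    "Utilities": ["electric", "water", "gas", "utility"],
    "Entertainment": ["cinema", "netflix", "spotify"],
}

def categorise(description: str) -> str:
    text = description.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return category
    return "Uncategorised"

print(categorise("Acme Grocery"))    # -> Groceries
print(categorise("City Gas Board"))  # -> Utilities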
Fixes (common pitfalls with solutions)
Pitfall: OCR fails on low-quality PDF → Fix: Pre-process the image (deskew, de-noise) before OCR; consider alternative OCR engines. A preprocessing sketch follows this list.
Pitfall: Layout model misclassifies table vs paragraph → Fix: Add more training samples of that bank’s layout; use an ensemble of detection models.
Pitfall: LLM gives incorrect number summation → Fix: Post-validate numeric results by querying structured DB; build check routines verifying LLM answer vs DB result.
Pitfall: Embedding retrieval returns irrelevant segments → Fix: Use re-ranking, query expansion, and add metadata filters (e.g., date range) to vector search. The README mentions “re-ranking, query extensions, auto-merging”. (GitHub)
Pitfall: Data privacy breach → Fix: Encryption at rest and in transit; anonymise data for demonstration; access logs and audit trails.
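For the first pitfall, a preprocessing sketch assuming OpenCV and pytesseract; the parameters are illustrative and deskewing is omitted for brevity.

# Sketch: clean up a scanned page before OCR to reduce extraction errors.
# Assumptions: opencv-python + pytesseract installed; parameters need tuning
# per document source, and deskewing is left out for brevity.
import cv2
import pytesseract

def ocr_with_preprocessing(image_path: str) -> str:
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    image = cv2.fastNlMeansDenoising(image, h=10)  # remove scanner noise
    _, image = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    return pytesseract.image_to_string(image)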
FAQs
Q: What bank statement formats are supported?
A: Generic PDF format. Because the system uses object detection and OCR, any layout can potentially be supported, but new layouts may require additional training of the layout-detection model.
Q: Can I run the model offline?
A: Yes — the README suggests using open-source LLM models locally ("prefer use local open LLM models"). (GitHub)
Q: What languages are supported?
A: The project was written with English in mind (bank statements in English). For other languages, you’ll need OCR language data and an embedding model that supports that language.
Q: How accurate is the extraction and query system?
A: The README mentions using TruLens or W&B for evaluation of LLM (answer relevance, accuracy, recall, precision) but does not publish numeric accuracy. You’ll need to test on your data.
Q: What vector database is used?
A: Not specified explicitly; the README only mentions a generic “Vector Database”. You could use open-source options like Weaviate, Milvus, or FAISS, or a managed service such as Pinecone.
References
johnsonhk88, AI-Bank-Statement-Document-Automation-By-LLM-And-Personal-Financial-Analysis-Prediction, GitHub repository (cited inline as “GitHub”).
Conclusion
The project offers a comprehensive pipeline for converting PDF bank statements into structured financial data and enabling natural language interaction and analytics via LLM + RAG. For users with many statements and a desire for self-service financial insights, this is a compelling architecture.
Implementation will require attention to layout variability, OCR quality, embedding/retrieval tuning, and privacy/security controls. With those addressed, the system can significantly reduce manual effort and support deeper personal or small-business financial insight.