What I Built
Built a multi-modal RAG system that answers plain-English questions about 1,000+ U.S. government PDFs with cited, grounded responses. Hybrid retrieval (BM25 keyword search + dense vector search, combined with reciprocal rank fusion) captures both exact-term and semantic matches. GPT-4o-mini vision captions charts, tables, and figures so visual content becomes searchable. Every response includes page-level citations and per-request cost/latency metrics.
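The fusion step is standard reciprocal rank fusion: each document's score is the sum of 1/(k + rank) over the ranked lists it appears in. A minimal sketch (doc IDs and the two example rankings are made up for illustration; k=60 is the conventional RRF constant, not necessarily what this project used):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each doc scores sum(1 / (k + rank)) over every list it appears in,
    so items ranked highly by multiple retrievers float to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-retriever rankings for one query:
bm25_hits = ["grant_42", "budget_7", "charter_3"]
dense_hits = ["charter_3", "grant_42", "appendix_9"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# grant_42 wins: it ranks high in both lists.
```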
What I Learned
Treating PDFs as text-only throws away half the answer. Captioning 13,009 images with GPT-4o-mini vision made diagrams, charts, and schematics searchable — pushing eval accuracy to 86.7%. Hybrid search (BM25 + dense + RRF) consistently outperformed pure dense retrieval because government documents are full of exact grant numbers and IDs that embeddings miss.
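The captioning step amounts to sending each extracted image to the Chat Completions API as a base64 data URL and indexing the returned caption like any other chunk. A sketch of the request payload (the prompt wording and `max_tokens` are assumptions, not the project's exact values):

```python
import base64

def caption_request(image_bytes, model="gpt-4o-mini"):
    """Build a Chat Completions payload that asks the vision model to
    caption a figure so the caption can be embedded and searched."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this chart, table, or figure in 2-3 "
                         "sentences, including any numbers or labels."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "max_tokens": 150,
    }

payload = caption_request(b"\x89PNG fake bytes")  # stand-in image data
```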
Also learned that A/B testing retrieval parameters matters: k=3 beat k=7 with 93.3% vs 80.0% accuracy at ~50% lower cost — more context isn’t always better.
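The cost side of that A/B result follows directly from prompt size: input cost scales roughly linearly with the number of retrieved chunks, so 3 chunks instead of 7 cuts context cost to about 3/7. A back-of-envelope sketch (chunk size and the per-token price are illustrative assumptions, not measured values from this project):

```python
def context_cost(k, tokens_per_chunk=500, price_per_1k_tokens=0.00015):
    """Rough input-token cost of feeding k retrieved chunks to the
    generator; both defaults are assumptions for illustration."""
    return k * tokens_per_chunk * price_per_1k_tokens / 1000

c3, c7 = context_cost(3), context_cost(7)
saving = 1 - c3 / c7  # cost scales with k, so k=3 is ~57% cheaper than k=7
```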
Key Results
| Metric | Value |
|---|---|
| Eval accuracy (15 questions) | 86.7% (13/15 pass) |
| RAGAS faithfulness | 0.77 |
| RAGAS answer relevancy | 0.98 |
| RAGAS context precision | 0.85 |
| Corpus indexed | 1,076 PDFs → 34,374 text chunks + 13,009 captions |
| Query cost | ~$0.0003 per query |
| Total index cost | ~$7 one-time |
| p50 / p95 latency | 2,900 ms / 8,800 ms |
Project
Tech Stack: PyMuPDF, GPT-4o-mini Vision, OpenAI Embeddings (1536-dim), FAISS, BM25, Reciprocal Rank Fusion, Streamlit | Data: 1,076 Library of Congress .gov PDFs
Citation
@online{prasanna_koppolu,
  author = {Prasanna Koppolu, Bhanu},
  title = {Multi-Modal {RAG} for {Government} {PDFs}},
  url = {https://bhanuprasanna2001.github.io/projects/multimodal_rag.html},
  langid = {en}
}