What I Built
Built a multi-modal RAG system that answers plain-English questions about 1,000+ U.S. government PDFs with cited, grounded responses. Hybrid retrieval (BM25 keyword search + dense vector search, combined with reciprocal rank fusion) captures both exact-term and semantic matches. GPT-4o-mini vision captions charts, tables, and figures so visual content becomes searchable. Every response includes page-level citations and per-request cost/latency metrics.
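The fusion step is standard reciprocal rank fusion: each document's score is the sum of 1/(k + rank) over the ranked lists it appears in. A minimal sketch (doc IDs and the two example rankings are made up for illustration; k=60 is the conventional RRF constant, not necessarily what this project used):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each doc scores sum(1 / (k + rank)) over every list it appears in,
    so items ranked highly by multiple retrievers float to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical per-retriever rankings for one query:
bm25_hits = ["grant_42", "budget_7", "charter_3"]
dense_hits = ["charter_3", "grant_42", "appendix_9"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# grant_42 wins: it ranks high in both lists.
```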
What I Learned
Treating PDFs as text-only throws away half the answer. Captioning 13,009 images with GPT-4o-mini vision made diagrams, charts, and schematics searchable — pushing eval accuracy to 86.7%. Hybrid search (BM25 + dense + RRF) consistently outperformed pure dense retrieval because government documents are full of exact grant numbers and IDs that embeddings miss.
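The captioning step amounts to sending each extracted image to the Chat Completions API as a base64 data URL and indexing the returned caption like any other chunk. A sketch of the request payload (the prompt wording and `max_tokens` are assumptions, not the project's exact values):

```python
import base64

def caption_request(image_bytes, model="gpt-4o-mini"):
    """Build a Chat Completions payload that asks the vision model to
    caption a figure so the caption can be embedded and searched."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this chart, table, or figure in 2-3 "
                         "sentences, including any numbers or labels."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
        "max_tokens": 150,
    }

payload = caption_request(b"\x89PNG fake bytes")  # stand-in image data
```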
Also learned that A/B testing retrieval parameters matters: k=3 beat k=7 with 93.3% vs 80.0% accuracy at ~50% lower cost — more context isn’t always better.
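The cost side of that A/B result follows directly from prompt size: input cost scales roughly linearly with the number of retrieved chunks, so 3 chunks instead of 7 cuts context cost to about 3/7. A back-of-envelope sketch (chunk size and the per-token price are illustrative assumptions, not measured values from this project):

```python
def context_cost(k, tokens_per_chunk=500, price_per_1k_tokens=0.00015):
    """Rough input-token cost of feeding k retrieved chunks to the
    generator; both defaults are assumptions for illustration."""
    return k * tokens_per_chunk * price_per_1k_tokens / 1000

c3, c7 = context_cost(3), context_cost(7)
saving = 1 - c3 / c7  # cost scales with k, so k=3 is ~57% cheaper than k=7
```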
Key Results
| Metric | Value |
|---|---|
| Eval accuracy (15 questions) | 86.7% (13/15 pass) |
| RAGAS faithfulness | 0.77 |
| RAGAS answer relevancy | 0.98 |
| RAGAS context precision | 0.85 |
| Corpus indexed | 1,076 PDFs → 34,374 text chunks + 13,009 captions |
| Query cost | ~$0.0003 per query |
| Total index cost | ~$7 one-time |
| p50 / p95 latency | 2,900 ms / 8,800 ms |
Project
Tech Stack: PyMuPDF, GPT-4o-mini Vision, OpenAI Embeddings (1536-dim), FAISS, BM25, Reciprocal Rank Fusion, Streamlit | Data: 1,076 Library of Congress .gov PDFs
Citation
@online{prasanna_koppolu,
  author = {Prasanna Koppolu, Bhanu},
  title = {Multi-Modal {RAG} for {Government} {PDFs}},
  url = {https://bhanuprasanna2001.github.io/projects/multimodal_rag.html},
  langid = {en}
}