OCR vs. Image Embeddings for PDF RAG: Which One is Better?

Weaviate vector database · Beginner · RAG & Vector Search · 1mo ago
My colleagues at Weaviate released IRPAPERS, a benchmark comparing image-based and text-based retrieval over 3,230 pages from 166 scientific papers.

The setup: take the same PDFs and process them two ways. For text, run OCR with GPT-4.1, then retrieve with hybrid search over Arctic 2.0 embeddings and BM25. For images, embed raw page images with ColModernVBERT multi-vector embeddings. Test both on 180 needle-in-the-haystack questions.

The results:
• Text edges out images at the top rank: 46% vs 43% Recall@1
• Images match or exceed text at deeper recall: 93% vs 91% Recall@20

The two methods also fail on different queries. At Recall@1:
• 22 queries succeed with text but fail with images
• 18 queries succeed with images but fail with text

This complementarity is what makes Multimodal Hybrid Search work. By fusing scores from both text and image retrieval, they achieved:
• 49% Recall@1 (beating either modality alone)
• 81% Recall@5
• 95% Recall@20
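To make the fusion idea concrete, here is a minimal Python sketch of score fusion and Recall@k. The description above does not say which fusion method the benchmark uses, so this sketch assumes min-max normalization followed by a weighted sum, which is one common choice. The function names, the `alpha` weight, and the toy scores are all hypothetical and are not taken from the IRPAPERS data.

```python
from typing import Dict, List


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one retriever's scores to [0, 1] so both modalities are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {page: 1.0 for page in scores}
    return {page: (s - lo) / (hi - lo) for page, s in scores.items()}


def fuse_scores(
    text_scores: Dict[str, float],
    image_scores: Dict[str, float],
    alpha: float = 0.5,
) -> List[str]:
    """Rank pages by alpha * text + (1 - alpha) * image (assumed fusion rule)."""
    text_n = min_max_normalize(text_scores)
    image_n = min_max_normalize(image_scores)
    pages = set(text_n) | set(image_n)
    fused = {
        p: alpha * text_n.get(p, 0.0) + (1 - alpha) * image_n.get(p, 0.0)
        for p in pages
    }
    # Sort by fused score (descending), breaking ties by page id for determinism.
    return sorted(fused, key=lambda p: (-fused[p], p))


def recall_at_k(rankings: List[List[str]], gold: List[str], k: int) -> float:
    """Fraction of queries whose gold page appears in the top k results."""
    hits = sum(1 for ranked, g in zip(rankings, gold) if g in ranked[:k])
    return hits / len(gold)


# Toy scores for two queries (illustrative only, not benchmark data).
# Query 1: text retrieval ranks the gold page first, image retrieval does not.
# Query 2: image retrieval ranks the gold page first, text retrieval does not.
queries = [
    ({"p1": 0.9, "p2": 0.5, "p3": 0.1}, {"p2": 0.8, "p1": 0.7, "p3": 0.2}),
    ({"p4": 12.0, "p5": 11.0, "p6": 3.0}, {"p5": 0.9, "p6": 0.4, "p4": 0.3}),
]
gold = ["p1", "p5"]

text_only = [fuse_scores(t, i, alpha=1.0) for t, i in queries]
image_only = [fuse_scores(t, i, alpha=0.0) for t, i in queries]
fused = [fuse_scores(t, i, alpha=0.5) for t, i in queries]

print(recall_at_k(text_only, gold, 1))   # 0.5
print(recall_at_k(image_only, gold, 1))  # 0.5
print(recall_at_k(fused, gold, 1))       # 1.0
```

In this toy case each modality misses a different query at rank 1, and the fused ranking recovers both, which mirrors the complementarity described above. Real gains depend on the data, the retrievers, and the fusion weights.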

Chapters (7)

0:00 Intro
0:08 Visual- vs Text-based methods
1:04 The IRPapers dataset
1:59 The 6 different search strategies
3:43 The results
4:30 The paper's most interesting finding...
5:11 Conclusion