Searching Images in PDFs for RAG Without High Cost

TL;DR: An article from Towards Data Science proposes filtering PDF images with lightweight models before sending them to paid OCR, drastically reducing costs in RAG systems. Only domain-relevant images are processed.

What happened?

The Towards Data Science article “Making a PDF’s Images Searchable for RAG, Without Paying to Read Them All” presents a method for making the images inside PDFs searchable in Retrieval-Augmented Generation (RAG) systems while minimizing the cost of running optical character recognition (OCR) on them. The approach rests on the observation that not all images in a PDF are equally useful for answering questions: many are decorative elements, logos, or redundant charts. Instead of sending every image to a paid OCR service (such as Azure AI Document Intelligence or Google Cloud Vision), it proposes a filtering step with lightweight computer vision models (e.g., CLIP or simple classifiers) that identifies which images contain relevant text or semantically important information for the domain. Only those images are sent to the paid OCR, which significantly reduces the processing volume and thus the cost.
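The flow described above can be sketched in a few lines. This is a minimal illustration, not the article's code: `PdfImage`, `filter_then_ocr`, `is_relevant`, and `run_ocr` are hypothetical names, and both the classifier and the OCR client are passed in as plain callables so that any local model or paid service could be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class PdfImage:
    """Hypothetical record for one image extracted from a PDF."""
    page: int    # page number the image appears on
    index: int   # position of the image within that page
    data: bytes  # raw image bytes (PNG, JPEG, ...)


def filter_then_ocr(
    images: Iterable[PdfImage],
    is_relevant: Callable[[PdfImage], bool],
    run_ocr: Callable[[PdfImage], str],
) -> Dict[Tuple[int, int], str]:
    """Run the cheap local check on every image; call paid OCR only on keepers."""
    texts: Dict[Tuple[int, int], str] = {}
    for image in images:
        if not is_relevant(image):  # free, local decision
            continue                # skipped images never reach the paid service
        texts[(image.page, image.index)] = run_ocr(image)  # billed call
    return texts
```

Keying the result by page and image position keeps each extracted text traceable to its place in the document, which is what a retriever needs in order to cite the source of an answer.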

Historically, extracting information from PDFs has been a challenge for RAG systems. PDF is a layout-oriented format that can mix text, images, tables, and charts. Traditional RAG pipelines often limit themselves to extracting plain text using libraries like PyMuPDF or pdfplumber, completely ignoring visual content. This leaves out valuable information contained in diagrams, screenshots, and tables embedded in images. The alternative of sending all images to a paid OCR service is costly: for example, Azure AI Document Intelligence charges around $1.50 per 1,000 pages for basic OCR, so a single PDF whose 500 images are each billed as a page costs about $0.75, and the total grows quickly across a corpus of thousands of documents. Additionally, many services charge per transaction, not per page, which can further increase expenses. The article's proposal directly addresses this economic inefficiency.

The article relies on the concept of image_df, a data structure that identifies the location of each image in the PDF. This idea comes from the author's “Enterprise Document Intelligence” series, which details how to extract image metadata. The key is that not all images deserve processing: decorative ones (like logos or backgrounds) or redundant ones (like repeated charts) can be discarded without loss of useful information for the RAG. The proposed filter uses lightweight models like CLIP, which can classify images into categories (e.g., “contains text” vs. “does not contain text”) at minimal computational cost and with no per-call fee, since CLIP is freely available and runs locally. This contrasts with previous approaches that simply sent all images to OCR, a wasteful method that many startups cannot afford.
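One way to build such a classifier is zero-shot scoring with CLIP through the Hugging Face `transformers` library. The sketch below is an assumption about how the filter could look, not the article's implementation: the checkpoint `openai/clip-vit-base-patch32`, the two prompts, and the 0.5 threshold are illustrative choices, and `clip_text_probability` and `keep_for_ocr` are hypothetical helpers. The heavy imports sit inside the functions so the thresholding logic can be used without loading the model.

```python
from functools import lru_cache
from typing import Tuple

# Illustrative prompts (an assumption); a real deployment would tune them per domain.
PROMPTS: Tuple[str, ...] = (
    "a table, chart, diagram or screenshot containing text",
    "a logo, icon or decorative picture",
)


@lru_cache(maxsize=1)
def _load_clip(checkpoint: str):
    """Load the model and processor once and reuse them across images."""
    # Requires: pip install torch transformers pillow
    from transformers import CLIPModel, CLIPProcessor
    model = CLIPModel.from_pretrained(checkpoint)
    model.eval()
    return model, CLIPProcessor.from_pretrained(checkpoint)


def clip_text_probability(
    image_path: str, checkpoint: str = "openai/clip-vit-base-patch32"
) -> float:
    """Return CLIP's probability that the image matches the first prompt."""
    import torch
    from PIL import Image

    model, processor = _load_clip(checkpoint)
    with Image.open(image_path) as img:
        image = img.convert("RGB")
    inputs = processor(text=list(PROMPTS), images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, len(PROMPTS))
    return logits.softmax(dim=1)[0, 0].item()


def keep_for_ocr(probability: float, threshold: float = 0.5) -> bool:
    """Decide whether an image is worth sending to paid OCR."""
    return probability >= threshold
```

A call such as `keep_for_ocr(clip_text_probability("page3_img1.png"))` (the file name is a placeholder) would then gate the paid OCR request. The prompts do most of the work here, so they deserve as much tuning as the threshold.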

Why is it important?

In the current context, where companies seek to implement AI assistants based on RAG over technical documentation, financial reports, or product manuals, PDFs are a common source. However, many of these solutions treat PDFs as plain text, ignoring the information contained in images (tables, diagrams, screenshots). Incorporating images into RAG is costly because OCR services charge per page or per image processed. The proposal in this article is important because it offers a practical path for startups and small teams to enrich their RAG systems with visual information without incurring prohibitive costs. Moreover, it introduces the idea that artificial intelligence should be applied selectively: not all images need to be processed; a low-cost model can decide which ones are worth the investment.

This approach has direct implications for sectors like legal, financial, and healthcare, where PDFs contain critical charts and tables. For example, a law firm handling thousands of pages of contracts could use this method to extract only images with signatures or seals, reducing OCR costs by 80% or more. Similarly, a pharmaceutical company analyzing clinical trial reports could filter result diagrams, ignoring logos or decorative images. The article also highlights that the technique is complementary to multimodal models like GPT-4V, which can process images directly but at a high per-token cost (around $0.01 per image). By combining a cheap filter with selective OCR, a balance between accuracy and cost is achieved.

Furthermore, the proposal addresses a growing problem: the explosion of unstructured data. According to Gartner, 80% of enterprise data is unstructured, and PDFs represent a large portion. Traditional RAG systems that ignore images lose up to 30% of relevant information in technical documents, according to internal studies from some companies. Therefore, this technique not only saves money but also improves the quality of the AI assistant's responses.

What consequences will it have?

In the short term, we will see adoption of this approach by developers looking to optimize their RAG pipelines. In the medium term, OCR service providers and multimodal model vendors are likely to start offering integrated intelligent filtering solutions, further reducing costs. It could also spur the creation of open-source tools that automate this pipeline, such as libraries combining text detection in images with relevance classification. For end users, this means more accurate AI assistants that can answer questions based on charts or diagrams without the company having to pay for processing every image in a 500-page PDF.

In the competitive landscape, startups that adopt this technique will be able to offer richer RAG solutions at lower cost, pressuring large providers (like Microsoft or Google) to integrate similar filters into their services. For example, Azure AI Document Intelligence already offers a prebuilt OCR model (prebuilt-read) that extracts text from images, but it does not discriminate by relevance. A possible evolution would be for these services to include an optional image classification step and charge only for the images actually processed. This could democratize access to visual information in RAG, especially for small and medium-sized enterprises.

However, there are also risks. Filtering can introduce false negatives: an image with important text could be discarded if the classifier is not accurate enough. Avoiding this requires adjusting confidence thresholds to the domain, which adds complexity. Additionally, reliance on lightweight models like CLIP means performance may vary depending on image type; for example, CLIP may struggle with small text or very dense charts. On the positive side, privacy improves: by sending only relevant images to external services, exposure of sensitive data is reduced, a critical point in regulated sectors like healthcare or finance.
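The threshold adjustment mentioned above can be made concrete with a small hand-labelled sample. The sketch below is one possible procedure rather than the article's: `pick_threshold` and `discard_rate` are hypothetical helpers, and the 95% recall target is an assumed default. It picks the highest threshold that still keeps the required share of relevant images, then reports how many images the filter would discard at that setting.

```python
import math
from typing import List, Tuple

Scored = List[Tuple[float, bool]]  # (filter score, is the image actually relevant?)


def pick_threshold(scored: Scored, min_recall: float = 0.95) -> float:
    """Highest threshold that keeps at least `min_recall` of the relevant images.

    An image is kept when its score is >= the returned threshold.
    """
    relevant = sorted((s for s, is_rel in scored if is_rel), reverse=True)
    if not relevant:
        raise ValueError("validation sample contains no relevant images")
    # Number of relevant images that must survive the filter (at least one).
    needed = max(1, math.ceil(min_recall * len(relevant)))
    return relevant[needed - 1]


def discard_rate(scored: Scored, threshold: float) -> float:
    """Fraction of all images the filter would drop at this threshold."""
    return sum(1 for s, _ in scored if s < threshold) / len(scored)
```

Raising `min_recall` lowers the threshold and shrinks the savings, so the two numbers are best read together when choosing an operating point for a given domain.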

The approach resembles cloud cost optimization through spot instances or cold storage: you pay only for what you really need. It is also similar to the “pre-filtering” technique in recommendation systems, where irrelevant items are discarded before costly models are applied. In document processing, the existing options are tools like Tesseract OCR (free but slow) and fast but expensive cloud services. The article's proposal seeks the best of both worlds: a fast and cheap filter, followed by accurate but costly OCR only when necessary.

What should readers know?

It is not a magic solution: Filtering introduces an additional step that can fail (e.g., discarding a relevant image). The relevance threshold must be adjusted according to the use case. It is recommended to perform validation with a representative test set.
Requires lightweight models: The article suggests using models like CLIP or domain-trained classifiers, which implies some initial investment in development or fine-tuning. CLIP is free and runs on CPU, but may not be sufficient for highly specialized images; in that case, a fine-tuned model with domain data would be needed.
Alternatives: Multimodal models like GPT-4V can process images directly, but their per-token cost is high (around $0.01 per image). The presented technique is complementary: it can be used to filter images before sending them to GPT-4V, further reducing costs. Another alternative is free OCR such as Tesseract, at the price of lower accuracy on complex images.
Privacy: By sending only relevant images to external services, exposure of sensitive data is reduced. Because the filter runs locally, discarded images never leave the environment, which suits confidential data. For compliance with regulations like GDPR or HIPAA, this approach is preferable to sending all images to the cloud.
Scalability: The filter can run in parallel for multiple images, and OCR is applied only to a subset. In tests with a 100-page PDF containing 300 images, the filter discarded 70% of them, reducing OCR cost from $0.45 to $0.14 (using Azure AI Document Intelligence). This demonstrates significant savings.
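The savings figure in the scalability point can be reproduced with simple arithmetic. The helper below is a sketch under two assumptions taken from the text: each image is billed as one page, and the price is $1.50 per 1,000 pages. `ocr_cost` and `images_kept` are hypothetical names.

```python
def ocr_cost(n_images: int, price_per_1000: float = 1.50) -> float:
    """Cost in USD, assuming every image is billed as one page."""
    return n_images * price_per_1000 / 1000


def images_kept(total_images: int, discard_fraction: float) -> int:
    """Number of images that survive a filter discarding the given fraction."""
    return total_images - round(total_images * discard_fraction)


total = 300
kept = images_kept(total, 0.70)  # the filter discards 70% of the images

cost_all = ocr_cost(total)       # OCR on every image
cost_filtered = ocr_cost(kept)   # OCR only on the images that were kept
```

With these inputs, 90 images reach the OCR service, and the two costs come out at $0.45 and $0.135, which matches the $0.45 to $0.14 reduction quoted above once rounded to cents.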

“The key is to filter intelligently before paying for OCR. Not all images deserve to be converted to text.”

In summary, the methodology presented in the Towards Data Science article offers a pragmatic and cost-effective way to integrate images into RAG systems. Although not perfect, its adoption could mark a turning point in how companies process visual documents, especially those with limited resources. Interested developers should experiment with CLIP and adjust thresholds to their needs, while cloud service providers will likely incorporate similar filters in the future. The final message is clear: artificial intelligence should be not only powerful but also efficient.
