Docling: Process PDFs for RAG Locally, Without the Cloud

TL;DR: Docling is an open-source tool from IBM that extracts tables, performs OCR, and captures the full structure of PDFs locally, without sending data to the cloud. Ideal for RAG systems requiring privacy and control.

What happened?

IBM has released Docling, an open-source library that processes PDF documents locally for Retrieval-Augmented Generation (RAG) systems. Unlike commercial solutions that require uploading files to the cloud, Docling runs the entire pipeline (OCR, table extraction, and detection of headers and captions) on the user's own machine. According to an article on Towards Data Science, Docling offers accuracy comparable to cloud services like Azure Document Intelligence or Google Document AI, but without per-page costs or reliance on an internet connection.

Historically, PDF processing has been a challenge. Traditional tools like Tesseract OCR (originally developed at Hewlett-Packard, open-sourced in 2005, and sponsored by Google from 2006) offered basic OCR but lacked structured table extraction and layout analysis. Later, solutions like Unstructured (2022) and PyMuPDF improved extraction but still required combining multiple libraries and manual tuning. Docling unifies these steps in a single command, marking a milestone in the maturity of the open-source document ecosystem.

Why is it important?

Extracting information from PDFs is one of the most critical bottlenecks in implementing enterprise RAG. Many organizations handle sensitive documents (contracts, financial reports, medical records) that cannot be sent to external servers due to privacy policies or regulations like GDPR, HIPAA, or national personal data protection laws. Docling removes that barrier by offering a complete pipeline that includes:

Optical character recognition (OCR) for scanned documents, with support for multiple languages.
Table extraction with cell, row, and column structure, including nested and borderless tables.
Detection of headers, footers, titles, and captions, preserving document hierarchy.
Support for mathematical formulas and diagrams, thanks to integration with computer vision models.

Additionally, Docling produces a structural representation in JSON format that can directly feed language models for RAG, facilitating the creation of virtual assistants that answer questions about internal documents. According to IBM's internal tests, Docling achieves 92% accuracy in table extraction compared to 95% for Azure Document Intelligence, but with the advantage that data never leaves the device. This is crucial for sectors like banking, where 78% of institutions (according to a 2023 Deloitte study) cite privacy as the main barrier to adopting AI in the cloud.
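As a rough illustration of why preserved document hierarchy matters for retrieval, the sketch below splits a Markdown export into one chunk per heading, so that each chunk can be embedded and indexed with its section title. The names Chunk and chunk_markdown_by_heading are hypothetical helpers, not part of Docling, and the sample string merely stands in for real converter output.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    heading: str  # title of the section the text belongs to ("" if none)
    text: str     # body text of the section, tables included


# Matches Markdown ATX headings such as "# Title" or "## Section".
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown_by_heading(markdown: str) -> List[Chunk]:
    """Split Markdown into one chunk per heading, skipping empty sections."""
    chunks: List[Chunk] = []
    heading = ""
    lines: List[str] = []

    def flush() -> None:
        body = "\n".join(lines).strip()
        if body:
            chunks.append(Chunk(heading, body))

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()                      # close the previous section
            heading = match.group(2).strip()
            lines = []
        else:
            lines.append(line)
    flush()                              # close the final section
    return chunks


# Stand-in for Markdown produced by a document converter.
sample = "# Report\n\nIntro text.\n\n## Table 1\n\n| a | b |\n|---|---|\n| 1 | 2 |\n"
for chunk in chunk_markdown_by_heading(sample):
    print(chunk.heading, "->", len(chunk.text), "characters")
```

Chunking on headings is only one possible strategy; fixed-size or token-based splitting may suit long sections better.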

What consequences will it have?

In the short term, Docling democratizes access to high-quality document extraction tools, especially for startups and small teams that cannot afford the costs of cloud APIs (which typically charge between $0.01 and $0.05 per page). In the long term, it could accelerate the adoption of RAG in regulated sectors like banking, healthcare, and public administration, where data privacy is paramount. It also provides an alternative to proprietary solutions, reducing dependence on external vendors and fostering data sovereignty.
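To put the per-page pricing in perspective, here is a small back-of-the-envelope calculation using the $0.01 to $0.05 range quoted above. The 100,000-page workload is a hypothetical figure chosen for illustration, and the function name is likewise made up.

```python
from typing import Tuple


def cloud_cost_range_usd(pages: int, low_cents: int = 1, high_cents: int = 5) -> Tuple[float, float]:
    """Return the (low, high) cloud API cost in USD for a page count.

    Prices are given in cents per page to keep the arithmetic exact.
    """
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * low_cents / 100, pages * high_cents / 100


# Hypothetical workload: an archive of 100,000 pages.
low, high = cloud_cost_range_usd(100_000)
print(f"${low:,.0f} to ${high:,.0f}")  # prints "$1,000 to $5,000"
```

Local processing is not free either, since it shifts the cost to hardware and compute time, but that cost does not grow with each page sent to a vendor.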

However, local processing has limitations: it requires sufficient hardware (GPU recommended for heavy OCR, though it works on CPU) and does not benefit from continuous improvements in cloud models. Docling directly competes with tools like Unstructured, PyMuPDF, and Tesseract OCR, but integrates everything into a single package. In comparison, Unstructured requires configuring multiple connectors and has a steeper learning curve. Docling also differs from services like Amazon Textract due to its open-source nature and lack of recurring costs.

In the market, this could pressure cloud providers to reduce prices or improve their free offerings. Additionally, being from IBM, an established player, Docling has a high likelihood of receiving long-term maintenance, unlike smaller community projects.

What should readers know?

Easy installation: pip install docling. Compatible with Python 3.8+. Includes dependencies like PyTorch and transformers.
Output formats: Structured JSON, Markdown, and visual table representation. Also exports to formats like CSV.
Performance: Processes a 10-page PDF with tables in ~15 seconds on a modern CPU (Intel i7, no GPU). With GPU (NVIDIA T4), it reduces to ~5 seconds.
Limitations: Still at an early stage (version 0.1.0); may fail on difficult inputs such as low-quality scans, or on languages it does not support (currently supported: English, Spanish, French, German, Italian, Portuguese, and Simplified Chinese).
License: Apache 2.0, allowing unrestricted commercial use. The code is available on GitHub with detailed documentation.
Use cases: Ideal for extracting data from invoices, financial reports, academic papers, and government forms.
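A minimal sketch of the install-and-convert workflow described in the list above, assuming the DocumentConverter API documented for recent Docling releases (older versions may expose different method names). The helper names output_paths and convert_locally and the "converted" output directory are illustrative choices, not part of the library.

```python
import json
from pathlib import Path
from typing import Dict


def output_paths(pdf_path: str, out_dir: str = "converted") -> Dict[str, Path]:
    """Derive Markdown and JSON output paths from the PDF file name."""
    stem = Path(pdf_path).stem
    base = Path(out_dir)
    return {"markdown": base / f"{stem}.md", "json": base / f"{stem}.json"}


def convert_locally(pdf_path: str, out_dir: str = "converted") -> Dict[str, Path]:
    """Convert one PDF on the local machine and write Markdown and JSON exports."""
    # Imported lazily so this module loads even where docling is not installed.
    # Requires: pip install docling
    from docling.document_converter import DocumentConverter

    result = DocumentConverter().convert(pdf_path)
    paths = output_paths(pdf_path, out_dir)
    paths["markdown"].parent.mkdir(parents=True, exist_ok=True)

    # Markdown export, convenient for chunking and embedding.
    paths["markdown"].write_text(result.document.export_to_markdown(), encoding="utf-8")

    # Structured export as a JSON-serializable dictionary.
    structured = result.document.export_to_dict()
    paths["json"].write_text(
        json.dumps(structured, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return paths
```

Calling convert_locally("contract.pdf") would write converted/contract.md and converted/contract.json. Note that the library typically downloads its models the first time it runs, so an initial internet connection may still be needed before working fully offline.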

"Docling is the first open-source pipeline that combines OCR, table extraction, and document structure in a single step, without needing to send data to the cloud." — Towards Data Science

Conclusion

Docling represents a significant advancement for local PDF processing in the context of RAG. Its focus on privacy, accuracy, and open source makes it a valuable tool for any organization seeking to extract information from documents without compromising security. Although it will not replace cloud solutions in all scenarios (especially those requiring massive scalability or up-to-date language models), it offers a viable and free alternative worth attention. For developers and businesses, it is an opportunity to reduce costs and increase control over their data. The community is expected to contribute improvements, expanding language support and optimizing performance. Ultimately, Docling marks a step toward democratizing document intelligence, aligning with the "local AI" trend that seeks to balance advanced capabilities with privacy.
