RAG Is Not Enough: Full Scan Wins on Computational Queries
An independent experiment finds that enlarging the context window in RAG does not improve accuracy on aggregation tasks and can make errors harder to detect. The proposed fix is a hybrid system that routes computational queries to a deterministic engine.
June 15, 2026 · 5 min read
TL;DR: RAG systems fail on aggregation tasks even with large context windows. The proposed fix is a hybrid approach that routes computational queries to a full-scan engine, which reached 100% accuracy on aggregations in the author's tests.
What Happened?
An article published on Towards Data Science by an independent developer demonstrates that increasing the context window size in Retrieval-Augmented Generation (RAG) systems not only fails to improve accuracy in aggregation tasks but can also make errors harder to detect. The author built a full-scan engine that processes all rows of a dataset deterministically and compared it to traditional RAG pipelines on 100,000 rows. The results show that for queries requiring sums, averages, or other aggregate operations, the RAG approach frequently fails, while the full-scan engine achieves 100% accuracy.
This finding is consistent with earlier work. LLMs have long shown weaknesses in arithmetic and symbolic reasoning. For example, the study “Evaluating the Mathematical Capabilities of Large Language Models” (2023) reportedly found that GPT-3.5 and GPT-4 had error rates exceeding 30% on multi-digit addition problems presented in long contexts. The RAG experiment reproduces this limitation in a data retrieval setting, where the model must extract values from chunks and combine them. The author of the Towards Data Science article notes that as the context window grows, the model tends to “hallucinate” more figures or omit relevant rows, likely because attention is diluted over a larger number of tokens.
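The omitted-rows failure mode can be illustrated without any LLM at all. The sketch below is not the author's code: the dataset, the column names, and the `top_k_sum` helper are assumptions made for illustration, and `top_k_sum` is only a stand-in for a retrieval step that passes a limited number of rows into the context. It shows that a sum computed over a retrieved subset cannot match a sum computed over every row.

```python
import random


def make_rows(n, seed=0):
    """Build a synthetic dataset (hypothetical schema: id, region, amount)."""
    rng = random.Random(seed)
    return [
        {"id": i, "region": rng.choice(["north", "south"]), "amount": rng.randint(1, 500)}
        for i in range(n)
    ]


def full_scan_sum(rows, region):
    """Deterministic aggregation: visit every row exactly once."""
    return sum(r["amount"] for r in rows if r["region"] == region)


def top_k_sum(rows, region, k):
    """Stand-in for retrieval: only the first k matching rows reach the 'context'."""
    retrieved = [r for r in rows if r["region"] == region][:k]
    return sum(r["amount"] for r in retrieved)


rows = make_rows(100_000)
exact = full_scan_sum(rows, "north")
partial = top_k_sum(rows, "north", k=50)

print(f"full scan: {exact}")
print(f"top-50 subset: {partial}")
print(f"missing from the subset answer: {exact - partial}")
```

This models only the coverage gap. It does not model the arithmetic or hallucination errors an LLM can add on top, which the article reports persisting even when more rows fit in the context.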
Why Is This Important?
RAG has become the dominant architecture for document question-answering systems built on LLMs. However, its answers are generated probabilistically from a retrieved subset of the data, which makes it unsuitable for queries that require exact numerical or logical results. Many companies are deploying RAG for financial data analysis, sales reports, or performance metrics and trusting its answers without verification. This study exposes a critical weakness: when a user asks for “total sales last quarter,” RAG may invent figures or miss rows, and a larger context does not fix this.
The impact is especially severe in sectors like banking, healthcare, and logistics, where an aggregation error can translate into million-dollar losses or incorrect clinical decisions. For example, a bank using RAG to summarize transactions might report an incorrect balance if the model omits some operations. Moreover, the trend of LLM providers (OpenAI, Anthropic, Google) releasing models with ever-larger context windows (128k, 200k, up to 1M tokens) could create a false sense of security. The article argues that the problem is not context size but the probabilistic nature of the model: an LLM generates text and does not execute a deterministic computation over the dataset.
Market Implications
- User companies: Organizations that have already deployed RAG for quantitative analysis will need to audit their pipelines and consider hybrid systems. A 2024 Gartner study is reported to estimate that 40% of enterprise RAG implementations have at least one critical error in aggregation tasks. Query routing is the likely remedy for many of them.
- RAG platform providers: LangChain and LlamaIndex, two of the most popular frameworks, may incorporate full-scan modules for computational tasks. LangChain already offers an experimental SQL integration, but it is not yet a standard part of RAG pipelines. Findings like these strengthen the case for offering “deterministic compute engines” as a built-in pipeline component.
- Data teams: ML engineers and data scientists will need to define routing rules: which queries go to RAG and which go to deterministic engines (SQL, pandas, vector databases with exact filters). This means designing query intent classifiers, which adds complexity but improves accuracy.
- Trust in LLMs: The perception that LLMs can replace traditional database systems is likely to weaken. Companies like Databricks and Snowflake are pushing the combination of LLMs with SQL engines, and this study reinforces that direction. Demand for hybrid solutions that separate semantic retrieval from exact computation is expected to increase.
What Should Readers Know?
This is not about abandoning RAG but about understanding its limitations. RAG is well suited to retrieving relevant text chunks and generating responses based on them, but it fails when exact computations over multiple data points are needed. The proposed solution is a routing system: queries involving aggregations, exact filters, or calculations are sent to a full-scan engine that processes all rows deterministically, while semantic or open-ended search queries are handled by RAG. This hybrid approach uses each engine for what it does well and can be implemented with tools like SQL, pandas, or internal search engines.
The article's author implemented a simple system: a rule-based classifier (e.g., keywords like “total,” “average,” “sum”) that redirects matching queries to a full-scan engine. In the author's tests, this system achieved 100% accuracy on aggregations, while RAG reached only 45%. The full-scan engine was also faster and cheaper, as it avoided LLM API calls for computational tasks.
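A keyword router of this kind can be sketched in a few lines. This is a minimal illustration, not the author's implementation: the function names `route` and `answer`, the `column` argument, and the `rag_fallback` callable are all hypothetical, and only the three keywords quoted above come from the article.

```python
import re
from statistics import mean

# Keyword -> deterministic aggregation. The three keywords are the ones
# quoted in the article; the mapping itself is an assumption.
AGGREGATIONS = {
    "total": sum,
    "sum": sum,
    "average": mean,
}


def route(query):
    """Return the first aggregation keyword found in the query, or None."""
    for word in re.findall(r"[a-z]+", query.lower()):
        if word in AGGREGATIONS:
            return word
    return None


def answer(query, rows, column, rag_fallback):
    """Send aggregation queries to a full scan; everything else goes to RAG."""
    keyword = route(query)
    if keyword is None:
        # rag_fallback is a hypothetical callable wrapping the RAG pipeline.
        return rag_fallback(query)
    values = [row[column] for row in rows]  # full scan: every row is read
    return AGGREGATIONS[keyword](values)


rows = [{"amount": 10}, {"amount": 20}, {"amount": 30}]
fake_rag = lambda q: "handled by RAG"

print(answer("What is the total sales amount?", rows, "amount", fake_rag))
print(answer("What is the average order value?", rows, "amount", fake_rag))
print(answer("Summarize the customer complaints", rows, "amount", fake_rag))
```

Matching whole words rather than substrings keeps “Summarize” from triggering the “sum” rule, but keyword rules remain brittle: a query such as “sum up the complaints” would still be misrouted, which is one reason a production system might replace the rules with an intent classifier.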
“Increasing the context window in RAG does not fix the problem of aggregate queries; it only makes errors harder to detect.” — Source: Towards Data Science
For developers, the recommendation is clear: do not assume that an LLM with RAG can replace a database engine. Design systems that classify the query type and route each query to the appropriate engine. This improves accuracy and also reduces computational cost by avoiding unnecessary LLM processing of large contexts. Transparency matters too: if a system uses RAG for aggregations, it should warn the user that results are approximate. In the long term, the industry could standardize a query routing protocol, similar to how database systems use query optimizers. This article is a call to action for the AI community to adopt a more critical and pragmatic approach.