    Why enterprise RAG systems fail: Google study introduces ‘sufficient context’ solution

By Techurz | May 23, 2025 | 8 min read

A new study from Google researchers introduces “sufficient context,” a novel perspective for understanding and improving retrieval-augmented generation (RAG) systems in large language models (LLMs).

    This approach makes it possible to determine if an LLM has enough information to answer a query accurately, a critical factor for developers building real-world enterprise applications where reliability and factual correctness are paramount.

    The persistent challenges of RAG

    RAG systems have become a cornerstone for building more factual and verifiable AI applications. However, these systems can exhibit undesirable traits. They might confidently provide incorrect answers even when presented with retrieved evidence, get distracted by irrelevant information in the context, or fail to extract answers from long text snippets properly.

    The researchers state in their paper, “The ideal outcome is for the LLM to output the correct answer if the provided context contains enough information to answer the question when combined with the model’s parametric knowledge. Otherwise, the model should abstain from answering and/or ask for more information.”

    Achieving this ideal scenario requires building models that can determine whether the provided context can help answer a question correctly and use it selectively. Previous attempts to address this have examined how LLMs behave with varying degrees of information. However, the Google paper argues that “while the goal seems to be to understand how LLMs behave when they do or do not have sufficient information to answer the query, prior work fails to address this head-on.”

    Sufficient context

    To tackle this, the researchers introduce the concept of “sufficient context.” At a high level, input instances are classified based on whether the provided context contains enough information to answer the query. This splits contexts into two cases:

    Sufficient Context: The context has all the necessary information to provide a definitive answer.

    Insufficient Context: The context lacks the necessary information. This could be because the query requires specialized knowledge not present in the context, or the information is incomplete, inconclusive or contradictory.


    This designation is determined by looking at the question and the associated context without needing a ground-truth answer. This is vital for real-world applications where ground-truth answers are not readily available during inference.

    The researchers developed an LLM-based “autorater” to automate the labeling of instances as having sufficient or insufficient context. They found that Google’s Gemini 1.5 Pro model, with a single example (1-shot), performed best in classifying context sufficiency, achieving high F1 scores and accuracy.

    The paper notes, “In real-world scenarios, we cannot expect candidate answers when evaluating model performance. Hence, it is desirable to use a method that works using only the query and context.”
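
To make the idea concrete, here is a minimal sketch of what such an autorater step could look like in code. The prompt wording, the one-shot example and the `call_llm` placeholder are illustrative assumptions, not the exact setup from the paper:

```python
from dataclasses import dataclass

# Any text-in, text-out LLM client works here; the paper used Gemini 1.5 Pro
# with a 1-shot prompt, but call_llm is a placeholder you would implement.
def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

AUTORATER_PROMPT = """Judge whether the context is sufficient to answer the question.
The context is SUFFICIENT if it contains all the information needed for a definitive answer.
It is INSUFFICIENT if the needed information is missing, incomplete, inconclusive or contradictory.

Question: Who designed the Eiffel Tower?
Context: The Eiffel Tower, designed by the firm of Gustave Eiffel, was completed in 1889.
Label: SUFFICIENT

Question: {question}
Context: {context}
Label:"""

@dataclass
class LabeledInstance:
    question: str
    context: str
    sufficient: bool

def label_instance(question: str, context: str) -> LabeledInstance:
    """Classify context sufficiency from the query and context alone (no ground-truth answer)."""
    verdict = call_llm(AUTORATER_PROMPT.format(question=question, context=context))
    return LabeledInstance(question, context, verdict.strip().upper().startswith("SUFFICIENT"))
```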

    Key findings on LLM behavior with RAG

    Analyzing various models and datasets through this lens of sufficient context revealed several important insights.

    As expected, models generally achieve higher accuracy when the context is sufficient. However, even with sufficient context, models tend to hallucinate more often than they abstain. When the context is insufficient, the situation becomes more complex, with models exhibiting both higher rates of abstention and, for some models, increased hallucination.

    Interestingly, while RAG generally improves overall performance, additional context can also reduce a model’s ability to abstain from answering when it doesn’t have sufficient information. “This phenomenon may arise from the model’s increased confidence in the presence of any contextual information, leading to a higher propensity for hallucination rather than abstention,” the researchers suggest.

A particularly curious observation was that models can sometimes provide correct answers even when the provided context is deemed insufficient. While a natural assumption is that the models already “know” the answer from their pre-training (parametric knowledge), the researchers found other contributing factors. For example, the context might help disambiguate a query or bridge gaps in the model’s knowledge, even if it doesn’t contain the full answer. This ability to sometimes succeed even with limited external information has broader implications for RAG system design.


    Cyrus Rashtchian, co-author of the study and senior research scientist at Google, elaborates on this, emphasizing that the quality of the base LLM remains critical. “For a really good enterprise RAG system, the model should be evaluated on benchmarks with and without retrieval,” he told VentureBeat. He suggested that retrieval should be viewed as “augmentation of its knowledge,” rather than the sole source of truth. The base model, he explains, “still needs to fill in gaps, or use context clues (which are informed by pre-training knowledge) to properly reason about the retrieved context. For example, the model should know enough to know if the question is under-specified or ambiguous, rather than just blindly copying from the context.”

    Reducing hallucinations in RAG systems

Given the finding that models may hallucinate rather than abstain, especially with RAG compared to a no-RAG setting, the researchers explored techniques to mitigate this behavior.

    They developed a new “selective generation” framework. This method uses a smaller, separate “intervention model” to decide whether the main LLM should generate an answer or abstain, offering a controllable trade-off between accuracy and coverage (the percentage of questions answered).

    This framework can be combined with any LLM, including proprietary models like Gemini and GPT. The study found that using sufficient context as an additional signal in this framework leads to significantly higher accuracy for answered queries across various models and datasets. This method improved the fraction of correct answers among model responses by 2–10% for Gemini, GPT, and Gemma models.
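
The paper’s exact intervention model isn’t reproduced here, but the overall shape of selective generation is easy to sketch. In the illustrative wrapper below, `answer_fn`, `intervention_fn` and `sufficiency_fn` are placeholder components you would supply; only the general pattern follows the study’s description:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class SelectiveGenerator:
    """A small 'intervention model' scores each query and decides whether the
    main LLM answers or abstains. Features and scorer here are illustrative."""
    answer_fn: Callable[[str, str], str]        # main LLM: (question, context) -> answer
    intervention_fn: Callable[[dict], float]    # small model: features -> estimated P(correct)
    sufficiency_fn: Callable[[str, str], bool]  # autorater or heuristic sufficiency signal
    threshold: float = 0.5                      # raise for accuracy, lower for coverage

    def __call__(self, question: str, context: str) -> str:
        features = {"sufficient_context": float(self.sufficiency_fn(question, context))}
        if self.intervention_fn(features) >= self.threshold:
            return self.answer_fn(question, context)
        return "I don't know."  # abstain instead of risking a hallucination
```

Sweeping the threshold traces out the accuracy-versus-coverage trade-off the researchers describe: a higher threshold answers fewer questions but gets more of them right.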

    To put this 2-10% improvement into a business perspective, Rashtchian offers a concrete example from customer support AI. “You could imagine a customer asking about whether they can have a discount,” he said. “In some cases, the retrieved context is recent and specifically describes an ongoing promotion, so the model can answer with confidence. But in other cases, the context might be ‘stale,’ describing a discount from a few months ago, or maybe it has specific terms and conditions. So it would be better for the model to say, ‘I am not sure,’ or ‘You should talk to a customer support agent to get more information for your specific case.’”

    The team also investigated fine-tuning models to encourage abstention. This involved training on examples where the original ground-truth answer was replaced with “I don’t know,” particularly for instances with insufficient context. The intuition was that explicit training on such examples could steer the model to abstain rather than hallucinate.
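
As a rough illustration of how such a training set could be assembled, reusing the autorater labels from earlier (the paper’s exact prompt template and mixing ratio may differ):

```python
import random

def build_abstention_finetune_set(examples, sufficient_flags, abstain_rate=1.0):
    """Swap the ground-truth answer for an abstention on insufficient-context examples.
    examples: dicts with 'question', 'context', 'answer'; sufficient_flags: autorater labels."""
    finetune_set = []
    for ex, sufficient in zip(examples, sufficient_flags):
        target = ex["answer"]
        if not sufficient and random.random() < abstain_rate:
            target = "I don't know"  # teach the model to abstain when context falls short
        finetune_set.append({
            "prompt": f"Context: {ex['context']}\n\nQuestion: {ex['question']}\nAnswer:",
            "completion": f" {target}",
        })
    return finetune_set
```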

    The results were mixed: fine-tuned models often had a higher rate of correct answers but still hallucinated frequently, often more than they abstained. The paper concludes that while fine-tuning might help, “more work is needed to develop a reliable strategy that can balance these objectives.”

    Applying sufficient context to real-world RAG systems

    For enterprise teams looking to apply these insights to their own RAG systems, such as those powering internal knowledge bases or customer support AI, Rashtchian outlines a practical approach. He suggests first collecting a dataset of query-context pairs that represent the kind of examples the model will see in production. Next, use an LLM-based autorater to label each example as having sufficient or insufficient context. 

    “This already will give a good estimate of the % of sufficient context,” Rashtchian said. “If it is less than 80-90%, then there is likely a lot of room to improve on the retrieval or knowledge base side of things — this is a good observable symptom.”

    Rashtchian advises teams to then “stratify model responses based on examples with sufficient vs. insufficient context.” By examining metrics on these two separate datasets, teams can better understand performance nuances. 

    “For example, we saw that models were more likely to provide an incorrect response (with respect to the ground truth) when given insufficient context. This is another observable symptom,” he notes, adding that “aggregating statistics over a whole dataset may gloss over a small set of important but poorly handled queries.”
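
A minimal sketch of that stratified readout, assuming each evaluation record has already been scored for correctness and abstention and labeled by the autorater:

```python
def stratified_report(records):
    """Report accuracy and abstention separately for sufficient- and insufficient-context examples.
    Each record is a dict: {'sufficient': bool, 'correct': bool, 'abstained': bool}."""
    for name, flag in (("sufficient", True), ("insufficient", False)):
        bucket = [r for r in records if r["sufficient"] is flag]
        if not bucket:
            continue
        n = len(bucket)
        print(f"{name} context: {n} examples ({100 * n / len(records):.0f}% of set), "
              f"accuracy {100 * sum(r['correct'] for r in bucket) / n:.1f}%, "
              f"abstention {100 * sum(r['abstained'] for r in bucket) / n:.1f}%")
```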

    While an LLM-based autorater demonstrates high accuracy, enterprise teams might wonder about the additional computational cost. Rashtchian clarified that the overhead can be managed for diagnostic purposes. 

    “I would say running an LLM-based autorater on a small test set (say 500-1000 examples) should be relatively inexpensive, and this can be done ‘offline’ so there’s no worry about the amount of time it takes,” he said. For real-time applications, he concedes, “it would be better to use a heuristic, or at least a smaller model.” The crucial takeaway, according to Rashtchian, is that “engineers should be looking at something beyond the similarity scores, etc, from their retrieval component. Having an extra signal, from an LLM or a heuristic, can lead to new insights.”
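
Such a heuristic is not specified in the study; one possible cheap stand-in for latency-sensitive paths, combining the retriever’s own score with lexical coverage of the question, might look like this (thresholds would need calibrating against autorater labels offline):

```python
def quick_sufficiency_heuristic(question: str, context: str, retrieval_score: float,
                                score_floor: float = 0.75, overlap_floor: float = 0.5) -> bool:
    """Rough proxy for context sufficiency: require a decent retrieval score plus
    lexical coverage of the question's content words. Placeholder thresholds."""
    terms = {t for t in question.lower().split() if len(t) > 3}
    if not terms:
        return retrieval_score >= score_floor
    coverage = sum(t in context.lower() for t in terms) / len(terms)
    return retrieval_score >= score_floor and coverage >= overlap_floor
```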
