SealQA: Raising the Bar for Reasoning in Search-Augmented Language Models
Abstract
SealQA evaluates search-augmented language models' performance on fact-seeking questions with conflicting or noisy search results, revealing limitations in reasoning and factual accuracy.
We introduce SealQA, a new challenge benchmark for evaluating SEarch-Augmented Language models on fact-seeking questions where web search yields conflicting, noisy, or unhelpful results. SealQA comes in three flavors: (1) Seal-0 (main) and (2) Seal-Hard, which assess factual accuracy and reasoning capabilities, with Seal-0 focusing on the most challenging questions where chat models (e.g., GPT-4.1) typically achieve near-zero accuracy; and (3) LongSeal, which extends SealQA to test long-context, multi-document reasoning in "needle-in-a-haystack" settings. Our evaluation reveals critical limitations in current models: Even frontier LLMs perform poorly across all SealQA flavors. On Seal-0, frontier agentic models equipped with tools like o3 and o4-mini achieve only 17.1% and 6.3% accuracy, respectively, at their best reasoning efforts. We find that advanced reasoning models such as DeepSeek-R1-671B and o3-mini are highly vulnerable to noisy search results. Notably, increasing test-time compute does not yield reliable gains across o3-mini, o4-mini, and o3, with performance often plateauing or even declining early. Additionally, while recent models are less affected by the "lost-in-the-middle" issue, they still fail to reliably identify relevant documents in LongSeal when faced with numerous distractors. To facilitate future work, we release SealQA at huggingface.co/datasets/vtllms/sealqa.
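For readers who want to try the benchmark, here is a minimal sketch of loading SealQA from the Hub and scoring predictions with a crude exact-match check. The config name `seal_0`, the `test` split, and the field names `question`/`answer` are assumptions for illustration only; consult the dataset card at huggingface.co/datasets/vtllms/sealqa for the actual schema and the paper for the official scoring protocol.

```python
# Minimal sketch: load SealQA and compute exact-match accuracy.
# Assumptions (not confirmed by the paper): config name "seal_0",
# split "test", and field names "question"/"answer".
from datasets import load_dataset


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a crude exact-match comparison."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predictions: dict[str, str], examples) -> float:
    """Fraction of examples whose predicted answer matches the gold answer."""
    correct = 0
    for ex in examples:
        pred = predictions.get(ex["question"], "")
        correct += int(normalize(pred) == normalize(ex["answer"]))
    return correct / len(examples)


if __name__ == "__main__":
    seal0 = load_dataset("vtllms/sealqa", "seal_0", split="test")  # assumed config/split
    # `predictions` would map each question to a model's answer string;
    # empty strings here are placeholders for real model outputs.
    predictions = {ex["question"]: "" for ex in seal0}
    print(f"Exact-match accuracy: {exact_match_accuracy(predictions, seal0):.1%}")
```

Note that the paper's reported accuracies may rely on more lenient answer matching or LLM-based judging; exact match is only a stand-in here.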
Community
SealQA: A challenge benchmark for retrieval-augmented generation / tool-use LLMs, where questions trigger conflicting, ambiguous, or unhelpful web search results.
Key takeaways:
- Frontier LLMs struggle on Seal-0 (SealQA's core set), where most chat LLMs (incl. GPT-4.1 w/ browsing) achieve near-zero accuracy.
- More test-time compute does not yield reliable gains: o-series models often plateau or decline early.
- Advanced reasoning models (e.g., DeepSeek-R1) can be highly vulnerable to noisy search results.
- "Lost-in-the-middle" is less of an issue, but models still fail to reliably identify relevant docs amid distractors.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- TAGS: A Test-Time Generalist-Specialist Framework with Retrieval-Augmented Reasoning and Verification (2025)
- FinDER: Financial Dataset for Question Answering and Evaluating Retrieval-Augmented Generation (2025)
- UrduFactCheck: An Agentic Fact-Checking Framework for Urdu with Evidence Boosting and Benchmarking (2025)
- Collab-RAG: Boosting Retrieval-Augmented Generation for Complex Question Answering via White-Box and Black-Box LLM Collaboration (2025)
- Scaling Reasoning can Improve Factuality in Large Language Models (2025)
- Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval and haystacks (2025)
- Resolving Conflicting Evidence in Automated Fact-Checking: A Study on Retrieval-Augmented LLMs (2025)