Abstract
Language models produce incorrect statements because training and evaluation procedures reward guessing over acknowledging uncertainty, which points to a need for socio-technical changes to benchmark scoring.
Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.
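To make the incentive argument concrete: under standard right-or-wrong grading, any guess with a nonzero chance of being correct has a higher expected score than abstaining, whereas a scheme that penalizes wrong answers can make "I don't know" the better choice. Here is a minimal Python sketch of that arithmetic; the penalty value and function names are illustrative assumptions, not the paper's exact proposal.

```python
# Minimal sketch (not the paper's exact proposal): expected benchmark score of a
# model that is only p-confident in its best guess, under two grading schemes.

def expected_score_binary(p: float) -> float:
    """Standard right-or-wrong grading: a guess earns 1 with probability p; abstaining earns 0."""
    return p * 1.0


def expected_score_penalized(p: float, penalty: float) -> float:
    """Hypothetical penalized grading: +1 if correct, -penalty if wrong, 0 for abstaining."""
    return p * 1.0 - (1.0 - p) * penalty


if __name__ == "__main__":
    p = 0.3  # model is only 30% confident in its best guess
    print(expected_score_binary(p))                    # 0.3  -> guessing beats abstaining (0)
    print(round(expected_score_penalized(p, 1.0), 2))  # -0.4 -> abstaining (0) now beats guessing
```

With only 30% confidence, guessing is the better strategy under binary grading but the worse one under the penalized scheme; that shift in incentive is what the abstract's proposed change to benchmark scoring aims for.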
Community
Whilst this may serve a powerful purpose, we should be careful about the stage at which it is deployed, for it may be a philosophical dead end. The idea of "Certainty-Constrained Learning" is an oxymoron.
Uncertainty is the precursor of Certainty. You cannot be certain without first being uncertain. Learning is the transition from uncertainty to certainty. "Hallucination" is the path. Without that path, there can be no learning.
If we constrain a student to answer only when he is certain, then he will forever be uncertain.
I believe "hallucination" is not something to eradicate, for it is the very thing that enables learning in the first place.
I think we need to think very carefully about where, when, and why to deploy this method. That alone would be a paper in its own right.
That's not to detract from the research, which is solid. Just something to think about.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Theoretical Foundations and Mitigation of Hallucination in Large Language Models (2025)
- Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models (2025)
- Real-Time Detection of Hallucinated Entities in Long-Form Generation (2025)
- Exploring and Mitigating Fawning Hallucinations in Large Language Models (2025)
- ConfTuner: Training Large Language Models to Express Their Confidence Verbally (2025)
- Do Retrieval Augmented Language Models Know When They Don't Know? (2025)
- Harnessing RLHF for Robust Unanswerability Recognition and Trustworthy Response Generation in LLMs (2025)
I'd argue that LLMs always hallucinate; it's how they retrieve information, by making up the next word based on probabilities. It's just that sometimes the retrieved info is correct and sometimes it's made up. For the LLM itself it's the same thing, as it does not "know" anything.
IMHO, as long as the architecture doesn't change, we can only pseudo-patch it by doing a multi-model dance where one model with a different temperature setting kinda plays the supervisor and checks the other's output. Or obviously more sophisticated methods, which my small brain can't comprehend.
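A rough sketch of that "multi-model dance", assuming a hypothetical `generate(model, prompt, temperature)` helper rather than any real inference API; the model names and prompt wording are placeholders, not a tested recipe.

```python
# Rough sketch: a second model, run colder, acts as a supervisor that checks the
# first model's answer. `generate()` is a hypothetical placeholder for whatever
# inference API you actually use.

def generate(model: str, prompt: str, temperature: float) -> str:
    raise NotImplementedError("plug in your own inference call here")


def answer_with_supervisor(question: str) -> str:
    draft = generate("worker-model", question, temperature=0.8)
    verdict = generate(
        "supervisor-model",
        f"Question: {question}\nProposed answer: {draft}\n"
        "Reply SUPPORTED if the answer is well supported, otherwise UNSUPPORTED.",
        temperature=0.0,  # the supervisor runs with low temperature
    )
    if "UNSUPPORTED" in verdict.upper():
        return "I'm not sure about this one."
    return draft
```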
(...)
I kinda wonder if SLMs (small language models) might be a better way: have small models that are real experts on only a subset of the knowledge, like one that is good at Python, one that is good at Ruby, one that is good at cars, anything, really.
Then have models that can route to the SLMs when needed. If it's a math problem, route it to the SLM that is an expert in R or whatever, and let that one write the code to solve the problem. If it's an image classification problem, route it to another SLM (a rough sketch of this routing idea is below).
I kinda saw this in ChatGPT, where it actually pulled out Python to solve some simple math: it made a small Python script to calculate the solution.
I think this would solve a ton of issues: the models get bigger and bigger, and where are we gonna put them? Do we even need (made-up example) poetics or knowledge about music theory when we want to solve integrals?
I know there are already attempts at mixture-of-experts models that kinda do this, but I believe it would be beneficial to extend that in a big way. I think there is still a technical limitation just from the size of the VRAM: even for open-source models like Kimi K2, the quantized versions need something like 1.2 TB of VRAM, so I don't want to know what GPT-5 needs to run. So having multiple smaller models in a "microservice-like" setup should solve a ton of those issues.
They don't even need to be SLMs in that sense; we could really train the experts and make each a trillion-parameter model just for understanding coding paradigms or coding in a specific language.
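As promised above, a rough sketch of the routing idea. In practice the router would itself be a model, but a keyword-based stand-in shows the shape of it; the expert names and keyword rules are made-up placeholders.

```python
# Rough sketch of the routing idea: a small router decides which expert model
# should handle a prompt. Expert names and keyword rules are placeholders.

EXPERTS = {
    "python": "slm-python-expert",
    "ruby": "slm-ruby-expert",
    "math": "slm-math-expert",
    "image": "slm-vision-expert",
}

def route(prompt: str) -> str:
    """Return the name of the expert model that should handle the prompt."""
    text = prompt.lower()
    for keyword, expert in EXPERTS.items():
        if keyword in text:
            return expert
    return "general-model"  # fall back to a generalist when no expert matches

print(route("write a python script to sum a list"))  # slm-python-expert
print(route("classify this image of a cat"))         # slm-vision-expert
```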
But I guess I just diverted from the actual point of this paper... sorry for that.
Choosing a "microservice" architecture over a monolithic architecture has its own pros and cons, but the biggest downside is that you can kiss cross-domain generation goodbye. That's a deal breaker. We'll take hallucination over uni-domain generation every day of the week. Cross-domain generation is what makes these models so powerful.
I predict that the best approach will be a democracy of models: many monolithic models, trained on different mixtures and combinations of overlapping datasets, that cast votes, perhaps with some natural-selection pressures thrown in for good measure. Humanity has yet to solve the problem of hallucination in the real world, so why should we believe that computer scientists will succeed where every human in existence has failed? Hence, democracy.
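A rough sketch of such a vote, assuming a hypothetical `ask(model, question)` helper; the agreement threshold is an arbitrary illustrative choice, and low agreement is treated as "don't know".

```python
# Rough sketch of a "democracy of models": several independently trained models
# answer the same question and the majority answer wins; weak agreement is
# treated as "don't know". `ask()` is a hypothetical placeholder.

from collections import Counter

def ask(model: str, question: str) -> str:
    raise NotImplementedError("plug in your own inference call here")


def vote(question: str, models: list[str], min_agreement: float = 0.5) -> str:
    answers = [ask(m, question) for m in models]
    best, count = Counter(answers).most_common(1)[0]
    if count / len(models) <= min_agreement:
        return "No consensus -- I don't know."
    return best
```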