Abstract
Language models produce incorrect statements because training and evaluation procedures reward guessing over acknowledging uncertainty, which points to a need for socio-technical changes to benchmark scoring.
Like students facing hard exam questions, large language models sometimes guess when uncertain, producing plausible yet incorrect statements instead of admitting uncertainty. Such "hallucinations" persist even in state-of-the-art systems and undermine trust. We argue that language models hallucinate because the training and evaluation procedures reward guessing over acknowledging uncertainty, and we analyze the statistical causes of hallucinations in the modern training pipeline. Hallucinations need not be mysterious -- they originate simply as errors in binary classification. If incorrect statements cannot be distinguished from facts, then hallucinations in pretrained language models will arise through natural statistical pressures. We then argue that hallucinations persist due to the way most evaluations are graded -- language models are optimized to be good test-takers, and guessing when uncertain improves test performance. This "epidemic" of penalizing uncertain responses can only be addressed through a socio-technical mitigation: modifying the scoring of existing benchmarks that are misaligned but dominate leaderboards, rather than introducing additional hallucination evaluations. This change may steer the field toward more trustworthy AI systems.
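To make the incentive argument concrete: under standard right-or-wrong grading, any guess with a nonzero chance of being correct has a higher expected score than abstaining, whereas a scheme that penalizes wrong answers can make "I don't know" the better choice. Here is a minimal Python sketch of that arithmetic; the penalty value and function names are illustrative assumptions, not the paper's exact proposal.

```python
# Minimal sketch (not the paper's exact proposal): expected benchmark score of a
# model that is only p-confident in its best guess, under two grading schemes.

def expected_score_binary(p: float) -> float:
    """Standard right-or-wrong grading: a guess earns 1 with probability p; abstaining earns 0."""
    return p * 1.0


def expected_score_penalized(p: float, penalty: float) -> float:
    """Hypothetical penalized grading: +1 if correct, -penalty if wrong, 0 for abstaining."""
    return p * 1.0 - (1.0 - p) * penalty


if __name__ == "__main__":
    p = 0.3  # model is only 30% confident in its best guess
    print(expected_score_binary(p))                    # 0.3  -> guessing beats abstaining (0)
    print(round(expected_score_penalized(p, 1.0), 2))  # -0.4 -> abstaining (0) now beats guessing
```

With only 30% confidence, guessing is the better strategy under binary grading but the worse one under the penalized scheme; that shift in incentive is what the abstract's proposed change to benchmark scoring aims for.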
Community
Whilst this may serve a powerful purpose, we should be careful about the stage at which it is deployed, for it may be a philosophical dead end. The idea of "Certainty-Constrained Learning" is an oxymoron.
Uncertainty is the precursor of Certainty. You cannot be certain without first being uncertain. Learning is the transition from uncertainty to certainty. "Hallucination" is the path. Without that path, there can be no learning.
If we constrain a student to answer only when he is certain, then he will forever be uncertain.
I believe "hallucination" is not something to eradicate, for it is the very thing that enables learning in the first place.
I think we need to think very carefully about where, when, and why to deploy this method. That alone would be a paper in its own right.
That's not to detract from the research, which is solid. Just something to think about.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Theoretical Foundations and Mitigation of Hallucination in Large Language Models (2025)
- Counterfactual Probing for Hallucination Detection and Mitigation in Large Language Models (2025)
- Real-Time Detection of Hallucinated Entities in Long-Form Generation (2025)
- Exploring and Mitigating Fawning Hallucinations in Large Language Models (2025)
- ConfTuner: Training Large Language Models to Express Their Confidence Verbally (2025)
- Do Retrieval Augmented Language Models Know When They Don't Know? (2025)
- Harnessing RLHF for Robust Unanswerability Recognition and Trustworthy Response Generation in LLMs (2025)
I'd argue that LLMs always hallucinate; it's how they retrieve information, by making up the next word based on probabilities. It's just that sometimes the retrieved info is correct and sometimes it's made up. For the LLM itself it's the same thing, as it does not "know" anything.
IMHO, as long as the architecture doesn't change, we can only pseudo-patch it by doing a multi-model dance where one model with a different temperature setting kinda plays the supervisor and checks the other's output. Or obviously more sophisticated methods, which my small brain can't comprehend.
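A rough sketch of that "multi-model dance", assuming a hypothetical `generate(model, prompt, temperature)` helper rather than any real inference API; the model names and prompt wording are placeholders, not a tested recipe.

```python
# Rough sketch: a second model, run colder, acts as a supervisor that checks the
# first model's answer. `generate()` is a hypothetical placeholder for whatever
# inference API you actually use.

def generate(model: str, prompt: str, temperature: float) -> str:
    raise NotImplementedError("plug in your own inference call here")


def answer_with_supervisor(question: str) -> str:
    draft = generate("worker-model", question, temperature=0.8)
    verdict = generate(
        "supervisor-model",
        f"Question: {question}\nProposed answer: {draft}\n"
        "Reply SUPPORTED if the answer is well supported, otherwise UNSUPPORTED.",
        temperature=0.0,  # the supervisor runs with low temperature
    )
    if "UNSUPPORTED" in verdict.upper():
        return "I'm not sure about this one."
    return draft
```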
(...)
I kinda wonder if SLMs (small language models) might be a better way: have small models that are real experts on only a subset of the knowledge, like one that is good at Python, one that is good at Ruby, one that is good at cars, anything, really.
Then have models that can route to the SLMs when needed. If it's a math problem, route it to the SLM that is an expert in R or whatever, and let that one write the code to solve the problem. If it's an image classification problem, route it to another SLM (a rough sketch of this routing idea is below).
I kinda saw this in ChatGPT, where it actually pulled out Python to solve some simple math: it made a small Python script to calculate the solution.
I think this would solve a ton of issues: the models get bigger and bigger, and where are we gonna put them? Do we even need (made-up example) poetics or knowledge about music theory when we want to solve integrals?
I know there are already attempts at mixture-of-experts models that kinda do this, but I believe it would be beneficial to extend that in a big way. I think there is still a technical limitation just from the size of the VRAM: even for open-source models like Kimi K2, the quantized versions need something like 1.2 TB of VRAM, so I don't want to know what GPT-5 needs to run. So having multiple smaller models in a "microservice-like" setup should solve a ton of those issues.
They don't even need to be SLMs in that sense; we could really train the experts and make each a trillion-parameter model just for understanding coding paradigms or coding in a specific language.
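As promised above, a rough sketch of the routing idea. In practice the router would itself be a model, but a keyword-based stand-in shows the shape of it; the expert names and keyword rules are made-up placeholders.

```python
# Rough sketch of the routing idea: a small router decides which expert model
# should handle a prompt. Expert names and keyword rules are placeholders.

EXPERTS = {
    "python": "slm-python-expert",
    "ruby": "slm-ruby-expert",
    "math": "slm-math-expert",
    "image": "slm-vision-expert",
}

def route(prompt: str) -> str:
    """Return the name of the expert model that should handle the prompt."""
    text = prompt.lower()
    for keyword, expert in EXPERTS.items():
        if keyword in text:
            return expert
    return "general-model"  # fall back to a generalist when no expert matches

print(route("write a python script to sum a list"))  # slm-python-expert
print(route("classify this image of a cat"))         # slm-vision-expert
```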
But I guess I just diverted from the actual point of this paper... sorry for that.
Choosing a "microservice" architecture over a monolithic architecture has its own pros and cons, but the biggest downside is that you can kiss cross-domain generation goodbye. That's a deal breaker. We'll take hallucination over uni-domain generation every day of the week. Cross-domain generation is what makes these models so powerful.
I predict that the best approach will be a democracy of models: many monolithic models, trained on different mixtures and combinations of overlapping datasets, that cast votes, perhaps with some natural-selection pressures thrown in for good measure. Humanity has yet to solve the problem of hallucination in the real world, so why should we believe that computer scientists will succeed where every human in existence has failed? Hence, democracy.
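A rough sketch of such a vote, assuming a hypothetical `ask(model, question)` helper; the agreement threshold is an arbitrary illustrative choice, and low agreement is treated as "don't know".

```python
# Rough sketch of a "democracy of models": several independently trained models
# answer the same question and the majority answer wins; weak agreement is
# treated as "don't know". `ask()` is a hypothetical placeholder.

from collections import Counter

def ask(model: str, question: str) -> str:
    raise NotImplementedError("plug in your own inference call here")


def vote(question: str, models: list[str], min_agreement: float = 0.5) -> str:
    answers = [ask(m, question) for m in models]
    best, count = Counter(answers).most_common(1)[0]
    if count / len(models) <= min_agreement:
        return "No consensus -- I don't know."
    return best
```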