When Do LLMs Admit Their Mistakes? Understanding the Role of Model Belief in Retraction
Abstract
LLMs rarely retract incorrect answers they believe to be factually correct, but supervised fine-tuning can improve their retraction performance by refining their internal beliefs.
Can large language models (LLMs) admit their mistakes when they should know better? In this work, we define the behavior of acknowledging errors in previously generated answers as "retraction" and aim to understand when and why LLMs choose to retract. We first construct model-specific datasets to evaluate whether a model will retract an incorrect answer that contradicts its own parametric knowledge. While LLMs are capable of retraction, they do so only infrequently. We demonstrate that retraction is closely tied to previously identified indicators of models' internal belief: models fail to retract wrong answers that they "believe" to be factually correct. Steering experiments further demonstrate that internal belief causally influences model retraction. In particular, when the model does not believe its answer, it is not only more likely to attempt to verify the answer, but it also alters its attention behavior during self-verification. Finally, we demonstrate that simple supervised fine-tuning significantly improves retraction performance by helping the model learn more accurate internal beliefs. Code and datasets are available at https://github.com/ayyyq/llm-retraction.
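To make the abstract's steering experiments more concrete, below is a minimal sketch of one common way such an experiment can be run: estimate a "belief" direction from hidden states with a difference-of-means probe over contrastive prompts, then add a scaled multiple of that direction during generation and check whether the model retracts. This is an illustration under stated assumptions, not the authors' exact setup; the model name, layer index, steering strength, and the tiny contrastive prompt sets are all placeholders.

```python
# Sketch of a hidden-state steering experiment (assumptions: model, layer, alpha, prompts).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B-Instruct"   # assumption: any decoder-only LM with model.model.layers
LAYER = 16                                    # assumption: a mid-depth layer
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16, device_map="auto")
model.eval()

def mean_last_token_hidden(prompts, layer):
    """Average last-token hidden state after decoder layer `layer` over a list of prompts."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        vecs.append(out.hidden_states[layer + 1][0, -1])  # hidden_states[0] is the embedding layer
    return torch.stack(vecs).mean(dim=0)

# Hypothetical contrastive sets: answers the model presumably "believes" vs. does not.
believed = ["Q: What is the capital of France? A: Paris."]
not_believed = ["Q: What is the capital of France? A: Lyon."]
belief_dir = mean_last_token_hidden(believed, LAYER) - mean_last_token_hidden(not_believed, LAYER)
belief_dir = belief_dir / belief_dir.norm()

def steering_hook(alpha):
    # Add alpha * belief_dir to the layer's output hidden states at every generation step.
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * belief_dir.to(hidden.dtype)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Steer *against* the belief direction (alpha < 0) and inspect whether the continuation retracts.
prompt = "Q: What is the capital of France? A: Lyon. Wait, let me double-check:"
handle = model.model.layers[LAYER].register_forward_hook(steering_hook(alpha=-8.0))
try:
    ids = tok(prompt, return_tensors="pt").to(model.device)
    gen = model.generate(**ids, max_new_tokens=40, do_sample=False)
    print(tok.decode(gen[0][ids["input_ids"].shape[1]:], skip_special_tokens=True))
finally:
    handle.remove()
```

Comparing continuations with positive, zero, and negative alpha gives a rough causal read on whether the belief direction influences retraction behavior.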
Community
This is an automated message from the Librarian Bot. The following similar papers were retrieved via the Semantic Scholar API:
- Reasoning Models Know When They're Right: Probing Hidden States for Self-Verification (2025)
- BARREL: Boundary-Aware Reasoning for Factual and Reliable LRMs (2025)
- Do LLM Evaluators Prefer Themselves for a Reason? (2025)
- Between Underthinking and Overthinking: An Empirical Study of Reasoning Length and Correctness in LLMs (2025)
- Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time (2025)
- How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence (2025)
- Identifying and Mitigating the Influence of the Prior Distribution in Large Language Models (2025)