Reinforcing General Reasoning without Verifiers
Abstract
A verifier-free method (VeriFree) is introduced to extend reinforcement learning training of large language models to general reasoning domains, improving efficiency and performance compared to verifier-based methods.
The recent paradigm shift towards training large language models (LLMs) using DeepSeek-R1-Zero-style reinforcement learning (RL) on verifiable rewards has led to impressive advancements in code and mathematical reasoning. However, this methodology is limited to tasks where rule-based answer verification is possible and does not naturally extend to real-world domains such as chemistry, healthcare, engineering, law, biology, business, and economics. Current practical workarounds use an additional LLM as a model-based verifier; however, this introduces issues such as reliance on a strong verifier LLM, susceptibility to reward hacking, and the practical burden of maintaining the verifier model in memory during training. To address these issues and extend DeepSeek-R1-Zero-style training to general reasoning domains, we propose a verifier-free method (VeriFree) that bypasses answer verification and instead uses RL to directly maximize the probability of generating the reference answer. We compare VeriFree with verifier-based methods and demonstrate that, in addition to its significant practical benefits and reduced compute requirements, VeriFree matches and even surpasses verifier-based methods on extensive evaluations across MMLU-Pro, GPQA, SuperGPQA, and math-related benchmarks. Moreover, we provide insights into this method from multiple perspectives: as an elegant integration of training both the policy and implicit verifier in a unified model, and as a variational optimization approach. Code is available at https://github.com/sail-sg/VeriFree.
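To make the objective concrete, here is a minimal, self-contained sketch of a VeriFree-style loss on a toy autoregressive policy. This is not the authors' implementation (see the linked repository for that): `ToyLM`, `sample_trace`, `verifree_loss`, the vocabulary size, and the single-sample gradient estimator are all illustrative assumptions. Only the objective itself, maximizing the policy's own probability of the reference answer given a sampled reasoning trace, follows the abstract.

```python
# Illustrative VeriFree-style objective (a sketch, not the paper's code).
# Idea from the abstract: skip answer verification and instead maximize
#     E_{z ~ pi(.|x)} [ pi(y* | x, z) ],
# the policy's own probability of the reference answer y* given the question x
# and a sampled reasoning trace z. A single-sample gradient estimator combines
# a REINFORCE term on z (with reward r = pi(y*|x,z)) and a direct term on y*.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN = 32, 64  # toy sizes; a real run would use an LLM

class ToyLM(nn.Module):
    """Stand-in autoregressive policy: logits at position t predict token t+1."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def logits(self, tokens):  # (T,) int64 -> (T, VOCAB)
        return self.head(self.embed(tokens))

def sample_trace(policy, prompt, max_len=8):
    """Sample a reasoning trace z ~ pi(.|x), one token at a time."""
    tokens = prompt.clone()
    with torch.no_grad():
        for _ in range(max_len):
            probs = F.softmax(policy.logits(tokens)[-1], dim=-1)
            tokens = torch.cat([tokens, torch.multinomial(probs, 1)])
    return tokens[len(prompt):]

def seq_logprob(policy, context, continuation):
    """log pi(continuation | context), teacher-forced."""
    full = torch.cat([context, continuation])
    logits = policy.logits(full)[len(context) - 1 : -1]
    return F.log_softmax(logits, dim=-1).gather(1, continuation[:, None]).sum()

def verifree_loss(policy, prompt, reference_answer):
    z = sample_trace(policy, prompt)
    logp_z = seq_logprob(policy, prompt, z)                       # log pi(z|x)
    logp_y = seq_logprob(policy, torch.cat([prompt, z]), reference_answer)
    reward = logp_y.exp().detach()    # r = pi(y*|x,z); no gradient through r
    # -(REINFORCE term on the trace + direct likelihood term on the answer)
    return -(reward * logp_z + logp_y.exp())

if __name__ == "__main__":
    torch.manual_seed(0)
    policy = ToyLM()
    opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
    prompt = torch.randint(0, VOCAB, (5,))   # fake question tokens
    answer = torch.randint(0, VOCAB, (3,))   # fake reference-answer tokens
    loss = verifree_loss(policy, prompt, answer)
    loss.backward()
    opt.step()
    print(f"loss: {loss.item():.4f}")
```

One way to read this wiring: the same network both generates the reasoning trace and scores the reference answer, which matches the abstract's framing of training the policy and an implicit verifier in a unified model.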
Community
We improve LLMs' general reasoning without relying on verifiers.
Librarian Bot: the following similar papers were recommended by the Semantic Scholar API:
- General-Reasoner: Advancing LLM Reasoning Across All Domains (2025)
- NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning (2025)
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains (2025)
- Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards (2025)
- Learning to Reason without External Rewards (2025)
- Nemotron-CrossThink: Scaling Self-Learning beyond Math Reasoning (2025)
- Interleaved Reasoning for Large Language Models via Reinforcement Learning (2025)