Abstract
Intuitor, a Reinforcement Learning from Internal Feedback (RLIF) method, uses self-certainty as its sole reward signal to enable unsupervised learning of large language models, matching GRPO on mathematical benchmarks while generalizing better to out-of-domain tasks.
Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor.
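To make the idea concrete, the sketch below shows one way a per-response self-certainty score could be computed and plugged into a GRPO-style group-relative advantage. This is a minimal illustration, not the authors' implementation: the exact self-certainty formula (here assumed to be the average KL divergence from a uniform distribution over the vocabulary to the model's next-token distribution) and the helper names `self_certainty` and `group_relative_advantage` are assumptions; see the official repository linked above for the actual code.

```python
# Minimal sketch (not the authors' implementation) of using self-certainty as a
# reward inside a GRPO-style update. Assumes self-certainty is the average
# per-token KL(Uniform || p_model) over the response -- an assumption based on
# the abstract's description of "the model's own confidence".
import math

import torch
import torch.nn.functional as F


def self_certainty(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """One scalar self-certainty score per sampled response.

    logits:        (batch, seq_len, vocab) next-token logits for the sampled responses
    response_mask: (batch, seq_len) 1.0 on response tokens, 0.0 on prompt/padding
    """
    vocab = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)  # log p(token | prefix)
    # KL(U || p) = -log|V| - (1/|V|) * sum_j log p_j, computed at each position
    kl_per_token = -math.log(vocab) - log_probs.mean(dim=-1)  # (batch, seq_len)
    # Average over response tokens only
    return (kl_per_token * response_mask).sum(dim=-1) / response_mask.sum(dim=-1).clamp(min=1)


def group_relative_advantage(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantage: standardize rewards across the group of responses
    sampled for the same prompt (rewards has shape (group_size,))."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)
```

In this framing, the only change relative to standard GRPO is the reward source: instead of a verifier or gold answer, each sampled response is scored by `self_certainty`, and the resulting scores are standardized within the group before the usual clipped policy-gradient step.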
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Right Question is Already Half the Answer: Fully Unsupervised LLM Reasoning Incentivization (2025)
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? (2025)
- TTRL: Test-Time Reinforcement Learning (2025)
- Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards (2025)
- NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning (2025)
- Reward Reasoning Model (2025)
- Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains (2025)