arxiv:2505.19590

Learning to Reason without External Rewards

Published on May 26 · Submitted by Xuandong on May 27

AI-generated summary

Intuitor, a Reinforcement Learning from Internal Feedback method, uses self-certainty as a reward signal to enable unsupervised learning of large language models, achieving performance comparable to GRPO on benchmarks and superior generalization.

Abstract

Training large language models (LLMs) for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor.
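
The core mechanism described above, replacing GRPO's verifiable reward with a confidence score computed from the policy's own logits, can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the particular self-certainty formula assumed here (the KL divergence of the next-token distribution from a uniform distribution, averaged over generated tokens) and the function names are assumptions; the official code is in the repository linked above.

```python
import math
import torch
import torch.nn.functional as F


def self_certainty(logits: torch.Tensor) -> torch.Tensor:
    """Score one sampled response by the policy's own confidence.

    logits: (seq_len, vocab_size) next-token logits recorded while the
    response was generated. Returns a scalar; peaked (confident)
    next-token distributions give higher scores.
    """
    log_p = F.log_softmax(logits, dim=-1)
    # KL(p || Uniform) = sum_j p_j log p_j + log|V|  (= log|V| - entropy of p)
    kl_from_uniform = (log_p.exp() * log_p).sum(dim=-1) + math.log(logits.size(-1))
    return kl_from_uniform.mean()


def group_relative_advantages(group_logits: list[torch.Tensor]) -> torch.Tensor:
    """GRPO-style advantages with the external reward replaced by self-certainty:
    score each response sampled for one prompt, then normalize within the group."""
    rewards = torch.stack([self_certainty(l) for l in group_logits])
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```

In this sketch, as in GRPO, several responses are sampled per prompt; each response is scored by the model's own confidence, and the scores are normalized within the group to form advantages, so no gold solutions, test cases, or external verifiers are needed.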

