Confidence Is All You Need: Few-Shot RL Fine-Tuning of Language Models
Abstract
Reinforcement Learning via Self-Confidence (RLSC) improves large language model accuracy using the model's confidence as a reward signal, eliminating the need for human labels or reward engineering.
Large language models (LLMs) excel at reasoning, yet post-training remains critical for aligning their behavior with task goals. Existing reinforcement learning (RL) methods often depend on costly human annotations or external reward models. We propose Reinforcement Learning via Self-Confidence (RLSC), which uses the model's own confidence as the reward signal, eliminating the need for labels, preference models, or reward engineering. Applied to Qwen2.5-Math-7B with only 16 samples per question and 10 or 20 training steps, RLSC improves accuracy by +13.4% on AIME2024, +21.2% on MATH500, +21.7% on Minerva Math, +20.8% on OlympiadBench, and +9.7% on AMC23. RLSC offers a simple, scalable post-training method for reasoning models, requiring only a small number of samples and no labelled supervision.
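To give a concrete picture of "confidence as the reward signal", here is a minimal, hypothetical sketch in PyTorch/Transformers style: sample several completions per question, reward each one by how often its extracted answer agrees with the other samples, and fine-tune on a reward-weighted log-likelihood. All names (`extract_answer`, `confidence_reward_step`) and hyperparameters are illustrative assumptions, not the paper's released code.

```python
# Minimal, hypothetical sketch of confidence-as-reward fine-tuning (not the paper's code).
# Assumes a Hugging Face causal LM `model`, its `tokenizer`, and a torch `optimizer`.
import re
from collections import Counter

import torch


def extract_answer(text: str) -> str:
    """Rough placeholder: take the last \\boxed{...} content, else the last line."""
    boxed = re.findall(r"\\boxed\{([^}]*)\}", text)
    return boxed[-1] if boxed else text.strip().splitlines()[-1]


def self_confidence_rewards(answers):
    """Reward each sampled answer by its empirical frequency in the sample set."""
    counts = Counter(answers)
    return [counts[a] / len(answers) for a in answers]


def confidence_reward_step(model, tokenizer, question, optimizer,
                           num_samples=16, max_new_tokens=512):
    device = next(model.parameters()).device
    prompt = tokenizer(question, return_tensors="pt").to(device)

    # 1) Sample a small group of completions from the current policy (no labels used).
    with torch.no_grad():
        sequences = model.generate(
            **prompt,
            do_sample=True,
            num_return_sequences=num_samples,
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )
    texts = tokenizer.batch_decode(sequences, skip_special_tokens=True)

    # 2) Self-confidence reward: how strongly each completion's answer agrees with the rest.
    rewards = self_confidence_rewards([extract_answer(t) for t in texts])

    # 3) Reward-weighted log-likelihood update over the sampled completions.
    #    (A real implementation would mask prompt/padding tokens in the loss.)
    loss = torch.zeros((), device=device)
    for seq, r in zip(sequences, rewards):
        out = model(input_ids=seq.unsqueeze(0), labels=seq.unsqueeze(0))
        loss = loss + r * out.loss
    loss = loss / num_samples

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

This sketch needs no ground-truth answers or reward model: the only supervision is agreement among the model's own samples.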
Community
Thanks, quite an interesting RL approach.
Smart
Hi, thanks for your question!
Our approach differs fundamentally from the work you mentioned in how confidence is defined. That approach, like existing methods (https://arxiv.org/abs/2505.20282), bases confidence on next-token prediction. Our definition instead measures the confidence of the entire response. We were inspired by TTRL (https://arxiv.org/pdf/2504.16084), whose theoretical core is reshaping the model's answer distribution; however, TTRL requires a complex pseudo-label construction to achieve this. In contrast, we propose a simple yet effective method that achieves the same adjustment of the answer distribution without the tedious step of constructing pseudo-labels.
Our approach is grounded in rigorous mathematical derivation and is implemented within the policy optimization framework.
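To make the "whole-response confidence" idea concrete, here is one plausible formalization (my own sketch under assumed notation, not necessarily the paper's exact derivation): let p_θ(y | x) denote the probability that a response sampled for question x yields final answer y; self-confidence can then be read as the chance that two independent samples agree.

```latex
% Sketch under assumed notation; y is the final answer extracted from a sampled response.
\[
F(\theta) \;=\; \mathbb{E}_{y_1,\,y_2 \sim p_\theta(\cdot \mid x)}\!\left[\mathbf{1}\{y_1 = y_2\}\right]
         \;=\; \sum_{y} p_\theta(y \mid x)^2 ,
\]
\[
\nabla_\theta F(\theta) \;=\; 2\,\mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\!\left[\,p_\theta(y \mid x)\,\nabla_\theta \log p_\theta(y \mid x)\right].
\]
```

Under this reading, each sampled answer is weighted by its own probability mass, estimated in practice by its frequency among a handful of samples from a frozen copy of the policy, which sharpens the answer distribution without any pseudo-labels.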
very nice
What do you mean by overconfidence? Bias toward specific benchmarks?
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Behavior Injection: Preparing Language Models for Reinforcement Learning (2025)
- Unsupervised Post-Training for Multi-Modal LLM Reasoning via GRPO (2025)
- Mixed-R1: Unified Reward Perspective For Reasoning Capability in Multimodal Large Language Models (2025)
- Learning to Clarify by Reinforcement Learning Through Reward-Weighted Fine-Tuning (2025)
- Bingo: Boosting Efficient Reasoning of LLMs via Dynamic and Significance-based Reinforcement Learning (2025)
- The Climb Carves Wisdom Deeper Than the Summit: On the Noisy Rewards in Learning to Reason (2025)
- KDRL: Post-Training Reasoning LLMs via Unified Knowledge Distillation and Reinforcement Learning (2025)
What a very clever way to do it. Excellent!