Exploring Data Scaling Trends and Effects in Reinforcement Learning from Human Feedback
Abstract
Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning large language models with human preferences. While recent research has focused on algorithmic improvements, the importance of prompt-data construction has been overlooked. This paper addresses this gap by exploring data-driven bottlenecks in RLHF performance scaling, particularly reward hacking and decreasing response diversity. We introduce a hybrid reward system combining reasoning task verifiers (RTV) and a generative reward model (GenRM) to mitigate reward hacking. We also propose a novel prompt-selection method, Pre-PPO, to maintain response diversity and enhance learning effectiveness. Additionally, we find that prioritizing mathematical and coding tasks early in RLHF training significantly improves performance. Experiments across two model sizes validate our methods' effectiveness and scalability. Results show that RTV is most resistant to reward hacking, followed by GenRM with ground truth, and then GenRM with SFT Best-of-N responses. Our strategies enable rapid capture of subtle task-specific distinctions, leading to substantial improvements in overall RLHF performance. This work highlights the importance of careful data construction and provides practical methods to overcome performance barriers in RLHF.
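The abstract describes a hybrid reward system that routes reasoning tasks to rule-based verifiers (RTV) and other prompts to a generative reward model (GenRM). The following is a minimal, hypothetical sketch of that routing idea; the function names, the task-type dispatch, and the token-overlap stand-in for a learned GenRM are all illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of hybrid reward routing: reasoning-task prompts
# (math/code) get a rule-based verifier reward (RTV); all other prompts
# get a generative-reward-model (GenRM) score. Names are illustrative.

def rtv_reward(response: str, ground_truth: str) -> float:
    """Binary verifier reward for reasoning tasks (e.g. exact-match check)."""
    return 1.0 if response.strip() == ground_truth.strip() else 0.0

def genrm_reward(response: str, reference: str) -> float:
    """Stand-in for a learned GenRM: token overlap with a reference.
    In the paper the reference is ground truth or an SFT Best-of-N response."""
    resp, ref = set(response.split()), set(reference.split())
    return len(resp & ref) / max(len(ref), 1)

def hybrid_reward(task_type: str, response: str, reference: str) -> float:
    # Route reasoning tasks to the verifier, everything else to the GenRM.
    if task_type in {"math", "code"}:
        return rtv_reward(response, reference)
    return genrm_reward(response, reference)
```

The routing reflects the abstract's finding that verifier-based rewards are the most resistant to reward hacking, so reasoning tasks bypass the learned reward model entirely.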
Community
Explore Data Scaling for RLHF from Seed Posttrain Team
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Reinforcement Learning is all You Need (2025)
- Reward Shaping to Mitigate Reward Hacking in RLHF (2025)
- Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in Large Language Models (2025)
- IPO: Your Language Model is Secretly a Preference Classifier (2025)
- AURORA: Automated Training Framework of Universal Process Reward Models via Ensemble Prompting and Reverse Verification (2025)
- Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones? (2025)
- A Survey of Direct Preference Optimization (2025)