CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models
Abstract
CPGD, a novel reinforcement learning algorithm, stabilizes policy learning in language models by constraining policy drift and clipping the logarithm of the probability ratio, improving both performance and training stability.
Recent advances in rule-based reinforcement learning (RL) have significantly improved the reasoning capabilities of language models (LMs) trained with rule-based rewards. However, existing RL methods -- such as GRPO, REINFORCE++, and RLOO -- often suffer from training instability, where large policy updates and improper clipping can lead to training collapse. To address this issue, we propose Clipped Policy Gradient Optimization with Policy Drift (CPGD), a novel algorithm designed to stabilize policy learning in LMs. CPGD introduces a policy drift constraint based on KL divergence to dynamically regularize policy updates, and leverages a clipping mechanism on the logarithm of the probability ratio to prevent excessive policy updates. We provide theoretical justification for CPGD and demonstrate through empirical analysis that it mitigates the instability observed in prior approaches. Furthermore, we show that CPGD significantly improves performance while maintaining training stability. Our implementation balances theoretical rigor with practical usability, offering a robust alternative for RL in the post-training of LMs. We release our code at https://github.com/ModalMinds/MM-EUREKA.
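To make the two ingredients concrete, below is a minimal PyTorch-style sketch of a CPGD-flavored token-level loss: the policy-gradient term clips the logarithm of the probability ratio, and a KL-style policy-drift penalty (estimated here with the standard k3 estimator) regularizes updates toward the sampling policy. All names and default values (cpgd_loss, clip_eps, drift_coef) are illustrative assumptions, not the authors' reference implementation; see the released code for the exact formulation.

```python
import torch

def cpgd_loss(logp_new, logp_old, advantages, clip_eps=0.2, drift_coef=0.04, mask=None):
    """Illustrative CPGD-style token-level loss (a sketch, not the reference implementation).

    logp_new:   log-probs of the sampled tokens under the current policy (requires grad)
    logp_old:   log-probs of the same tokens under the sampling (old) policy
    advantages: per-token (or broadcast per-sequence) advantage estimates
    mask:       optional response mask (1 for valid tokens, 0 for padding)
    """
    log_ratio = logp_new - logp_old.detach()

    # Clip the *logarithm* of the probability ratio: outside [-eps, eps] the term is
    # constant, so its gradient vanishes and oversized policy updates are suppressed.
    pg_term = torch.clamp(log_ratio, -clip_eps, clip_eps) * advantages.detach()

    # Policy drift: k3-style estimator of KL(pi_old || pi_theta) from rollout samples,
    # exp(x) - x - 1 with x = log_ratio; non-negative and zero when the policies match.
    drift = torch.exp(log_ratio) - log_ratio - 1.0

    per_token_loss = -(pg_term - drift_coef * drift)
    if mask is not None:
        return (per_token_loss * mask).sum() / mask.sum().clamp(min=1)
    return per_token_loss.mean()
```

Clipping the log ratio (rather than the ratio itself, as in PPO-style objectives) bounds the importance weight symmetrically in log space, while the drift penalty keeps the updated policy close to the rollout policy even when the clip is inactive.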
Community
We propose CPGD, a novel rule-based RL algorithm that addresses the training instability of existing RL methods.
Paper: https://arxiv.org/abs/2505.12504
Code: https://github.com/ModalMinds/MM-EUREKA
Librarian Bot (automated): the following similar papers were recommended by the Semantic Scholar API.
- GPG: A Simple and Strong Reinforcement Learning Baseline for Model Reasoning (2025)
- IN-RIL: Interleaved Reinforcement and Imitation Learning for Policy Fine-Tuning (2025)
- Kalman Filter Enhanced GRPO for Reinforcement Learning-Based Language Model Reasoning (2025)
- Learning to Reason under Off-Policy Guidance (2025)
- DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training (2025)
- Not All Rollouts are Useful: Down-Sampling Rollouts in LLM Reinforcement Learning (2025)
- Entropy-guided sequence weighting for efficient exploration in RL-based LLM fine-tuning (2025)