CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models
We propose CPGD (Clipped Policy Gradient Optimization with Policy Drift), a novel RL algorithm built on a policy-gradient loss with a clipping mechanism and a policy-drift regularizer. In our experiments, CPGD trains more stably and achieves better performance than GRPO.
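As a rough illustration of the loss structure just described (this is a sketch, not the authors' exact formulation: the clipping threshold `eps`, the drift coefficient `beta`, and the use of a k2-style KL estimator for the drift term are all our assumptions), a clipped policy-gradient objective with a policy-drift penalty might look like:

```python
import numpy as np

def cpgd_loss(logp_new, logp_old, advantages, eps=0.2, beta=0.01):
    """Illustrative CPGD-style loss: clipped policy gradient + policy drift.

    logp_new / logp_old: per-token log-probs under the current / old policy.
    advantages: per-token advantage estimates.
    eps, beta: hypothetical clipping threshold and drift weight.
    """
    log_ratio = np.asarray(logp_new) - np.asarray(logp_old)
    # Clip the log-ratio so tokens that have drifted far from the old policy
    # stop contributing large updates (the "clipped policy gradient" part).
    clipped = np.clip(log_ratio, -eps, eps)
    pg_term = -np.asarray(advantages) * clipped
    # Policy drift: penalize divergence from the old policy. Here we use the
    # k2 estimator 0.5 * (log_ratio)^2 as a cheap KL proxy (our assumption).
    drift_term = beta * 0.5 * log_ratio ** 2
    return float(np.mean(pg_term + drift_term))

# With identical policies, both the gradient signal and the drift vanish:
print(cpgd_loss([-1.0, -2.0], [-1.0, -2.0], [0.5, -0.5]))  # → 0.0
```

Both terms pull in the same direction on stability: clipping caps the per-token update size, while the drift term keeps the whole update close to the old policy.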
- 📖 Report: CPGD-Report, CPGD-arxiv
- 🤗 Model: MM-Eureka-CPGD-Qwen-7B
- 💻 Code: MM-Eureka-Qwen-Code
🤗 Models
Building on the key factors for stable training identified in MM-EUREKA (https://github.com/ModalMinds/MM-EUREKA), we enhanced the model, dataset, and algorithmic modules. Specifically, we kept the strategy of omitting the KL divergence term and applying data filtering, while making the following key modifications:
- The base model was upgraded from InternVL2.5-8B-Instruct to the more powerful Qwen2.5-VL-7B-Instruct.
- The Vision Transformer (ViT) module was frozen during training.
- The underlying RL algorithm was switched from RLOO to GRPO.
- The data filtering strategy was transitioned from an offline approach to an online approach.
- Additional data from the K12 dataset was collected, expanding the total dataset size to 15,000 samples.
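As a sketch of what the online filtering step above can mean in practice (our interpretation, not necessarily the repository's exact criterion): in group-based RL such as GRPO, a prompt whose sampled rollouts all receive the same reward yields zero group-relative advantage, so it can be dropped on the fly during training:

```python
def has_learning_signal(group_rewards):
    """Hypothetical online filter: keep a prompt only if its rollout group
    has reward variance, i.e. group-relative advantages are not all zero."""
    return len(set(group_rewards)) > 1

# Toy batch: per-prompt rollout rewards (1 = correct answer, 0 = wrong).
batch = {"p1": [1, 1, 1, 1],   # all correct -> no signal, filter out
         "p2": [1, 0, 0, 1],   # mixed       -> keep
         "p3": [0, 0, 0, 0]}   # all wrong   -> filter out
kept = [p for p, rewards in batch.items() if has_learning_signal(rewards)]
print(kept)  # → ['p2']
```

Unlike offline filtering, which prunes the dataset once before training, this check runs on each freshly sampled rollout group, so it adapts as the policy improves and previously hard prompts become uniformly solvable.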
Model | MathVista | MathVerse | MathVision | OlympiadBench | WeMath | MMK12 |
---|---|---|---|---|---|---|
Claude3.7-Sonnet | 66.8 | 52.0 | 41.3 | 48.9 | 72.6 | 55.3 |
GPT-4o | 63.8 | 50.2 | 30.4 | 35.0 | 68.8 | 49.9 |
o1 | 73.9 | 57.0 | 60.3 | 68.0 | 98.7 | 73.9 |
Gemini2-flash | 70.4 | 59.3 | 41.3 | 51.0 | 71.4 | 65.2 |
Qwen-2.5-VL-7B | 68.2 | 47.9 | 25.4 | 20.2 | 62.1 | 53.6 |
Qwen-2.5-VL-32B | 74.7/71.7 | 49.9 | 40.1 | 30.0 | 69.1 | 66.8 |
Qwen-2.5-VL-72B | 74.8 | 57.6 | 38.1 | 40.4 | 72.4 | 70.5 |
InternVL2.5-VL-78B | 72.3 | 51.7 | 32.2 | 31.1 | 66.3 | 61.6 |
QVQ-72B-Preview | 71.4 | 48.2 | 35.9 | 33.2 | 65.4 | 61.5 |
Adora-7B | 73.5 | 50.1 | 23.0 | 20.1 | 64.2 | 58.1 |
R1-Onevision-7B | 64.1 | 47.1 | 29.9/23.5 | 17.3 | 61.8 | 39.8 |
MM-Eureka-Qwen-7B | 73.0 | 50.3 | 26.9 | 20.1 | 66.1 | 64.5 |
MM-Eureka-Qwen-32B | 74.8 | 56.5 | 34.4 | 35.9 | 73.4 | 72.2 |
MM-Eureka-CPGD-Qwen-7B | 74.0 | 50.6 | 28.3 | 21.4 | 68.3 | 65.3 |
- 🤗 MM-Eureka-Qwen-7B
- 🤗 MM-Eureka-Qwen-32B
- 🤗 MM-Eureka-CPGD-Qwen-7B