CPGD: Toward Stable Rule-based Reinforcement Learning for Language Models
We propose CPGD (Clipped Policy Gradient Optimization with Policy Drift), a novel RL algorithm built on a policy-gradient loss with a clipping mechanism and a policy-drift regularizer. In our experiments, CPGD trains more stably and achieves better performance than GRPO.
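As a rough illustration of the loss structure just described (this is a sketch, not the authors' exact formulation: the clipping threshold `eps`, the drift coefficient `beta`, and the use of a k2-style KL estimator for the drift term are all our assumptions), a clipped policy-gradient objective with a policy-drift penalty might look like:

```python
import numpy as np

def cpgd_loss(logp_new, logp_old, advantages, eps=0.2, beta=0.01):
    """Illustrative CPGD-style loss: clipped policy gradient + policy drift.

    logp_new / logp_old: per-token log-probs under the current / old policy.
    advantages: per-token advantage estimates.
    eps, beta: hypothetical clipping threshold and drift weight.
    """
    log_ratio = np.asarray(logp_new) - np.asarray(logp_old)
    # Clip the log-ratio so tokens that have drifted far from the old policy
    # stop contributing large updates (the "clipped policy gradient" part).
    clipped = np.clip(log_ratio, -eps, eps)
    pg_term = -np.asarray(advantages) * clipped
    # Policy drift: penalize divergence from the old policy. Here we use the
    # k2 estimator 0.5 * (log_ratio)^2 as a cheap KL proxy (our assumption).
    drift_term = beta * 0.5 * log_ratio ** 2
    return float(np.mean(pg_term + drift_term))

# With identical policies, both the gradient signal and the drift vanish:
print(cpgd_loss([-1.0, -2.0], [-1.0, -2.0], [0.5, -0.5]))  # → 0.0
```

Both terms pull in the same direction on stability: clipping caps the per-token update size, while the drift term keeps the whole update close to the old policy.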
- 📖 Report: CPGD-Report, CPGD-arxiv
- 🤗 Model: MM-Eureka-CPGD-Qwen-7B
- 💻 Code: MM-Eureka-Qwen-Code
🤗 Models
Building on the key factors for stable training identified in MM-EUREKA (https://github.com/ModalMinds/MM-EUREKA), we enhanced the model, dataset, and algorithmic modules. Specifically, we kept the strategy of omitting the KL divergence term and applying data filtering, while making the following key modifications:
- The base model was upgraded from InternVL2.5-8B-Instruct to the more powerful Qwen2.5-VL-7B-Instruct.
- The Vision Transformer (ViT) module was frozen during training.
- The underlying RL algorithm was switched from RLOO to GRPO.
- The data filtering strategy was transitioned from an offline approach to an online approach.
- Additional data from the K12 dataset was collected, expanding the total dataset size to 15,000 samples.
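As a sketch of what the online filtering step above can mean in practice (our interpretation, not necessarily the repository's exact criterion): in group-based RL such as GRPO, a prompt whose sampled rollouts all receive the same reward yields zero group-relative advantage, so it can be dropped on the fly during training:

```python
def has_learning_signal(group_rewards):
    """Hypothetical online filter: keep a prompt only if its rollout group
    has reward variance, i.e. group-relative advantages are not all zero."""
    return len(set(group_rewards)) > 1

# Toy batch: per-prompt rollout rewards (1 = correct answer, 0 = wrong).
batch = {"p1": [1, 1, 1, 1],   # all correct -> no signal, filter out
         "p2": [1, 0, 0, 1],   # mixed       -> keep
         "p3": [0, 0, 0, 0]}   # all wrong   -> filter out
kept = [p for p, rewards in batch.items() if has_learning_signal(rewards)]
print(kept)  # → ['p2']
```

Unlike offline filtering, which prunes the dataset once before training, this check runs on each freshly sampled rollout group, so it adapts as the policy improves and previously hard prompts become uniformly solvable.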
Model | MathVista | MathVerse | MathVision | OlympiadBench | WeMath | MMK12 |
---|---|---|---|---|---|---|
Claude3.7-Sonnet | 66.8 | 52.0 | 41.3 | 48.9 | 72.6 | 55.3 |
GPT-4o | 63.8 | 50.2 | 30.4 | 35.0 | 68.8 | 49.9 |
o1 | 73.9 | 57.0 | 60.3 | 68.0 | 98.7 | 73.9 |
Gemini2-flash | 70.4 | 59.3 | 41.3 | 51.0 | 71.4 | 65.2 |
Qwen-2.5-VL-7B | 68.2 | 47.9 | 25.4 | 20.2 | 62.1 | 53.6 |
Qwen-2.5-VL-32B | 74.7/71.7 | 49.9 | 40.1 | 30.0 | 69.1 | 66.8 |
Qwen-2.5-VL-72B | 74.8 | 57.6 | 38.1 | 40.4 | 72.4 | 70.5 |
InternVL2.5-VL-78B | 72.3 | 51.7 | 32.2 | 31.1 | 66.3 | 61.6 |
QVQ-72B-Preview | 71.4 | 48.2 | 35.9 | 33.2 | 65.4 | 61.5 |
Adora-7B | 73.5 | 50.1 | 23.0 | 20.1 | 64.2 | 58.1 |
R1-Onevision-7B | 64.1 | 47.1 | 29.9/23.5 | 17.3 | 61.8 | 39.8 |
MM-Eureka-Qwen-7B | 73.0 | 50.3 | 26.9 | 20.1 | 66.1 | 64.5 |
MM-Eureka-Qwen-32B | 74.8 | 56.5 | 34.4 | 35.9 | 73.4 | 72.2 |
MM-Eureka-CPGD-Qwen-7B | 74.0 | 50.6 | 28.3 | 21.4 | 68.3 | 65.3 |
- 🤗 MM-Eureka-Qwen-7B
- 🤗 MM-Eureka-Qwen-32B
- 🤗 MM-Eureka-CPGD-Qwen-7B