Floriel-X committed · Commit 4d9797b · verified · 1 parent: e091fc7

Update README.md

Files changed (1): README.md (+24 −14)
```diff
@@ -62,7 +62,7 @@ model-index:
 </div>
 
 ## Model Description
-The RLinf-math series is trained on DeepSeek-R1-Distill-Qwen (1.5B and 7B variants), using the same base models and training datasets as AReaL. Training with RLinf yields performance better than AReaL.
+The RLinf-math series is trained on DeepSeek-R1-Distill-Qwen (1.5B and 7B variants), using the same base models and training datasets as AReaL. Training with RLinf yields state-of-the-art performance.
 
 We adopt Group Relative Policy Optimization (GRPO) with token-level loss aggregation, focusing on mathematical reasoning and long chain-of-thought (CoT) tasks.
 
@@ -77,23 +77,33 @@ We trained and evaluated two models using RLinf:
 
 ### Benchmark Results
 
-|                               | AIME24 | AIME25 | GPQA-diamond |
-| ----------------------------- | ------ | ------ | ------------ |
-| [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) | 28.33 | 24.90 | 27.45 |
-| [DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview) | 40.41 | 30.93 | 27.54 |
-| [AReaL-1.5B](https://huggingface.co/inclusionAI/AReaL-1.5B-Preview-Stage-3) | 40.73 | 31.56 | 28.1 |
-| AReaL-1.5B-retrain* | 44.42 | 34.27 | 33.81 |
-| [RLinf-math-1.5B](https://huggingface.co/RLinf/RLinf-math-1.5B) | **48.44** | **35.63** | **38.46** |
-
-
-|                             | AIME24 | AIME25 | GPQA-diamond |
-| --------------------------- | ------ | ------ | ------------ |
-| [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | 54.90 | 40.20 | 45.48 |
-| [AReaL-7B](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B) | 62.82 | 47.29 | 46.54 |
-| [RLinf-math-7B](https://huggingface.co/RLinf/RLinf-math-7B) | **68.33** | **52.19** | **48.18** |
+**1.5B models**. All models except the base model are trained from DeepSeek-R1-Distill-Qwen-1.5B with RL; "Average" is the mean over the three benchmarks.
+
+| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
+| ------------------------------------------ | --------- | --------- | ------------ | --------- |
+| [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) | 28.33 | 24.90 | 27.45 | 26.89 |
+| [DeepMath-1.5B](https://huggingface.co/zwhe99/DeepMath-1.5B) | 37.80 | 30.42 | 32.11 | 33.44 |
+| [DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview) | 40.41 | 30.93 | 27.54 | 32.96 |
+| [AReaL-1.5B-Preview-Stage-3](https://huggingface.co/inclusionAI/AReaL-1.5B-Preview-Stage-3) | 40.73 | 31.56 | 28.10 | 33.46 |
+| AReaL-1.5B-retrain* | 44.42 | 34.27 | 33.81 | 37.50 |
+| [FastCuRL-1.5B-V3](https://huggingface.co/Nickyang/FastCuRL-1.5B-V3) | 43.65 | 32.49 | 35.00 | 37.05 |
+| [RLinf-math-1.5B](https://huggingface.co/RLinf/RLinf-math-1.5B) | **48.44** | **35.63** | **38.46** | **40.84** |
 
 \* We retrain the model using the default settings for 600 steps.
 
+**7B models**. All models except the base model are trained from DeepSeek-R1-Distill-Qwen-7B with RL.
+
+| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
+| ---------------------------------------- | --------- | --------- | ------------ | --------- |
+| [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | 54.90 | 40.20 | 45.48 | 46.86 |
+| [AReaL-boba-RL-7B](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B) | 61.66 | 49.38 | 46.93 | 52.66 |
+| [Skywork-OR1-7B](https://huggingface.co/Skywork/Skywork-OR1-7B) | 66.87 | 52.49 | 44.43 | 54.60 |
+| [Polaris-7B-Preview](https://huggingface.co/POLARIS-Project/Polaris-7B-Preview) | **68.55** | 51.24 | 43.88 | 54.56 |
+| [AceMath-RL-Nemotron-7B](https://huggingface.co/nvidia/AceMath-RL-Nemotron-7B) | 67.30 | **55.00** | 45.57 | 55.96 |
+| [RLinf-math-7B](https://huggingface.co/RLinf/RLinf-math-7B) | 68.33 | 52.19 | **48.18** | **56.23** |
+
+
+
 ## How to Use
 Example with Hugging Face `transformers`:
 
```
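The updated description says the models are trained with GRPO using token-level loss aggregation. As a rough illustration only (not RLinf's actual code; all function names and numbers below are made up for the sketch), GRPO normalizes each sampled response's reward against the mean and standard deviation of its group, and token-level aggregation averages the loss over all tokens in the group rather than giving every sequence equal weight:

```python
# Illustrative sketch of GRPO-style group-relative advantages and
# token-level loss aggregation. NOT the RLinf implementation; all
# names and numbers are hypothetical.

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean/std of its sampled group."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

def token_level_loss(per_token_losses):
    """Average over ALL tokens in the group (token-level aggregation),
    so a long chain-of-thought response contributes in proportion to
    its length instead of each sequence counting equally."""
    total = sum(sum(seq) for seq in per_token_losses)
    n_tokens = sum(len(seq) for seq in per_token_losses)
    return total / n_tokens

# Four sampled responses to one prompt, scored 1 (correct) / 0 (wrong):
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print([round(a, 4) for a in advs])  # [1.0, -1.0, -1.0, 1.0]

# Two responses of different lengths: the group-wide token mean.
print(round(token_level_loss([[0.2, 0.4, 0.6], [0.8]]), 4))  # 0.5
```

With per-sequence averaging the two responses above would contribute 0.4 and 0.8 equally; token-level aggregation instead weights every token once, which matters for the long CoT outputs these models produce.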