</div>
## Model Description

The RLinf-math series is trained on DeepSeek-R1-Distill-Qwen (1.5B and 7B variants), using the same base models and training datasets as AReaL. Training with RLinf yields state-of-the-art performance.

We adopt Group Relative Policy Optimization (GRPO) with token-level loss aggregation, focusing on mathematical reasoning and long chain-of-thought (CoT) tasks.
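As a rough sketch (not RLinf's actual implementation, and omitting PPO-style importance ratios and clipping for brevity), the group-relative advantages and token-level loss aggregation work like this:

```python
import math

def grpo_token_level_loss(rewards, token_logps, eps=1e-6):
    """Toy GRPO surrogate loss for one group of sampled responses.

    rewards:     one scalar reward per response in the group
    token_logps: per-response lists of per-token log-probabilities
    """
    # Group-relative advantage: normalize each reward against the
    # mean and std of its own sampling group (no learned critic).
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    advantages = [(r - mean) / (std + eps) for r in rewards]
    # Token-level aggregation: divide by the group's *total* token
    # count, so long chain-of-thought responses are not down-weighted
    # the way per-sequence averaging would down-weight them.
    total_tokens = sum(len(lps) for lps in token_logps)
    loss = sum(-a * lp for a, lps in zip(advantages, token_logps) for lp in lps)
    return loss / total_tokens
```

The token-level denominator is the detail that matters for long CoT: with per-sequence averaging, each token of a long response contributes less to the gradient than each token of a short one.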
### Benchmark Results

**1.5B models.** All models except the base model are trained from DeepSeek-R1-Distill-Qwen-1.5B using RL.

| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
| ----- | ------- | ------- | ------------ | ------- |
| [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) | 28.33 | 24.90 | 27.45 | 26.89 |
| [DeepMath-1.5B](https://huggingface.co/zwhe99/DeepMath-1.5B) | 37.80 | 30.42 | 32.11 | 33.44 |
| [DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview) | 40.41 | 30.93 | 27.54 | 32.96 |
| [AReaL-1.5B-Preview-Stage-3](https://huggingface.co/inclusionAI/AReaL-1.5B-Preview-Stage-3) | 40.73 | 31.56 | 28.10 | 33.46 |
| AReaL-1.5B-retrain* | 44.42 | 34.27 | 33.81 | 37.50 |
| [FastCuRL-1.5B-V3](https://huggingface.co/Nickyang/FastCuRL-1.5B-V3) | 43.65 | 32.49 | 35.00 | 37.05 |
| [RLinf-math-1.5B](https://huggingface.co/RLinf/RLinf-math-1.5B) | **48.44** | **35.63** | **38.46** | **40.84** |

\* We retrained the model using the default settings for 600 steps.

**7B models.** All models except the base model are trained from DeepSeek-R1-Distill-Qwen-7B using RL.

| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
| ----- | ------- | ------- | ------------ | ------- |
| [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | 54.90 | 40.20 | 45.48 | 46.86 |
| [AReaL-boba-RL-7B](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B) | 61.66 | 49.38 | 46.93 | 52.66 |
| [Skywork-OR1-7B](https://huggingface.co/Skywork/Skywork-OR1-7B) | 66.87 | 52.49 | 44.43 | 54.60 |
| [Polaris-7B-Preview](https://huggingface.co/POLARIS-Project/Polaris-7B-Preview) | **68.55** | 51.24 | 43.88 | 54.56 |
| [AceMath-RL-Nemotron-7B](https://huggingface.co/nvidia/AceMath-RL-Nemotron-7B) | 67.30 | **55.00** | 45.57 | 55.96 |
| [RLinf-math-7B](https://huggingface.co/RLinf/RLinf-math-7B) | 68.33 | 52.19 | **48.18** | **56.23** |
## How to Use
Example with Hugging Face `transformers`:
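The original example is truncated in this copy; the following is a minimal sketch of loading the model with `transformers`. The generation settings and the prompt are illustrative assumptions, not the authors' recommended configuration.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "RLinf/RLinf-math-1.5B"  # or "RLinf/RLinf-math-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Build the prompt through the chat template rather than raw text,
# as the distilled Qwen base models expect it.
messages = [
    {"role": "user", "content": "Solve: if x + 2 = 5, what is x? Think step by step."}
]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Long chain-of-thought models need a generous token budget.
outputs = model.generate(
    inputs, max_new_tokens=4096, do_sample=True, temperature=0.6, top_p=0.95
)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```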