---
license: mit
tags:
- RLinf
language:
- en
metrics:
- accuracy
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
pipeline_tag: reinforcement-learning
model-index:
- name: RLinf-math-7B
  results:
  - task:
      type: math
    dataset:
      type: aime_2024
      name: AIME24
    metrics:
    - type: accuracy
      value: 68.328125
  - task:
      type: math
    dataset:
      type: aime_2025
      name: AIME25
    metrics:
    - type: accuracy
      value: 52.19375
  - task:
      type: stem
    dataset:
      type: gpqa_diamond
      name: GPQA-diamond
    metrics:
    - type: accuracy
      value: 48.178124999999994
---
<div align="center">
<img src="logo.svg" alt="RLinf-logo" width="500"/>
</div>
<div align="center">
<!-- <a href="TODO"><img src="https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv"></a> -->
<!-- <a href="TODO"><img src="https://img.shields.io/badge/HuggingFace-yellow?logo=huggingface&logoColor=white" alt="Hugging Face"></a> -->
<a href="https://github.com/RLinf/RLinf"><img src="https://img.shields.io/badge/Github-blue"></a>
<a href="https://rlinf.readthedocs.io/en/latest/"><img src="https://img.shields.io/badge/Documentation-Purple?color=8A2BE2&logo=readthedocs"></a>
<!-- <a href="TODO"><img src="https://devin.ai/assets/deepwiki-badge.png" alt="Ask DeepWiki.com" style="height:20px;"></a>
<a href="TODO"><img src="https://img.shields.io/badge/微信-green?logo=wechat&"></a> -->
</div>
<h1 align="center">RLinf: Reinforcement Learning Infrastructure for Agentic AI</h1>
[RLinf](https://github.com/RLinf/RLinf) is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.
<div align="center">
<img src="overview.png" alt="RLinf-overview" width="600"/>
</div>
## Model Description
The RLinf-math series is trained from the DeepSeek-R1-Distill-Qwen base models (1.5B and 7B variants), using the same base models and training datasets as AReaL. Training with RLinf yields state-of-the-art performance.
We adopt Group Relative Policy Optimization (GRPO) with token-level loss aggregation, focusing on mathematical reasoning and long chain-of-thought (CoT) tasks.
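As a rough illustration of what token-level aggregation means, the sketch below computes a GRPO-style clipped policy-gradient loss for one group of sampled responses and averages it over all valid tokens in the group, rather than averaging per sequence first. The function name, tensor shapes, and clipping constant are illustrative assumptions, not RLinf's actual training code:

```python
import torch

def grpo_token_level_loss(logprobs, old_logprobs, rewards, response_mask, clip_eps=0.2):
    """Hypothetical sketch of a GRPO loss with token-level aggregation.

    logprobs, old_logprobs: [G, T] per-token log-probs of G sampled responses
    rewards:                [G]    scalar reward per response
    response_mask:          [G, T] 1 for response tokens, 0 for padding
    """
    # Group-relative advantage: normalize each response's reward within its group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # [G]
    adv = adv.unsqueeze(-1)                                      # broadcast over tokens

    # PPO-style clipped importance ratio per token.
    ratio = torch.exp(logprobs - old_logprobs)                   # [G, T]
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    per_token_loss = -torch.min(ratio * adv, clipped * adv)      # [G, T]

    # Token-level aggregation: average over all valid tokens in the group,
    # instead of averaging within each sequence and then across sequences.
    return (per_token_loss * response_mask).sum() / response_mask.sum()
```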
## Evaluation and Results
We trained and evaluated two models using RLinf:
- RLinf-math-1.5B (based on DeepSeek-R1-Distill-Qwen-1.5B)
  - Recommended sampling settings: `temperature = 0.6`, `top_p = 0.95`
- RLinf-math-7B (based on DeepSeek-R1-Distill-Qwen-7B)
  - Recommended sampling settings: `temperature = 1.0`, `top_p = 0.95`
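If you serve the checkpoints with an inference engine such as vLLM (not covered by this card), the recommended settings map directly onto its sampling parameters. The snippet below is a hedged sketch using the 1.5B settings; the prompt and token budget are chosen only for illustration:

```python
from vllm import LLM, SamplingParams

# Illustrative setup; swap in the checkpoint you are actually using.
llm = LLM(model="RLinf/RLinf-math-1.5B")
params = SamplingParams(
    temperature=0.6,   # recommended for the 1.5B model
    top_p=0.95,
    max_tokens=8192,   # generous budget for long chain-of-thought outputs
)

outputs = llm.generate(["Solve: If x^2 + 2x + 1 = 0, what is x?"], params)
print(outputs[0].outputs[0].text)
```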
### Benchmark Results
**1.5B models**. All models except the base model are trained from DeepSeek-R1-Distill-Qwen-1.5B using RL.
| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
| ------------------------------------------ | --------- | --------- | ------------ | --------- |
| [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) | 28.33 | 24.90 | 27.45 | 26.89 |
| [DeepMath-1.5B](https://huggingface.co/zwhe99/DeepMath-1.5B) | 37.80 | 30.42 | 32.11 | 33.44 |
| [DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview) | 40.41 | 30.93 | 27.54 | 32.96 |
| [AReaL-1.5B-Preview-Stage-3](https://huggingface.co/inclusionAI/AReaL-1.5B-Preview-Stage-3) | 40.73 | 31.56 | 28.10 | 33.46 |
| AReaL-1.5B-retrain* | 44.42 | 34.27 | 33.81 | 37.50 |
| [FastCuRL-1.5B-V3](https://huggingface.co/Nickyang/FastCuRL-1.5B-V3) | 43.65 | 32.49 | 35.00 | 37.05 |
| [RLinf-math-1.5B](https://huggingface.co/RLinf/RLinf-math-1.5B) | **48.44** | **35.63** | **38.46** | **40.84** |
\* We retrained AReaL-1.5B with its default settings for 600 steps.
**7B models**. All models except the base model are trained from DeepSeek-R1-Distill-Qwen-7B using RL.
| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
| ---------------------------------------- | --------- | --------- | ------------ | --------- |
| [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | 54.90 | 40.20 | 45.48 | 46.86 |
| [AReaL-boba-RL-7B](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B) | 61.66 | 49.38 | 46.93 | 52.66 |
| [Skywork-OR1-7B](https://huggingface.co/Skywork/Skywork-OR1-7B) | 66.87 | 52.49 | 44.43 | 54.60 |
| [Polaris-7B-Preview](https://huggingface.co/POLARIS-Project/Polaris-7B-Preview) | **68.55** | 51.24 | 43.88 | 54.56 |
| [AceMath-RL-Nemotron-7B](https://huggingface.co/nvidia/AceMath-RL-Nemotron-7B) | 67.30 | **55.00** | 45.57 | 55.96 |
| [RLinf-math-7B](https://huggingface.co/RLinf/RLinf-math-7B) | 68.33 | 52.19 | **48.18** | **56.23** |
## How to Use
Example with Hugging Face `transformers`:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RLinf/RLinf-math-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

prompt = "Solve: If x^2 + 2x + 1 = 0, what is x?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,    # sampling must be enabled for temperature/top_p to take effect
    temperature=1.0,   # recommended for the 7B model
    top_p=0.95,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
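The DeepSeek-R1-Distill base models ship a chat template, so wrapping the prompt with `apply_chat_template` is usually closer to how the model was post-trained. The variant below reuses the `model` and `tokenizer` from the example above; treat it as an optional sketch rather than the only supported interface:

```python
# Optional: format the prompt with the tokenizer's chat template.
messages = [{"role": "user", "content": "Solve: If x^2 + 2x + 1 = 0, what is x?"}]
chat_inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,   # append the assistant-turn header
    return_tensors="pt",
).to(model.device)

chat_outputs = model.generate(
    chat_inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,   # recommended for the 7B model
    top_p=0.95,
)
print(tokenizer.decode(chat_outputs[0], skip_special_tokens=True))
```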
## License
This code repository and the model weights are licensed under the MIT License.