---
license: mit
tags:
- RLinf
language:
- en
metrics:
- accuracy
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
pipeline_tag: reinforcement-learning
model-index:
- name: RLinf-math-7B
  results:
  - task:
      type: math
    dataset:
      type: aime_2024
      name: AIME24
    metrics:
      - type: accuracy
        value: 68.33
  - task:
      type: math
    dataset:
      type: aime_2025
      name: AIME25
    metrics:
      - type: accuracy
        value: 52.19
  - task:
      type: stem
    dataset:
      type: gpqa_diamond
      name: GPQA-diamond
    metrics:
      - type: accuracy
        value: 48.18
---

<div align="center">
  <img src="logo.svg" alt="RLinf-logo" width="500"/>
</div>


<div align="center">
<!-- <a href="TODO"><img src="https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv"></a> -->
<!-- <a href="TODO"><img src="https://img.shields.io/badge/HuggingFace-yellow?logo=huggingface&logoColor=white" alt="Hugging Face"></a> -->
<a href="https://github.com/RLinf/RLinf"><img src="https://img.shields.io/badge/Github-blue"></a>
<a href="https://rlinf.readthedocs.io/en/latest/"><img src="https://img.shields.io/badge/Documentation-Purple?color=8A2BE2&logo=readthedocs"></a>
<!-- <a href="TODO"><img src="https://devin.ai/assets/deepwiki-badge.png" alt="Ask DeepWiki.com" style="height:20px;"></a>
<a href="TODO"><img src="https://img.shields.io/badge/微信-green?logo=wechat&amp"></a> -->
</div>

<h1 align="center">RLinf: Reinforcement Learning Infrastructure for Agentic AI</h1>

[RLinf](https://github.com/RLinf/RLinf) is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.


<div align="center">
  <img src="overview.png" alt="RLinf-overview" width="600"/>
</div>

## Model Description
The RLinf-math series is trained from the DeepSeek-R1-Distill-Qwen base models (1.5B and 7B variants), using the same base models and training datasets as AReaL. Training with RLinf yields state-of-the-art performance on math-reasoning benchmarks.

We adopt Group Relative Policy Optimization (GRPO) with token-level loss aggregation, focusing on mathematical reasoning and long chain-of-thought (CoT) tasks.
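
To make the objective concrete, below is a minimal, self-contained sketch of a GRPO loss with token-level aggregation. The function name, tensor shapes, and clipping constant are our illustration, not RLinf's actual implementation:

```python
import torch

def grpo_token_level_loss(logprobs, old_logprobs, rewards, mask, clip_eps=0.2):
    """Hypothetical GRPO loss sketch with token-level aggregation.

    logprobs, old_logprobs: [G, T] per-token log-probs for a group of G
        sampled responses to the same prompt (T = max length, padded).
    rewards: [G] scalar reward per response.
    mask: [G, T] 1 for valid response tokens, 0 for padding.
    """
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # [G]
    adv = adv.unsqueeze(-1)                                    # broadcast over tokens

    # PPO-style clipped surrogate, computed per token.
    ratio = torch.exp(logprobs - old_logprobs)                 # [G, T]
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.minimum(unclipped, clipped)

    # Token-level aggregation: average over all valid tokens in the group,
    # so long chain-of-thought responses are not down-weighted per token
    # (as they would be under per-sequence mean aggregation).
    return (per_token * mask).sum() / mask.sum()
```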

## Evaluation and Results
We trained and evaluated two models using RLinf:

- RLinf-math-1.5B (based on DeepSeek-R1-Distill-Qwen-1.5B)
  - Recommended sampling settings: `temperature = 0.6`, `top_p = 0.95`

- RLinf-math-7B (based on DeepSeek-R1-Distill-Qwen-7B)
  - Recommended sampling settings: `temperature = 1.0`, `top_p = 0.95` (see the sketch after this list)

### Benchmark Results

**1.5B models**. All models except the base model are trained from DeepSeek-R1-Distill-Qwen-1.5B using RL.

| Model                                      | AIME 24   | AIME 25   | GPQA-diamond | Average   |
| ------------------------------------------ | --------- | --------- | ------------ | --------- |
| [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) | 28.33     | 24.90     | 27.45        | 26.89     |
| [DeepMath-1.5B](https://huggingface.co/zwhe99/DeepMath-1.5B)                             | 37.80     | 30.42     | 32.11        | 33.44     |
| [DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview)                    | 40.41     | 30.93     | 27.54        | 32.96     |
| [AReaL-1.5B-Preview-Stage-3](https://huggingface.co/inclusionAI/AReaL-1.5B-Preview-Stage-3)                 | 40.73     | 31.56     | 28.10        | 33.46     |
| AReaL-1.5B-retrain*                        | 44.42     | 34.27     | 33.81        | 37.50     |
| [FastCuRL-1.5B-V3](https://huggingface.co/Nickyang/FastCuRL-1.5B-V3)                          | 43.65     | 32.49     | 35.00        | 37.05     |
| [RLinf-math-1.5B](https://huggingface.co/RLinf/RLinf-math-1.5B)                           | **48.44** | **35.63** | **38.46**    | **40.84** |

\* We retrained the model using the default settings for 600 steps.

**7B models**. All models except the base model are trained from DeepSeek-R1-Distill-Qwen-7B using RL.

| Model                                    | AIME 24   | AIME 25   | GPQA-diamond | Average   |
| ---------------------------------------- | --------- | --------- | ------------ | --------- |
| [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)  | 54.90     | 40.20     | 45.48        | 46.86     |
| [AReaL-boba-RL-7B](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B)                           | 61.66     | 49.38     | 46.93        | 52.66     |
| [Skywork-OR1-7B](https://huggingface.co/Skywork/Skywork-OR1-7B)                           | 66.87     | 52.49     | 44.43        | 54.60     |
| [Polaris-7B-Preview](https://huggingface.co/POLARIS-Project/Polaris-7B-Preview)                    | **68.55** | 51.24     | 43.88        | 54.56     |
| [AceMath-RL-Nemotron-7B](https://huggingface.co/nvidia/AceMath-RL-Nemotron-7B)                   | 67.30     | **55.00** | 45.57        | 55.96     |
| [RLinf-math-7B](https://huggingface.co/RLinf/RLinf-math-7B)                            | 68.33     | 52.19     | **48.18**    | **56.23** |



## How to Use
Example with Hugging Face `transformers`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RLinf/RLinf-math-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

prompt = "Solve: If x^2 + 2x + 1 = 0, what is x?"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,    # required for temperature/top_p to take effect
    temperature=1.0,   # recommended for the 7B model
    top_p=0.95,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
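
Since the base checkpoints are chat models, wrapping the prompt in the tokenizer's chat template usually matches the training-time format more closely than raw text. A minimal variant of the example above, assuming the bundled chat template:

```python
messages = [{"role": "user", "content": "Solve: If x^2 + 2x + 1 = 0, what is x?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```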

## License
This code repository and the model weights are licensed under the MIT License.