---
license: mit
tags:
- RLinf
language:
- en
metrics:
- accuracy
base_model:
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B
pipeline_tag: reinforcement-learning
model-index:
- name: RLinf-math-7B
  results:
  - task:
      type: math
    dataset:
      type: aime_2024
      name: AIME24
    metrics:
      - type: accuracy
        value: 68.33
  - task:
      type: math
    dataset:
      type: aime_2025
      name: AIME25
    metrics:
      - type: accuracy
        value: 52.19
  - task:
      type: stem
    dataset:
      type: gpqa_diamond
      name: GPQA-diamond
    metrics:
      - type: accuracy
        value: 48.18
---

<div align="center">
  <img src="logo.svg" alt="RLinf-logo" width="500"/>
</div>


<div align="center">
<!-- <a href="TODO"><img src="https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv"></a> -->
<!-- <a href="TODO"><img src="https://img.shields.io/badge/HuggingFace-yellow?logo=huggingface&logoColor=white" alt="Hugging Face"></a> -->
<a href="https://github.com/RLinf/RLinf"><img src="https://img.shields.io/badge/Github-blue"></a>
<a href="https://rlinf.readthedocs.io/en/latest/"><img src="https://img.shields.io/badge/Documentation-Purple?color=8A2BE2&logo=readthedocs"></a>
<!-- <a href="TODO"><img src="https://devin.ai/assets/deepwiki-badge.png" alt="Ask DeepWiki.com" style="height:20px;"></a>
<a href="TODO"><img src="https://img.shields.io/badge/微信-green?logo=wechat&amp"></a> -->
</div>

<h1 align="center">RLinf: Reinforcement Learning Infrastructure for Agentic AI</h1>

[RLinf](https://github.com/RLinf/RLinf) is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development.


<div align="center">
  <img src="overview.png" alt="RLinf-overview" width="600"/>
</div>

## Model Description
The RLinf-math series is trained from the DeepSeek-R1-Distill-Qwen base models (1.5B and 7B variants), using the same base models and training datasets as AReaL. Training with RLinf yields state-of-the-art performance on math-reasoning benchmarks.

We adopt Group Relative Policy Optimization (GRPO) with token-level loss aggregation, focusing on mathematical reasoning and long chain-of-thought (CoT) tasks.
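
To make the objective concrete, below is a minimal, self-contained sketch of a GRPO loss with token-level aggregation. The function name, tensor shapes, and clipping constant are our illustration, not RLinf's actual implementation:

```python
import torch

def grpo_token_level_loss(logprobs, old_logprobs, rewards, mask, clip_eps=0.2):
    """Hypothetical GRPO loss sketch with token-level aggregation.

    logprobs, old_logprobs: [G, T] per-token log-probs for a group of G
        sampled responses to the same prompt (T = max length, padded).
    rewards: [G] scalar reward per response.
    mask: [G, T] 1 for valid response tokens, 0 for padding.
    """
    # Group-relative advantage: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # [G]
    adv = adv.unsqueeze(-1)                                    # broadcast over tokens

    # PPO-style clipped surrogate, computed per token.
    ratio = torch.exp(logprobs - old_logprobs)                 # [G, T]
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.minimum(unclipped, clipped)

    # Token-level aggregation: average over all valid tokens in the group,
    # so long chain-of-thought responses are not down-weighted per token
    # (as they would be under per-sequence mean aggregation).
    return (per_token * mask).sum() / mask.sum()
```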

## Evaluation and Results
We trained and evaluated two models using RLinf:

- RLinf-math-1.5B (based on DeepSeek-R1-Distill-Qwen-1.5B)
  - Recommended sampling settings: `temperature = 0.6`, `top_p = 0.95`

- RLinf-math-7B (based on DeepSeek-R1-Distill-Qwen-7B)
  - Recommended sampling settings: `temperature = 1.0`, `top_p = 0.95` (see the sketch after this list)

### Benchmark Results

**1.5B models**. All models except the base model are trained from DeepSeek-R1-Distill-Qwen-1.5B using RL.

| Model                                      | AIME 24   | AIME 25   | GPQA-diamond | Average   |
| ------------------------------------------ | --------- | --------- | ------------ | --------- |
| [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) | 28.33     | 24.90     | 27.45        | 26.89     |
| [DeepMath-1.5B](https://huggingface.co/zwhe99/DeepMath-1.5B)                             | 37.80     | 30.42     | 32.11        | 33.44     |
| [DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview)                    | 40.41     | 30.93     | 27.54        | 32.96     |
| [AReaL-1.5B-Preview-Stage-3](https://huggingface.co/inclusionAI/AReaL-1.5B-Preview-Stage-3)                 | 40.73     | 31.56     | 28.10        | 33.46     |
| AReaL-1.5B-retrain*                        | 44.42     | 34.27     | 33.81        | 37.50     |
| [FastCuRL-1.5B-V3](https://huggingface.co/Nickyang/FastCuRL-1.5B-V3)                          | 43.65     | 32.49     | 35.00        | 37.05     |
| [RLinf-math-1.5B](https://huggingface.co/RLinf/RLinf-math-1.5B)                           | **48.44** | **35.63** | **38.46**    | **40.84** |

\* We retrained the model using the default settings for 600 steps.

**7B models**. All models except the base model are trained from DeepSeek-R1-Distill-Qwen-7B using RL.

| Model                                    | AIME 24   | AIME 25   | GPQA-diamond | Average   |
| ---------------------------------------- | --------- | --------- | ------------ | --------- |
| [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B)  | 54.90     | 40.20     | 45.48        | 46.86     |
| [AReaL-boba-RL-7B](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B)                           | 61.66     | 49.38     | 46.93        | 52.66     |
| [Skywork-OR1-7B](https://huggingface.co/Skywork/Skywork-OR1-7B)                           | 66.87     | 52.49     | 44.43        | 54.60     |
| [Polaris-7B-Preview](https://huggingface.co/POLARIS-Project/Polaris-7B-Preview)                    | **68.55** | 51.24     | 43.88        | 54.56     |
| [AceMath-RL-Nemotron-7B](https://huggingface.co/nvidia/AceMath-RL-Nemotron-7B)                   | 67.30     | **55.00** | 45.57        | 55.96     |
| [RLinf-math-7B](https://huggingface.co/RLinf/RLinf-math-7B)                            | 68.33     | 52.19     | **48.18**    | **56.23** |



## How to Use
Example with Hugging Face `transformers`:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RLinf/RLinf-math-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

prompt = "Solve: If x^2 + 2x + 1 = 0, what is x?"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,    # required for temperature/top_p to take effect
    temperature=1.0,   # recommended for the 7B model
    top_p=0.95,
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
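
Since the base checkpoints are chat models, wrapping the prompt in the tokenizer's chat template usually matches the training-time format more closely than raw text. A minimal variant of the example above, assuming the bundled chat template:

```python
messages = [{"role": "user", "content": "Solve: If x^2 + 2x + 1 = 0, what is x?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,
    top_p=0.95,
)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```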

## License
This code repository and the model weights are licensed under the MIT License.