|
--- |
|
license: mit |
|
tags: |
|
- RLinf |
|
language: |
|
- en |
|
metrics: |
|
- accuracy |
|
base_model: |
|
- deepseek-ai/DeepSeek-R1-Distill-Qwen-7B |
|
pipeline_tag: reinforcement-learning |
|
model-index: |
|
- name: RLinf-math-7B |
|
results: |
|
- task: |
|
type: math |
|
dataset: |
|
type: aime_2024 |
|
name: AIME24 |
|
metrics: |
|
- type: accuracy |
|
value: 68.328125 |
|
- task: |
|
type: math |
|
dataset: |
|
type: aime_2025 |
|
name: AIME25 |
|
metrics: |
|
- type: accuracy |
|
value: 52.19375 |
|
- task: |
|
type: stem |
|
dataset: |
|
type: gpqa_diamond |
|
name: GPQA-diamond |
|
metrics: |
|
- type: accuracy |
|
value: 48.178124999999994 |
|
--- |
|
|
|
<div align="center"> |
|
<img src="logo.svg" alt="RLinf-logo" width="500"/> |
|
</div> |
|
|
|
|
|
<div align="center"> |
|
<!-- <a href="TODO"><img src="https://img.shields.io/badge/arXiv-Paper-red?logo=arxiv"></a> --> |
|
<!-- <a href="TODO"><img src="https://img.shields.io/badge/HuggingFace-yellow?logo=huggingface&logoColor=white" alt="Hugging Face"></a> --> |
|
<a href="https://github.com/RLinf/RLinf"><img src="https://img.shields.io/badge/Github-blue"></a> |
|
<a href="https://rlinf.readthedocs.io/en/latest/"><img src="https://img.shields.io/badge/Documentation-Purple?color=8A2BE2&logo=readthedocs"></a> |
|
<!-- <a href="TODO"><img src="https://devin.ai/assets/deepwiki-badge.png" alt="Ask DeepWiki.com" style="height:20px;"></a> |
|
<a href="TODO"><img src="https://img.shields.io/badge/微信-green?logo=wechat&"></a> --> |
|
</div> |
|
|
|
<h1 align="center">RLinf: Reinforcement Learning Infrastructure for Agentic AI</h1> |
|
|
|
[RLinf](https://github.com/RLinf/RLinf) is a flexible and scalable open-source infrastructure designed for post-training foundation models (LLMs, VLMs, VLAs) via reinforcement learning. The 'inf' in RLinf stands for Infrastructure, highlighting its role as a robust backbone for next-generation training. It also stands for Infinite, symbolizing the system’s support for open-ended learning, continuous generalization, and limitless possibilities in intelligence development. |
|
|
|
|
|
<div align="center"> |
|
<img src="overview.png" alt="RLinf-overview" width="600"/> |
|
</div> |
|
|
|
## Model Description |
|
The RLinf-math models are post-trained with reinforcement learning from DeepSeek-R1-Distill-Qwen (1.5B and 7B variants), using the same base models and training datasets as AReaL. Training with RLinf yields state-of-the-art (SOTA) performance on the benchmarks reported below.
|
|
|
We adopt Group Relative Policy Optimization (GRPO) with token-level loss aggregation, focusing on mathematical reasoning and long chain-of-thought (CoT) tasks. |
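
The sketch below illustrates what a GRPO objective with token-level loss aggregation can look like. It is a minimal, illustrative example, not RLinf's actual implementation; the inputs (`logprobs`, `old_logprobs`, `rewards`, `mask`) and the clipping coefficient are assumed to be provided by the surrounding training loop.

```python
import torch

def grpo_token_level_loss(logprobs, old_logprobs, rewards, mask, clip_eps=0.2):
    """Illustrative GRPO loss with token-level aggregation (sketch only).

    logprobs, old_logprobs: (G, T) per-token log-probs for a group of G sampled responses
    rewards:                (G,)   scalar reward per response
    mask:                   (G, T) 1 for response tokens, 0 for padding
    """
    # Group-relative advantage: normalize rewards within the sampled group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)  # (G,)
    adv = adv.unsqueeze(-1)                                    # broadcast over tokens

    # PPO-style clipped surrogate on the per-token importance ratio.
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.minimum(unclipped, clipped)

    # Token-level aggregation: average over all valid tokens in the group,
    # not per sequence, so long CoT responses are not down-weighted.
    return (per_token * mask).sum() / mask.sum()
```

Token-level aggregation divides by the total number of response tokens in the group rather than averaging per sequence, so long chain-of-thought responses contribute to the gradient in proportion to their length.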
|
|
|
## Evaluation and Results |
|
We trained and evaluated two models using RLinf: |
|
|
|
- RLinf-math-1.5B Model (based on DeepSeek-R1-Distill-Qwen-1.5B)
  - Recommended sampling settings: `temperature = 0.6`, `top_p = 0.95`
- RLinf-math-7B Model (based on DeepSeek-R1-Distill-Qwen-7B)
  - Recommended sampling settings: `temperature = 1.0`, `top_p = 0.95`
|
|
|
### Benchmark Results |
|
|
|
**1.5B models**. All models except the base model are trained from DeepSeek-R1-Distill-Qwen-1.5B with RL.
|
|
|
| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
| ------------------------------------------ | --------- | --------- | ------------ | --------- |
| [DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) | 28.33 | 24.90 | 27.45 | 26.89 |
| [DeepMath-1.5B](https://huggingface.co/zwhe99/DeepMath-1.5B) | 37.80 | 30.42 | 32.11 | 33.44 |
| [DeepScaleR-1.5B-Preview](https://huggingface.co/agentica-org/DeepScaleR-1.5B-Preview) | 40.41 | 30.93 | 27.54 | 32.96 |
| [AReaL-1.5B-Preview-Stage-3](https://huggingface.co/inclusionAI/AReaL-1.5B-Preview-Stage-3) | 40.73 | 31.56 | 28.10 | 33.46 |
| AReaL-1.5B-retrain* | 44.42 | 34.27 | 33.81 | 37.50 |
| [FastCuRL-1.5B-V3](https://huggingface.co/Nickyang/FastCuRL-1.5B-V3) | 43.65 | 32.49 | 35.00 | 37.05 |
| [RLinf-math-1.5B](https://huggingface.co/RLinf/RLinf-math-1.5B) | **48.44** | **35.63** | **38.46** | **40.84** |
|
|
|
\* We retrained AReaL-1.5B using the default settings for 600 steps.
|
|
|
**7B models**. All models except the base model are trained from DeepSeek-R1-Distill-Qwen-7B with RL.
|
|
|
| Model | AIME 24 | AIME 25 | GPQA-diamond | Average |
| ---------------------------------------- | --------- | --------- | ------------ | --------- |
| [DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | 54.90 | 40.20 | 45.48 | 46.86 |
| [AReaL-boba-RL-7B](https://huggingface.co/inclusionAI/AReaL-boba-RL-7B) | 61.66 | 49.38 | 46.93 | 52.66 |
| [Skywork-OR1-7B](https://huggingface.co/Skywork/Skywork-OR1-7B) | 66.87 | 52.49 | 44.43 | 54.60 |
| [Polaris-7B-Preview](https://huggingface.co/POLARIS-Project/Polaris-7B-Preview) | **68.55** | 51.24 | 43.88 | 54.56 |
| [AceMath-RL-Nemotron-7B](https://huggingface.co/nvidia/AceMath-RL-Nemotron-7B) | 67.30 | **55.00** | 45.57 | 55.96 |
| [RLinf-math-7B](https://huggingface.co/RLinf/RLinf-math-7B) | 68.33 | 52.19 | **48.18** | **56.23** |
|
|
|
|
|
|
|
## How to Use |
|
Example with Hugging Face `transformers`: |
|
|
|
```python |
|
from transformers import AutoModelForCausalLM, AutoTokenizer |
|
|
|
model_name = "RLinf/RLinf-math-7B" |
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto", torch_dtype="auto")
|
|
|
prompt = "Solve: If x^2 + 2x + 1 = 0, what is x?" |
|
|
|
inputs = tokenizer(prompt, return_tensors="pt").to(model.device) |
|
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,   # enable sampling so temperature/top_p take effect
    temperature=1.0,  # recommended for the 7B model
    top_p=0.95,
)
|
|
|
print(tokenizer.decode(outputs[0], skip_special_tokens=True)) |
|
``` |
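
For faster batched inference, the sketch below shows how the recommended sampling settings could be used with vLLM. This is an illustrative example, not part of the official setup; the `max_tokens` value is an arbitrary choice.

```python
from vllm import LLM, SamplingParams

# Recommended settings from this card: temperature=1.0, top_p=0.95 for the 7B model
# (use temperature=0.6 for RLinf-math-1.5B). max_tokens is an illustrative choice.
llm = LLM(model="RLinf/RLinf-math-7B")
sampling = SamplingParams(temperature=1.0, top_p=0.95, max_tokens=8192)

prompts = ["Solve: If x^2 + 2x + 1 = 0, what is x?"]
outputs = llm.generate(prompts, sampling)

for out in outputs:
    print(out.outputs[0].text)
```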
|
|
|
## License |
|
This code repository and the model weights are licensed under the MIT License. |
|
|