# AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning

<p align="center">

[![Technical Report](https://img.shields.io/badge/2505.16400-Technical_Report-blue)](https://arxiv.org/abs/2505.16400)
[![Dataset](https://img.shields.io/badge/🤗-Math_RL_Dataset-blue)](https://huggingface.co/datasets/nvidia/AceReason-Math)
[![Models](https://img.shields.io/badge/🤗-Models-blue)](https://huggingface.co/collections/nvidia/acereason-682f4e1261dc22f697fd1485)
[![Eval Toolkit](https://img.shields.io/badge/🤗-Eval_Code-blue)](https://huggingface.co/nvidia/AceReason-Nemotron-14B/blob/main/README_EVALUATION.md)
</p>

<img src="fig/main_fig.png" alt="main_fig" style="width: 600px; max-width: 100%;" />

## 🔥 News
- **6/11/2025**: We share our evaluation toolkit at [AceReason Evaluation](https://huggingface.co/nvidia/AceReason-Nemotron-14B/blob/main/README_EVALUATION.md), including:
  - scripts to run inference and scoring
  - LiveCodeBench (avg@8): model prediction files and scores for each month (2023/5-2025/5)
  - AIME24/25 (avg@64): model prediction files and scores (avg@k is sketched after this list)
- **6/2/2025**: We are excited to share our Math RL training dataset at [AceReason-Math](https://huggingface.co/datasets/nvidia/AceReason-Math).

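The avg@k numbers above average per-problem accuracy over k sampled generations. A minimal sketch of that computation, assuming predictions are stored as per-problem lists of correctness flags (an illustrative structure, not the toolkit's actual file format):

```python
from statistics import mean

def avg_at_k(correct: dict[str, list[bool]]) -> float:
    """avg@k: mean over problems of the fraction of k samples judged correct."""
    return mean(mean(flags) for flags in correct.values())

# Illustrative avg@8 over two problems (hypothetical ids and judgments).
correct = {
    "aime24-01": [True] * 6 + [False] * 2,  # 6 of 8 samples correct
    "aime24-02": [False] * 8,               # 0 of 8 samples correct
}
print(avg_at_k(correct))  # 0.375
```
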
We're thrilled to introduce AceReason-Nemotron-7B, a math and code reasoning model trained entirely through reinforcement learning (RL), starting from DeepSeek-R1-Distill-Qwen-7B. It delivers impressive results, achieving 69.0% on AIME 2024 (+14.5%), 53.6% on AIME 2025 (+17.4%), 51.8% on LiveCodeBench v5 (+8%), and 44.1% on LiveCodeBench v6 (+7%). We systematically study the RL training process through extensive ablations and propose a simple yet effective approach: first RL training on math-only prompts, then RL training on code-only prompts. Notably, we find that math-only RL significantly enhances the performance of strong distilled models not only on math benchmarks but also on code reasoning tasks. In addition, extended code-only RL further improves code benchmark performance while causing minimal degradation in math results. We find that RL not only elicits the foundational reasoning capabilities acquired during pre-training and supervised fine-tuning (e.g., distillation), but also pushes the limits of the model's reasoning ability, enabling it to solve problems that were previously unsolvable.

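In other words, training proceeds in two sequential RL stages: math-only prompts first, then code-only prompts continuing from the math-RL checkpoint. A schematic sketch of that staging, where `load_prompts`, `rl_train`, and the reward descriptions are hypothetical stubs rather than our actual training stack:

```python
# Schematic only: `load_prompts` and `rl_train` are hypothetical stubs
# standing in for a real RL training stack; only the stage ordering
# (math-only RL, then code-only RL) reflects the recipe described above.

def load_prompts(source: str) -> list[str]:
    """Stub: load RL prompts from a dataset identifier."""
    return [f"<prompt from {source}>"]

def rl_train(model: str, prompts: list[str], reward: str) -> str:
    """Stub: run RL with a verifiable reward; return the new checkpoint name."""
    return f"{model} -> RL[{reward}] on {len(prompts)} prompts"

# Stage 1: math-only RL from the distilled base model.
ckpt = rl_train(
    model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B",
    prompts=load_prompts("nvidia/AceReason-Math"),
    reward="final answer matches",  # verifiable math reward (assumption)
)

# Stage 2: extended code-only RL, continuing from the math-RL checkpoint.
ckpt = rl_train(
    model=ckpt,
    prompts=load_prompts("<code RL prompts>"),  # placeholder identifier
    reward="unit tests pass",       # verifiable code reward (assumption)
)
print(ckpt)
```
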
We share our training recipe and training logs in our technical report.

```python
final_prompt = "<|User|>" + question + "<|Assistant|><think>\n"
```
5. Our inference engine for evaluation is **vLLM==0.7.3** using top-p=0.95, temperature=0.6, max_tokens=32768 (see the sketch below).
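
For reference, a minimal sketch of that evaluation setup with vLLM, reusing the `final_prompt` template shown above (the question here is a placeholder):

```python
from vllm import LLM, SamplingParams

# Sampling settings from step 5: top-p=0.95, temperature=0.6, max_tokens=32768.
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=32768)

llm = LLM(model="nvidia/AceReason-Nemotron-7B")

question = "Compute the remainder when 7^100 is divided by 13."  # placeholder
final_prompt = "<|User|>" + question + "<|Assistant|><think>\n"

outputs = llm.generate([final_prompt], sampling_params)
print(outputs[0].outputs[0].text)
```
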
## Evaluation Toolkit

Please check the evaluation code, scripts, and cached prediction files at https://huggingface.co/nvidia/AceReason-Nemotron-14B/blob/main/README_EVALUATION.md


## Correspondence to