---
license: apache-2.0
---
# EurusPRM-Stage2
## Links
- 📜 [Paper](https://arxiv.org/abs/2502.01456)
- 📜 [Blog](https://curvy-check-498.notion.site/Process-Reinforcement-through-Implicit-Rewards-15f4fcb9c42180f1b498cc9b2eaf896f)
- 🤗 [PRIME Collection](https://huggingface.co/PRIME-RL)
- 🤗 [Training Data](https://huggingface.co/datasets/PRIME-RL/EurusPRM-Stage2-Data)
## Introduction
EurusPRM-Stage2 is trained with **[Implicit PRM](https://arxiv.org/abs/2412.01981)**, which obtains process rewards for free: instead of requiring step-level annotations, it simply trains an ORM on the cheaper response-level labels. During inference, implicit process rewards are obtained with a forward pass, by computing the log-likelihood ratio against a reference model at each step.
The key ingredient of Implicit PRM is the reward representation, as demonstrated below:
✨
***Proposition***: Consider an ORM where the reward is parameterized by the log-likelihood ratio of two causal LMs, i.e.
$$
r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}.
$$
Define
$$
q_\phi^t(\mathbf{y}_{<t}, y_t) := \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_i|\mathbf{y}_{<i})}{\pi_\text{ref}(y_i|\mathbf{y}_{<i})}.
$$
$q_\phi^t$ is the exponential average of $r_\phi$ at step $t$:
$$
q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\le t})} \left[ e^{\frac{1}{\beta} r_\phi(\mathbf{y})} \right].
$$
Hence, $q_\phi^t$ represents an exact expectation of the outcome reward $r_\phi$ at step $t$, i.e., the Q value. The process reward at step $t$ is then the difference of consecutive Q values:
$$
r_\phi^t := q_\phi^t - q_\phi^{t-1} = \beta \log \frac{\pi_\phi(y_t|\mathbf{y}_{<t})}{\pi_\text{ref}(y_t|\mathbf{y}_{<t})}.
$$
In other words, process rewards come for free from an ORM trained with this reward parameterization, using only the policy and reference log-probabilities.
## Usage
We show below an example of computing step-level rewards with **EurusPRM-Stage2**:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

coef = 0.001  # beta coefficient used to scale the log-likelihood ratio
d = {'query':'Convert the point $(0,3)$ in rectangular coordinates to polar coordinates. Enter your answer in the form $(r,\\theta),$ where $r > 0$ and $0 \\le \\theta < 2 \\pi.$',
'answer':[
"Step 1: To convert the point (0,3) from rectangular coordinates to polar coordinates, we need to find the radius (r) and the angle theta (\u03b8).",
"Step 2: Find the radius (r). The radius is the distance from the origin (0,0) to the point (0,3). Since the x-coordinate is 0, the distance is simply the absolute value of the y-coordinate. So, r = |3| = 3.",
"Step 3: Find the angle theta (\u03b8). The angle theta is measured counterclockwise from the positive x-axis. Since the point (0,3) lies on the positive y-axis, the angle theta is 90 degrees or \u03c0\/2 radians.",
"Step 4: Write the polar coordinates. The polar coordinates are (r, \u03b8), where r > 0 and 0 \u2264 \u03b8 < 2\u03c0. In this case, r = 3 and \u03b8 = \u03c0\/2.\n\nTherefore, the polar coordinates of the point (0,3) are (3, \u03c0\/2).\n\n\n\\boxed{(3,\\frac{\\pi}{2})}"
]
}
# Load the implicit PRM and the reference model
model = AutoModelForCausalLM.from_pretrained('PRIME-RL/EurusPRM-Stage2')
tokenizer = AutoTokenizer.from_pretrained('PRIME-RL/EurusPRM-Stage2')
ref_model = AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-Math-7B-Instruct')

# Build the full conversation: query plus all solution steps joined by blank lines
input_ids = tokenizer.apply_chat_template([
    {"role": "user", "content": d["query"]},
    {"role": "assistant", "content": "\n\n".join(d["answer"])},
], tokenize=True, add_generation_prompt=False, return_tensors='pt')
attention_mask = input_ids != tokenizer.pad_token_id
# Locate the last-token index of each step by re-tokenizing the prefix up to that step
step_last_tokens = []
for step_num in range(0, len(d["answer"]) + 1):
    conv = tokenizer.apply_chat_template([
        {"role": "user", "content": d["query"]},
        {"role": "assistant", "content": "\n\n".join(d["answer"][:step_num])},
    ], tokenize=False, add_generation_prompt=False)
    conv = conv.strip()
    if step_num != 0 and step_num != len(d['answer']):
        conv += '\n\n'
    current_ids = tokenizer.encode(conv, add_special_tokens=False)
    step_last_tokens.append(len(current_ids) - 2)

inputs = {'input_ids': input_ids, 'attention_mask': attention_mask, 'labels': input_ids}
# Mask out the prompt tokens before the first step boundary
label_mask = torch.tensor([[0] * step_last_tokens[0] + [1] * (input_ids.shape[-1] - step_last_tokens[0])])
step_last_tokens = torch.tensor([step_last_tokens])
def get_logps(model, inputs):
    # Per-token log-probabilities of the labels under the given model
    logits = model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask']).logits
    labels = inputs['labels'][:, 1:].clone().long()
    logits = logits[:, :-1, :]
    labels[labels == -100] = 0
    per_token_logps = torch.gather(logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)).squeeze(2)
    return per_token_logps
with torch.no_grad():
    per_token_logps = get_logps(model, inputs)
    ref_per_token_logps = get_logps(ref_model, inputs)

# Implicit process reward: log-likelihood ratio, scaled by the coefficient and
# accumulated up to the last token of each step
raw_reward = per_token_logps - ref_per_token_logps
beta_reward = coef * raw_reward * label_mask[:, 1:]
beta_reward = beta_reward.cumsum(-1)
beta_reward = beta_reward.gather(dim=-1, index=step_last_tokens[:, 1:])
print(beta_reward)
```
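The printed `beta_reward` holds the accumulated implicit reward at the last token of each step. If per-step rather than cumulative rewards are preferred, a minimal sketch is to take first differences along the step dimension; the variable name `step_rewards` below is illustrative and not part of the original script:
```python
# Convert cumulative step scores into per-step rewards by differencing.
# Assumes `beta_reward` has shape (1, num_steps), as produced above.
step_rewards = beta_reward.clone()
step_rewards[:, 1:] = beta_reward[:, 1:] - beta_reward[:, :-1]
print(step_rewards)  # reward contributed by each individual step
```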
## Evaluation
### Evaluation Code
We use the evaluation code from [Implicit PRM](https://github.com/PRIME-RL/ImplicitPRM/tree/main/eval) to evaluate the performance of EurusPRM. The reference model is **Qwen2.5-Math-7B-Instruct**.
### Evaluation Base Model
For **Best-of-N Sampling**, we adopt **Eurus-2-7B-SFT**, **Qwen2.5-7B-Instruct**, and **Llama-3.1-70B-Instruct** as generation models to evaluate the performance of our implicit PRM. For all models, we set the sampling temperature to 0.5 and top-*p* to 1.0.
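For reference, a minimal sampling sketch under these settings, assuming Hugging Face `transformers` (the model name and prompt below are placeholders, not the exact evaluation harness):
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

gen_tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
gen_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct", torch_dtype=torch.bfloat16)

prompt = gen_tokenizer.apply_chat_template(
    [{"role": "user", "content": "Compute 17 * 23."}],  # placeholder problem
    tokenize=False, add_generation_prompt=True,
)
inputs = gen_tokenizer(prompt, return_tensors="pt")
# Draw 64 candidate solutions per problem with temperature 0.5 and top-p 1.0
outputs = gen_model.generate(
    **inputs, do_sample=True, temperature=0.5, top_p=1.0,
    num_return_sequences=64, max_new_tokens=1024,
)
candidates = gen_tokenizer.batch_decode(
    outputs[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True
)
```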
For **ProcessBench**, we compare **Math-Shepherd-PRM-7B**, **RLHFlow-PRM-Mistral-8B**, **RLHFlow-PRM-Deepseek-8B**, and **Skywork-PRM-7B** as baselines against **EurusPRM-Stage 1** and **EurusPRM-Stage 2**.
### Best-of-N Sampling
We use Best-of-64 as our evaluation metric. The aggregation method used to score a full response differs across PRMs, as listed below; a minimal scoring sketch follows the list.
- For [Skywork-o1-Open-PRM-Qwen-2.5-7B](https://huggingface.co/Skywork/Skywork-o1-Open-PRM-Qwen-2.5-7B), we use simple average reward across all steps.
- For EurusPRM-Stage 1, we use the minimum reward across all steps.
- For EurusPRM-Stage 2, we use the accumulative rewards.
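The sketch below illustrates the three aggregation strategies, assuming `step_rewards` holds the per-step scores a PRM assigns to one candidate response; the function and variable names are illustrative, not the exact evaluation code:
```python
from typing import List

def aggregate(step_rewards: List[float], method: str) -> float:
    """Collapse per-step PRM scores into a single response-level score."""
    if method == "average":   # e.g. Skywork-o1-Open-PRM-Qwen-2.5-7B
        return sum(step_rewards) / len(step_rewards)
    if method == "min":       # e.g. EurusPRM-Stage 1
        return min(step_rewards)
    if method == "sum":       # accumulative reward, e.g. EurusPRM-Stage 2
        return sum(step_rewards)
    raise ValueError(f"unknown method: {method}")

# Best-of-N: pick the candidate with the highest aggregated score, e.g.
# scores = [aggregate(r, "sum") for r in per_candidate_step_rewards]
# best = candidates[scores.index(max(scores))]
```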
**Eurus-2-7B-SFT**
| Method | Reward Model | MATH | AMC | AIME 2024 | OlympiadBench | Minerva Math | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Greedy Pass @ 1 | N/A | 65.1 | 30.1 | 3.3 | 29.8 | 32.7 | 32.2 |
| Majority Voting @ 64 | N/A | 65.6 | 53.0 | 13.3 | 39.1 | 22.4 | 38.7 |
| Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 47.2 | 45.8 | 10.0 | 32.3 | 16.2 | 30.3 |
| | EurusPRM-Stage 1 | 44.6 | 41.0 | 6.7 | 32.9 | 17.3 | 28.5 |
| | EurusPRM-Stage 2 | 47.2 | 43.4 | 13.3 | 33.8 | 19.2 | 31.4 |
| Weighted Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 64.6 | **55.4** | 13.3 | 41.3 | 23.2 | 39.6 |
| | EurusPRM-Stage 1 | **66.0** | 54.2 | 13.3 | 39.6 | **29.0** | **40.4** |
| | EurusPRM-Stage 2 | **66.0** | 54.2 | 13.3 | **39.7** | **29.0** | **40.4** |
**Llama-3.1-70B-Instruct**
| Method | Reward Model | MATH | AMC | AIME 2024 | OlympiadBench | Minerva Math | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Greedy Pass @ 1 | N/A | 64.6 | 30.1 | 16.7 | 31.9 | 35.3 | 35.7 |
| Majority Voting @ 64 | N/A | 80.2 | 53.0 | 26.7 | 40.4 | 38.6 | 47.8 |
| Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 77.8 | 56.6 | 23.3 | 39.0 | 31.6 | 45.7 |
| | EurusPRM-Stage 1 | 77.8 | 44.6 | **26.7** | 35.3 | 41.5 | 45.2 |
| | EurusPRM-Stage 2 | 80.6 | **59.0** | 20.0 | 37.6 | 44.9 | 48.4 |
| Weighted Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | **81.2** | 56.6 | 23.3 | **42.4** | 38.2 | 48.3 |
| | EurusPRM-Stage 1 | 80.4 | 53.0 | **26.7** | 40.9 | **46.7** | **49.5** |
| | EurusPRM-Stage 2 | 80.4 | 53.0 | **26.7** | 41.0 | 46.3 | **49.5** |
**Qwen2.5-7B-Instruct**
| Method | Reward Model | MATH | AMC | AIME 2024 | OlympiadBench | Minerva Math | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Greedy Pass @ 1 | N/A | 73.3 | 47.0 | 13.3 | 39.4 | 35.3 | 41.7 |
| Majority Voting @ 64 | N/A | 82.0 | 53.0 | 16.7 | 43.0 | 36.4 | 46.2 |
| Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 85.2 | **60.2** | **20.0** | **44.7** | 32.7 | 48.6 |
| | EurusPRM-Stage 1 | 81.8 | 47.0 | 16.7 | 40.1 | 41.5 | 45.4 |
| | EurusPRM-Stage 2 | **86.0** | 59.0 | 16.7 | 41.4 | 41.5 | **48.9** |
| Weighted Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 83.6 | 55.4 | 13.3 | 43.7 | 36.8 | 46.6 |
| | EurusPRM-Stage 1 | 82.6 | 53.0 | 16.7 | 42.7 | 45.2 | 48.0 |
| | EurusPRM-Stage 2 | 84.8 | 53.0 | 16.7 | 43.2 | **45.6** | 48.7 |
### ProcessBench
We evaluate **EurusPRM-Stage 1** and **EurusPRM-Stage 2** on **ProcessBench**.
The threshold is obtained by converting each step's raw score with the sigmoid function and then sweeping candidate thresholds to maximize F1 on the GSM8K sub-benchmark. The resulting thresholds for **EurusPRM-Stage 1** and **EurusPRM-Stage 2** are 0.5015 and 0.5005, respectively.
To better leverage the capability of **EurusPRM**, we prepend ``Step K`` (where K is the actual index of the step) to each step in **ProcessBench**. A minimal sketch of the threshold search is shown below.
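The sketch assumes `step_scores` holds the raw per-step scores for one ProcessBench sample and `first_error` is the labeled index of the first erroneous step (-1 if none); the names and F1 computation are illustrative, not the exact evaluation script:
```python
import math
from typing import List, Tuple

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def predict_first_error(step_scores: List[float], threshold: float) -> int:
    """Index of the first step whose sigmoid score falls below the threshold, or -1."""
    for i, s in enumerate(step_scores):
        if sigmoid(s) < threshold:
            return i
    return -1

def f1_at_threshold(samples: List[Tuple[List[float], int]], threshold: float) -> float:
    """ProcessBench-style F1: harmonic mean of accuracy on erroneous and fully-correct solutions."""
    err = [s for s in samples if s[1] != -1]
    cor = [s for s in samples if s[1] == -1]
    acc_err = sum(predict_first_error(sc, threshold) == lab for sc, lab in err) / max(len(err), 1)
    acc_cor = sum(predict_first_error(sc, threshold) == -1 for sc, _ in cor) / max(len(cor), 1)
    return 2 * acc_err * acc_cor / (acc_err + acc_cor) if (acc_err + acc_cor) > 0 else 0.0

# Sweep thresholds near 0.5 on the GSM8K subset and keep the best one, e.g.
# best_t = max((t / 10000 for t in range(4900, 5100)),
#              key=lambda t: f1_at_threshold(gsm8k_samples, t))
```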
| Reward Model | GSM8k | MATH | OlympiadBench | Omni-Math | Avg |
| --- | --- | --- | --- | --- | --- |
| Math-Shepherd-PRM-7B | 47.9 | 29.5 | 24.8 | 23.8 | 31.5 |
| RLHFlow-PRM-Mistral-8B | 50.4 | 33.4 | 13.8 | 15.8 | 28.4 |
| RLHFlow-PRM-Deepseek-8B | 38.8 | 33.8 | 16.9 | 16.9 | 26.6 |
| Skywork-PRM-7B | **70.8** | **53.6** | 22.9 | 21.0 | 42.1 |
| EurusPRM-Stage 1 | 54.7 | 41.2 | 24.7 | 17.5 | 30.6 |
| EurusPRM-Stage 1-no-step | 42.1 | 33.1 | 13.2 | 15.4 | 23.1 |
| EurusPRM-Stage 2 | 67.0 | 53.2 | **35.4** | **30.7** | **42.8** |
| EurusPRM-Stage 2-no-step | 56.6 | 43.0 | 27.3 | 26.8 | 35.1 |
## Citation
```latex
@article{cui2025process,
title={Process reinforcement through implicit rewards},
author={Cui, Ganqu and Yuan, Lifan and Wang, Zefan and Wang, Hanbin and Li, Wendi and He, Bingxiang and Fan, Yuchen and Yu, Tianyu and Xu, Qixin and Chen, Weize and others},
journal={arXiv preprint arXiv:2502.01456},
year={2025}
}
```
```latex
@article{yuan2024implicitprm,
title={Free Process Rewards without Process Labels},
author={Lifan Yuan and Wendi Li and Huayu Chen and Ganqu Cui and Ning Ding and Kaiyan Zhang and Bowen Zhou and Zhiyuan Liu and Hao Peng},
journal={arXiv preprint arXiv:2412.01981},
year={2024}
}
```