---
license: apache-2.0
---
# EurusPRM-Stage2

## Links

- 📜 [Blog](https://curvy-check-498.notion.site/Process-Reinforcement-through-Implicit-Rewards-15f4fcb9c42180f1b498cc9b2eaf896f)
- 🤗 [PRIME Collection](https://huggingface.co/PRIME-RL)
- 🤗 [Training Data](https://huggingface.co/datasets/PRIME-RL/EurusPRM-Stage2-Data)

## Introduction

EurusPRM-Stage2 is trained using **[Implicit PRM](https://arxiv.org/abs/2412.01981)**, which yields process rewards at no additional cost: it only requires training an ORM on the cheaper response-level labels. During inference, implicit process rewards are obtained with a forward pass by computing the log-likelihood ratio at each step.

<img src="./figs/implicit.png" alt="prm" style="zoom: 33%;" />

The key ingredient of Implicit PRM is the reward representation, as demonstrated below:

<aside>***Proposition***: Consider an ORM where the reward is parameterized by the log-likelihood ratio of two causal LMs, i.e.

$$
r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}.
$$

Define

$$
q_\phi^t(\mathbf{y}_{<t}, y_t) := \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})}.
$$

Then \\(q_\phi^t\\) is the exponential average of \\(r_\phi\\) at step \\(t\\):

$$
q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} \left[ e^{\frac{1}{\beta} r_\phi(\mathbf{y})} \right]
$$

Hence, \\(q_\phi^t\\) represents an exact expectation of the outcome reward \\(r_\phi\\) at step \\(t\\), i.e., the Q value.
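As a sanity check of this identity, the following minimal sketch enumerates every continuation of a prefix for two toy causal LMs and compares both sides numerically. The toy models `pi_phi` and `pi_ref`, the value of `BETA`, and all function names are hypothetical and chosen only for illustration; they are not part of the released code.

```python
import itertools
import math

BETA = 0.5      # hypothetical value of the hyperparameter beta
T = 3           # toy response length
VOCAB = (0, 1)  # toy two-token vocabulary

def make_lm(bias):
    """Toy causal LM (illustration only): P(next=1 | prefix) depends on the prefix."""
    def prob(token, prefix):
        logit = bias + 0.7 * sum(prefix) - 0.3 * len(prefix)
        p1 = 1.0 / (1.0 + math.exp(-logit))
        return p1 if token == 1 else 1.0 - p1
    return prob

pi_phi = make_lm(0.4)
pi_ref = make_lm(-0.2)

def seq_prob(lm, tokens, prefix=()):
    """Probability of `tokens` as a continuation of `prefix` under `lm`."""
    p, ctx = 1.0, list(prefix)
    for tok in tokens:
        p *= lm(tok, ctx)
        ctx.append(tok)
    return p

def outcome_reward(y):
    """r_phi(y) = beta * log(pi_phi(y) / pi_ref(y)) for a full response y."""
    return BETA * math.log(seq_prob(pi_phi, y) / seq_prob(pi_ref, y))

def q_sum(prefix):
    """Left-hand side: sum of per-token scaled log-ratios up to step t."""
    return sum(
        BETA * math.log(pi_phi(tok, prefix[:i]) / pi_ref(tok, prefix[:i]))
        for i, tok in enumerate(prefix)
    )

def q_expectation(prefix):
    """Right-hand side: beta * log E_{pi_ref(y | prefix)}[exp(r_phi(y) / beta)]."""
    expect = 0.0
    for cont in itertools.product(VOCAB, repeat=T - len(prefix)):
        y = tuple(prefix) + cont
        expect += seq_prob(pi_ref, cont, prefix) * math.exp(outcome_reward(y) / BETA)
    return BETA * math.log(expect)

for t in range(1, T + 1):
    for prefix in itertools.product(VOCAB, repeat=t):
        assert math.isclose(q_sum(prefix), q_expectation(prefix), abs_tol=1e-9)
```

The two sides agree because the ratio over the continuation sums to one under \\(\pi_\phi\\), leaving only the ratio over the prefix.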

The proposition indicates that when modeling

$$
r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}
$$

to train an ORM with the standard pipeline, where \\(\beta\\) is a hyperparameter, \\(\pi_\phi\\) implicitly learns a Q function. Hence, the process reward \\(r_\phi^t\\) can be obtained by:

$$
r_\phi^t := q_\phi^t - q_\phi^{t-1} = \beta \log \frac{\pi_\phi(y_{t}|\mathbf{y}_{<t})}{\pi_\text{ref}(y_{t}|\mathbf{y}_{<t})}.
$$

Therefore, we can indeed obtain PRMs simply by collecting response-level data and training an ORM, without any burden of annotating step labels.
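To make the step-level formula concrete, here is a minimal sketch that turns per-token log-probabilities into per-step implicit rewards. It relies on the chain rule: the log-ratio of a step is the sum of the log-ratios of its tokens. The function name `step_rewards`, the argument layout, and the numbers in the example are assumptions made for illustration.

```python
import math

def step_rewards(logp_phi, logp_ref, step_ends, beta):
    """Per-step implicit rewards r^t = q^t - q^{t-1}.

    logp_phi, logp_ref: per-token log-probabilities of the response tokens
                        under the trained model and the reference model.
    step_ends:          exclusive end index of each step in the token sequence.
    beta:               scaling hyperparameter.
    """
    rewards, start = [], 0
    for end in step_ends:
        # Sum the token-level log-likelihood ratios that belong to this step.
        log_ratio = sum(p - r for p, r in zip(logp_phi[start:end], logp_ref[start:end]))
        rewards.append(beta * log_ratio)
        start = end
    return rewards

# Hypothetical log-probabilities for a four-token response split into two steps.
logp_phi = [-0.5, -1.0, -0.2, -0.3]
logp_ref = [-0.7, -0.9, -0.6, -0.1]
rewards = step_rewards(logp_phi, logp_ref, step_ends=[2, 4], beta=0.001)

# The step rewards sum to the outcome reward r_phi(y) of the whole response.
outcome = 0.001 * (sum(logp_phi) - sum(logp_ref))
assert math.isclose(sum(rewards), outcome, abs_tol=1e-12)
```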

The proposition is **agnostic to the specific choice of training objective for ORMs**. It can be instantiated with the same objectives as vanilla ORM training, the only difference being that \\(r_\phi \left( \mathbf{y} \right)\\) is replaced with \\(\beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}\\).

For example, DPO already meets our assumption and serves as a strong variant, while in this work, we instantiate our implicit PRM with cross entropy (CE) loss due to memory efficiency:

$$
\small \mathcal{L}_{CE} = l \cdot \log \sigma \left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) + (1 - l) \cdot \log \left[ 1 - \sigma \left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) \right]
$$
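A minimal sketch of this objective for a single response, in plain Python, is shown below. The expression above, as written, is the log-likelihood of the label \\(l\\); the sketch returns its negation so that lower values are better, which is an assumption about the intended sign convention. The names `log_sigmoid` and `implicit_prm_ce` are hypothetical.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def implicit_prm_ce(logp_phi, logp_ref, label, beta):
    """Negated CE objective for one response.

    logp_phi, logp_ref: per-token log-probabilities of the response under the
                        trained model and the reference model.
    label:              1 if the response is correct, 0 otherwise.
    """
    # r_phi(y) = beta * log(pi_phi(y) / pi_ref(y)), computed in log space.
    reward = beta * (sum(logp_phi) - sum(logp_ref))
    # log(1 - sigmoid(x)) equals log(sigmoid(-x)).
    log_lik = label * log_sigmoid(reward) + (1 - label) * log_sigmoid(-reward)
    return -log_lik

# Hypothetical log-probabilities: the trained model prefers this response.
loss_if_correct = implicit_prm_ce([-0.2, -0.3], [-1.0, -1.5], label=1, beta=1.0)
loss_if_wrong = implicit_prm_ce([-0.2, -0.3], [-1.0, -1.5], label=0, beta=1.0)
assert loss_if_correct < loss_if_wrong
```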

We started the second-stage training on top of [EurusPRM-Stage1](https://huggingface.co/PRIME-RL/EurusPRM-Stage1) with fine-grained step-level labels. To obtain step-level labels, we employed Llama-3.1-70B-Inst and Qwen2.5-72B-Inst to insert nuanced errors into correct solutions. We also mixed in response-level data in this stage. The model was continually trained with \\(\mathcal{L}_{CE}\\) using a learning rate of 5e-7 and a batch size of 64.

## Usage

We show an example leveraging **EurusPRM-Stage2** below:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

coef = 0.001  # scaling coefficient applied to the log-likelihood ratio (beta in the formulas above)
d = {'query':'Convert the point $(0,3)$ in rectangular coordinates to polar coordinates.  Enter your answer in the form $(r,\\theta),$ where $r > 0$ and $0 \\le \\theta < 2 \\pi.$',
     'answer':[
            "Step 1: To convert the point (0,3) from rectangular coordinates to polar coordinates, we need to find the radius (r) and the angle theta (\u03b8).",
            "Step 1: Find the radius (r). The radius is the distance from the origin (0,0) to the point (0,3). Since the x-coordinate is 0, the distance is simply the absolute value of the y-coordinate. So, r = |3| = 3.",
            "Step 2: Find the angle theta (\u03b8). The angle theta is measured counterclockwise from the positive x-axis. Since the point (0,3) lies on the positive y-axis, the angle theta is 90 degrees or \u03c0/2 radians.",
            "Step 3: Write the polar coordinates. The polar coordinates are (r, \u03b8), where r > 0 and 0 \u2264 \u03b8 < 2\u03c0. In this case, r = 3 and \u03b8 = \u03c0/2.\n\nTherefore, the polar coordinates of the point (0,3) are (3, \u03c0/2).\n\n\n\\boxed{(3,\\frac{\\pi}{2})}"
     ]
     }
model = AutoModelForCausalLM.from_pretrained('PRIME-RL/EurusPRM-Stage2')
tokenizer = AutoTokenizer.from_pretrained('PRIME-RL/EurusPRM-Stage2')
ref_model = AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-Math-7B-Instruct')
input_ids = tokenizer.apply_chat_template([
    {"role": "user", "content": d["query"]},
    {"role": "assistant", "content": "\n\n".join(d["answer"])},
], tokenize=True, add_generation_prompt=False,return_tensors='pt')
attention_mask = input_ids!=tokenizer.pad_token_id
step_last_tokens = []
for step_num in range(0, len(d["answer"])+1):
    conv = tokenizer.apply_chat_template([
        {"role":"user", "content":d["query"]},
        {"role":"assistant", "content":"\n\n".join(d["answer"][:step_num])},
    ], tokenize=False, add_generation_prompt=False)
    conv = conv.strip()
    if step_num!=0 and step_num!=len(d['answer']):
        conv+='\n\n'
    current_ids = tokenizer.encode(conv,add_special_tokens=False)
    step_last_tokens.append(len(current_ids) - 2)


inputs = {'input_ids':input_ids,'attention_mask':attention_mask,'labels':input_ids}
label_mask = torch.tensor([[0]*step_last_tokens[0]+[1]*(input_ids.shape[-1]-step_last_tokens[0])])
step_last_tokens = torch.tensor([step_last_tokens])

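def get_logps(model, inputs):
    """Return the log-probability of each token given its prefix."""
    logits = model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask']).logits
    # Shift so that the logits at position i score the token at position i+1.
    labels = inputs['labels'][:, 1:].clone().long()
    logits = logits[:, :-1, :]
    labels[labels == -100] = 0  # replace ignore-index labels so gather() receives valid indices
    per_token_logps = torch.gather(logits.log_softmax(-1), dim=2, index=labels.unsqueeze(2)).squeeze(2)
    return per_token_logps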

with torch.no_grad():
    per_token_logps = get_logps(model, inputs)
    ref_per_token_logps = get_logps(ref_model,inputs)

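# Per-token log-likelihood ratio between the PRM and the reference model.
raw_reward = per_token_logps - ref_per_token_logps
# Scale by the coefficient and zero out the prompt tokens.
beta_reward = coef * raw_reward * label_mask[:,1:]
# Accumulate over tokens, then read off the cumulative reward at the end of each step.
beta_reward = beta_reward.cumsum(-1)
beta_reward = beta_reward.gather(dim=-1, index=step_last_tokens[:,1:])
print(beta_reward)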
```

## Evaluation

### Evaluation Base Model

We adopt **Eurus-2-7B-SFT**, **Qwen2.5-7B-Instruct**, and **Llama-3.1-70B-Instruct** as generation models to evaluate the performance of our implicit PRM. For all models, we set the sampling temperature to 0.5 and the top-*p* value to 1.

### Best-of-N Sampling

We use Best-of-64 as our evaluation metric. The method for aggregating step rewards into a response-level score differs across the PRMs below.

- For [Skywork-o1-Open-PRM-Qwen-2.5-7B](https://huggingface.co/Skywork/Skywork-o1-Open-PRM-Qwen-2.5-7B), we use simple average reward across all steps.
- For EurusPRM-Stage 1, we use the minimum reward across all steps.
- For EurusPRM-Stage 2, we use the cumulative reward over all steps.
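The three aggregation rules above can be sketched as follows, together with the selection step. The Best-of-N rule picks the response with the highest aggregated score. The weighted variant shown here sums the scores of all responses that share a final answer and picks the answer with the largest total; this is the common definition and an assumption on our part, since it is not spelled out here. All function names and the example rewards are hypothetical.

```python
from collections import defaultdict

def aggregate(step_rewards, method):
    """Collapse per-step rewards into a single response-level score."""
    if method == "mean":
        return sum(step_rewards) / len(step_rewards)
    if method == "min":
        return min(step_rewards)
    if method == "sum":
        return sum(step_rewards)
    raise ValueError(f"unknown aggregation method: {method}")

def best_of_n(candidates, method):
    """candidates: list of (final_answer, step_rewards). Return the top answer."""
    return max(candidates, key=lambda c: aggregate(c[1], method))[0]

def weighted_best_of_n(candidates, method):
    """Sum the scores of responses sharing an answer (assumed definition)."""
    totals = defaultdict(float)
    for answer, rewards in candidates:
        totals[answer] += aggregate(rewards, method)
    return max(totals, key=totals.get)

# Hypothetical step rewards for three sampled responses.
candidates = [
    ("3", [0.2, 0.1, 0.3]),
    ("5", [0.9, -0.5, 0.5]),
    ("3", [0.25, 0.2]),
]
picked_by_sum = best_of_n(candidates, "sum")
picked_by_min = best_of_n(candidates, "min")
picked_weighted = weighted_best_of_n(candidates, "sum")
```

In this toy example the minimum rule penalizes the response with one very low step, while the sum rule favors it; the weighted variant lets two moderately scored responses with the same answer outvote a single high-scoring one.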

**Eurus-2-7B-SFT**

| Method | Reward Model | MATH | AMC | AIME 2024 | OlympiadBench | Minerva Math | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Greedy Pass @ 1 | N/A | 65.1 | 30.1 | 3.3 | 29.8 | 32.7 | 32.2 |
| Majority Voting @ 64 | N/A | 65.6 | 53.0 | 13.3 | 39.1 | 22.4 | 38.7 |
| Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 47.2 | 45.8 | 10.0 | 32.3 | 16.2 | 30.3 |
|  | EurusPRM-Stage 1 | 44.6 | 41.0 | 6.7 | 32.9 | 17.3 | 28.5 |
|  | EurusPRM-Stage 2 | 47.2 | 43.4 | 13.3 | 33.8 | 19.2 | 31.4 |
| Weighted Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 64.6 | **55.4** | 13.3 | 41.3 | 23.2 | 39.6 |
|  | EurusPRM-Stage 1 | **66.0** | 54.2 | 13.3 | 39.6 | **29.0** | **40.4** |
|  | EurusPRM-Stage 2 | **66.0** | 54.2 | 13.3 | **39.7** | **29.0** | **40.4** |

**Llama-3.1-70B-Instruct**

| Method | Reward Model | MATH | AMC | AIME 2024 | OlympiadBench | Minerva Math | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Greedy Pass @ 1 | N/A | 64.6 | 30.1 | 16.7 | 31.9 | 35.3 | 35.7 |
| Majority Voting @ 64 | N/A | 80.2 | 53.0 | 26.7 | 40.4 | 38.6 | 47.8 |
| Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 77.8 | 56.6 | 23.3 | 39.0 | 31.6 | 45.7 |
|  | EurusPRM-Stage 1 | 77.8 | 44.6 | **26.7** | 35.3 | 41.5 | 45.2 |
|  | EurusPRM-Stage 2 | 80.6 | **59.0** | 20.0 | 37.6 | 44.9 | 48.4 |
| Weighted Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | **81.2** | 56.6 | 23.3 | **42.4** | 38.2 | 48.3 |
|  | EurusPRM-Stage 1 | 80.4 | 53.0 | **26.7** | 40.9 | **46.7** | **49.5** |
|  | EurusPRM-Stage 2 | 80.4 | 53.0 | **26.7** | 41.0 | 46.3 | **49.5** |

**Qwen2.5-7B-Instruct**

| Method | Reward Model | MATH | AMC | AIME 2024 | OlympiadBench | Minerva Math | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Greedy Pass @ 1 | N/A | 73.3 | 47.0 | 13.3 | 39.4 | 35.3 | 41.7 |
| Majority Voting @ 64 | N/A | 82.0 | 53.0 | 16.7 | 43.0 | 36.4 | 46.2 |
| Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 85.2 | **60.2** | **20.0** | **44.7** | 32.7 | 48.6 |
|  | EurusPRM-Stage 1 | 81.8 | 47.0 | 16.7 | 40.1 | 41.5 | 45.4 |
|  | EurusPRM-Stage 2 | **86.0** | 59.0 | 16.7 | 41.4 | 41.5 | **48.9** |
| Weighted Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 83.6 | 55.4 | 13.3 | 43.7 | 36.8 | 46.6 |
|  | EurusPRM-Stage 1 | 82.6 | 53.0 | 16.7 | 42.7 | 45.2 | 48.0 |
|  | EurusPRM-Stage 2 | 84.8 | 53.0 | 16.7 | 43.2 | **45.6** | 48.7 |



## Citation

```latex
@misc{cui2024process,
  title={Process Reinforcement through Implicit Rewards},
  author={Ganqu Cui and Lifan Yuan and Zefan Wang and Hanbin Wang and Wendi Li and Bingxiang He and Yuchen Fan and Tianyu Yu and Qixin Xu and Weize Chen and Jiarui Yuan and Huayu Chen and Kaiyan Zhang and Xingtai Lv and Shuo Wang and Yuan Yao and Hao Peng and Yu Cheng and Zhiyuan Liu and Maosong Sun and Bowen Zhou and Ning Ding},
  year={2025}
}
```

```latex
@article{yuan2024implicitprm,
  title={Free Process Rewards without Process Labels},
  author={Lifan Yuan and Wendi Li and Huayu Chen and Ganqu Cui and Ning Ding and Kaiyan Zhang and Bowen Zhou and Zhiyuan Liu and Hao Peng},
  journal={arXiv preprint arXiv:2412.01981},
  year={2024}
}
```