yuchenFan committed
Commit 06334c3 · 1 Parent(s): 74ad90c

Upload README.md

Files changed (1)
  1. README.md +3 -5
README.md CHANGED
@@ -62,11 +62,11 @@ $$
  \small \mathcal{L}_{CE} = l \cdot \log \sigma \left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) + (1 - l) \cdot \log \left[ 1 - \sigma \left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) \right]
  $$

- We applied the above \\(\mathcal{L}_{CE}\\) to train the implicit PRM, with a learning rate of 5e-7 and a batch size of 64.
+ We applied \\(\mathcal{L}_{CE}\\) to train the implicit PRM, with a learning rate of 5e-7 and a batch size of 64.

  ## Usage

- We show an example leveraging **EurusPRM-Stage2** below:
+ We show an example leveraging **EurusPRM-Stage1** below:

  ```python
  coef=0.001
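For concreteness, here is a minimal PyTorch sketch of the cross-entropy objective in the hunk above. It assumes the summed token log-probabilities of each response \\(\mathbf{y}\\) under the policy \\(\pi_\phi\\) and the frozen reference model \\(\pi_\text{ref}\\) have already been gathered; the function and variable names are illustrative, not taken from the EurusPRM codebase, and \\(\beta\\) is left as a parameter since its training value is not stated in this excerpt.

```python
import torch
import torch.nn.functional as F

def implicit_prm_ce_loss(policy_logprob: torch.Tensor,
                         ref_logprob: torch.Tensor,
                         label: torch.Tensor,
                         beta: float) -> torch.Tensor:
    """policy_logprob / ref_logprob: summed token log-probs of each response y,
    shape (batch,); label: 1.0 for a correct response, 0.0 otherwise."""
    # beta * log(pi_phi(y) / pi_ref(y)) is the implicit reward of the response.
    implicit_reward = beta * (policy_logprob - ref_logprob)
    # binary_cross_entropy_with_logits computes exactly -L_CE, so minimizing
    # this loss maximizes the objective written above.
    return F.binary_cross_entropy_with_logits(implicit_reward, label)
```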
@@ -126,7 +126,6 @@ We use Best-of-64 as our evaluation metric. The weighting methods are different

  - For [Skywork-o1-Open-PRM-Qwen-2.5-7B](https://huggingface.co/Skywork/Skywork-o1-Open-PRM-Qwen-2.5-7B), we use the simple average reward across all steps.
  - For EurusPRM-Stage 1, we use the minimum reward across all steps.
- - For EurusPRM-Stage 2, we use the accumulated reward across all steps.
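A minimal sketch of these three aggregation rules, assuming `step_rewards` holds one process reward per step of a candidate solution (the function and variable names are illustrative, not from the EurusPRM codebase):

```python
def aggregate_step_rewards(step_rewards: list[float], method: str) -> float:
    """Collapse per-step process rewards into a single score for one candidate."""
    if method == "average":   # Skywork-o1-Open-PRM-Qwen-2.5-7B
        return sum(step_rewards) / len(step_rewards)
    if method == "min":       # EurusPRM-Stage 1
        return min(step_rewards)
    if method == "sum":       # accumulative reward (the removed Stage 2 rule)
        return sum(step_rewards)
    raise ValueError(f"unknown aggregation method: {method}")

# Best-of-N then keeps the candidate whose aggregated score is highest, e.g.:
# best = max(candidates, key=lambda c: aggregate_step_rewards(c.step_rewards, method))
```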
 
  **Eurus-2-7B-SFT**
 
@@ -156,9 +155,8 @@ We use Best-of-64 as our evaluation metric. The weighting methods are different
  | --- | --- | --- | --- | --- | --- | --- | --- |
  | Greedy Pass @ 1 | N/A | 73.3 | 47.0 | 13.3 | 39.4 | 35.3 | 41.7 |
  | Majority Voting @ 64 | N/A | 82.0 | 53.0 | 16.7 | 43.0 | 36.4 | 46.2 |
- | Best-of-N @ 64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | **85.2** | **60.2** | **20.0** | **44.7** | 32.7 | 48.6 |
+ | Best-of-N @ 64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | **85.2** | **60.2** | **20.0** | **44.7** | 32.7 | **48.6** |
  | | EurusPRM-Stage 1 | 81.8 | 47.0 | 16.7 | 40.1 | 41.5 | 45.4 |
- | | EurusPRM-Stage 2 | **86.0** | 59.0 | 16.7 | 41.4 | 41.5 | **48.9** |
  | Weighted Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 83.6 | 55.4 | 13.3 | 43.7 | 36.8 | 46.6 |
  | | EurusPRM-Stage 1 | 82.6 | 53.0 | 16.7 | 42.7 | **45.2** | 48.0 |
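The table also reports Weighted Best-of-64 alongside plain Best-of-N. The exact weighting scheme is not spelled out in this excerpt; one common construction, shown here purely as an illustrative assumption, is reward-weighted voting: group candidates by final answer, sum their scores, and return the answer with the largest total.

```python
from collections import defaultdict

def weighted_best_of_n(candidates: list[tuple[str, float]]) -> str:
    """Hypothetical reward-weighted voting over (final_answer, score) pairs.
    Plain majority voting is the special case where every score is 1.0."""
    totals: dict[str, float] = defaultdict(float)
    for answer, score in candidates:
        totals[answer] += score
    # Return the answer whose candidates accumulated the most total score.
    return max(totals, key=totals.get)
```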