Upload README.md
README.md CHANGED
@@ -62,11 +62,11 @@ $$
 \small \mathcal{L}_{CE} = l \cdot \log \sigma \left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) + (1 - l) \cdot \log \left[ 1 - \sigma \left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) \right]
 $$

-We applied
+We applied \\(\mathcal{L}_{CE}\\) to train the implicit PRM, using a learning rate of 5e-7 and a batch size of 64.

 ## Usage

-We show an example leveraging **EurusPRM-
+We show an example leveraging **EurusPRM-Stage1** below:

 ```python
 coef=0.001
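
A minimal PyTorch sketch of the \\(\mathcal{L}_{CE}\\) objective above, assuming per-response log-probabilities are already summed over tokens; the function name and the \\(\beta\\) default are illustrative, not taken from the repository:

```python
import torch.nn.functional as F

def implicit_prm_ce_loss(logp_policy, logp_ref, labels, beta=0.05):
    """Cross-entropy loss on the implicit reward beta * log(pi_phi(y) / pi_ref(y)).

    logp_policy, logp_ref: (batch,) summed log-probabilities of each response y
    labels: (batch,) binary correctness labels l in {0, 1}
    beta: illustrative default; this excerpt does not state the value used.
    """
    # beta * log(pi_phi(y) / pi_ref(y)) is the logit passed to sigma(.)
    logits = beta * (logp_policy - logp_ref)
    # Computes -[l * log sigma(z) + (1 - l) * log(1 - sigma(z))], i.e. the
    # negation of the expression above, so that it can be minimized.
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```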
@@ -126,7 +126,6 @@ We use Best-of-64 as our evaluation metric. The weighting methods are different

 - For [Skywork-o1-Open-PRM-Qwen-2.5-7B](https://huggingface.co/Skywork/Skywork-o1-Open-PRM-Qwen-2.5-7B), we use the simple average reward across all steps.
 - For EurusPRM-Stage 1, we use the minimum reward across all steps.
-- For EurusPRM-Stage 2, we use the accumulative rewards.

 **Eurus-2-7B-SFT**
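
A small sketch of the two aggregation rules above (the helper name is illustrative):

```python
def aggregate_step_rewards(step_rewards: list[float], method: str) -> float:
    """Collapse per-step process rewards into one score per sampled response."""
    if method == "mean":  # Skywork-o1-Open-PRM-Qwen-2.5-7B: simple average across steps
        return sum(step_rewards) / len(step_rewards)
    if method == "min":  # EurusPRM-Stage 1: the weakest step determines the score
        return min(step_rewards)
    raise ValueError(f"unknown aggregation method: {method}")
```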
@@ -156,9 +155,8 @@ We use Best-of-64 as our evaluation metric. The weighting methods are different
 | --- | --- | --- | --- | --- | --- | --- | --- |
 | Greedy Pass @ 1 | N/A | 73.3 | 47.0 | 13.3 | 39.4 | 35.3 | 41.7 |
 | Majority Voting @ 64 | N/A | 82.0 | 53.0 | 16.7 | 43.0 | 36.4 | 46.2 |
-| Best-of-N @ 64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | **85.2** | **60.2** | **20.0** | **44.7** | 32.7 | 48.6 |
+| Best-of-N @ 64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | **85.2** | **60.2** | **20.0** | **44.7** | 32.7 | **48.6** |
 | | EurusPRM-Stage 1 | 81.8 | 47.0 | 16.7 | 40.1 | 41.5 | 45.4 |
-| | EurusPRM-Stage 2 | **86.0** | 59.0 | 16.7 | 41.4 | 41.5 | **48.9** |
 | Weighted Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 83.6 | 55.4 | 13.3 | 43.7 | 36.8 | 46.6 |
 | | EurusPRM-Stage 1 | 82.6 | 53.0 | 16.7 | 42.7 | **45.2** | 48.0 |
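
On the two selection metrics in the table: Best-of-N keeps the answer of the single top-scored candidate, while Weighted Best-of-N votes over candidates that share a final answer. The exact weighting is not spelled out in this excerpt; a common scheme, sketched here under that assumption, sums candidate scores per distinct answer:

```python
from collections import defaultdict

def best_of_n(answers, scores):
    """Plain Best-of-N: return the answer of the single top-scored candidate."""
    return max(zip(answers, scores), key=lambda pair: pair[1])[0]

def weighted_best_of_n(answers, scores):
    """Weighted Best-of-N: sum scores over candidates sharing a final answer."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)
```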