Upload README.md
README.md CHANGED
@@ -62,11 +62,11 @@ $$
 \small \mathcal{L}_{CE} = l \cdot \log \sigma \left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) + (1 - l) \cdot \log \left[ 1 - \sigma \left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) \right]
 $$

-We applied
+We applied \\(\mathcal{L}_{CE}\\) to train the implicit PRM, using a learning rate of 5e-7 and a batch size of 64.

 ## Usage

-We show an example leveraging **EurusPRM-
+We show an example leveraging **EurusPRM-Stage1** below:

 ```python
 coef=0.001
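
A minimal PyTorch sketch of the \\(\mathcal{L}_{CE}\\) objective above, assuming per-response log-probabilities are already summed over tokens; the function name and the \\(\beta\\) default are illustrative, not taken from the repository:

```python
import torch.nn.functional as F

def implicit_prm_ce_loss(logp_policy, logp_ref, labels, beta=0.05):
    """Cross-entropy loss on the implicit reward beta * log(pi_phi(y) / pi_ref(y)).

    logp_policy, logp_ref: (batch,) summed log-probabilities of each response y
    labels: (batch,) binary correctness labels l in {0, 1}
    beta: illustrative default; this excerpt does not state the value used.
    """
    # beta * log(pi_phi(y) / pi_ref(y)) is the logit passed to sigma(.)
    logits = beta * (logp_policy - logp_ref)
    # Computes -[l * log sigma(z) + (1 - l) * log(1 - sigma(z))], i.e. the
    # negation of the expression above, so that it can be minimized.
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```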
@@ -126,7 +126,6 @@ We use Best-of-64 as our evaluation metric. The weighting methods are different

 - For [Skywork-o1-Open-PRM-Qwen-2.5-7B](https://huggingface.co/Skywork/Skywork-o1-Open-PRM-Qwen-2.5-7B), we use the simple average reward across all steps.
 - For EurusPRM-Stage 1, we use the minimum reward across all steps.
-- For EurusPRM-Stage 2, we use the accumulative rewards.

 **Eurus-2-7B-SFT**
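
A small sketch of the two aggregation rules above (the helper name is illustrative):

```python
def aggregate_step_rewards(step_rewards: list[float], method: str) -> float:
    """Collapse per-step process rewards into one score per sampled response."""
    if method == "mean":  # Skywork-o1-Open-PRM-Qwen-2.5-7B: simple average across steps
        return sum(step_rewards) / len(step_rewards)
    if method == "min":  # EurusPRM-Stage 1: the weakest step determines the score
        return min(step_rewards)
    raise ValueError(f"unknown aggregation method: {method}")
```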
@@ -156,9 +155,8 @@ We use Best-of-64 as our evaluation metric. The weighting methods are different
 | --- | --- | --- | --- | --- | --- | --- | --- |
 | Greedy Pass @ 1 | N/A | 73.3 | 47.0 | 13.3 | 39.4 | 35.3 | 41.7 |
 | Majority Voting @ 64 | N/A | 82.0 | 53.0 | 16.7 | 43.0 | 36.4 | 46.2 |
-| Best-of-N @ 64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | **85.2** | **60.2** | **20.0** | **44.7** | 32.7 | 48.6 |
+| Best-of-N @ 64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | **85.2** | **60.2** | **20.0** | **44.7** | 32.7 | **48.6** |
 | | EurusPRM-Stage 1 | 81.8 | 47.0 | 16.7 | 40.1 | 41.5 | 45.4 |
-| | EurusPRM-Stage 2 | **86.0** | 59.0 | 16.7 | 41.4 | 41.5 | **48.9** |
 | Weighted Best-of-64 | Skywork-o1-Open-PRM-Qwen-2.5-7B | 83.6 | 55.4 | 13.3 | 43.7 | 36.8 | 46.6 |
 | | EurusPRM-Stage 1 | 82.6 | 53.0 | 16.7 | 42.7 | **45.2** | 48.0 |
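
On the two selection metrics in the table: Best-of-N keeps the answer of the single top-scored candidate, while Weighted Best-of-N votes over candidates that share a final answer. The exact weighting is not spelled out in this excerpt; a common scheme, sketched here under that assumption, sums candidate scores per distinct answer:

```python
from collections import defaultdict

def best_of_n(answers, scores):
    """Plain Best-of-N: return the answer of the single top-scored candidate."""
    return max(zip(answers, scores), key=lambda pair: pair[1])[0]

def weighted_best_of_n(answers, scores):
    """Weighted Best-of-N: sum scores over candidates sharing a final answer."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)
```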