Update README.md
README.md CHANGED
```diff
@@ -7,28 +7,41 @@ base_model:
 ---

 # Introduction
-
+We propose **GenPRM**, a strong generative process reward model with the following features:
+
+- performing explicit **CoT reasoning** and **code verification** before providing the process judgment;
+- improving Monte Carlo estimation and hard labels with **Relative Progress Estimation (RPE)**;
+- supporting GenPRM **test-time scaling** in a parallel manner with majority voting;
+- supporting policy model test-time scaling with GenPRM as **verifiers** or **critics**.

 GenPRM achieves state-of-the-art performance across multiple benchmarks in two key roles:
-
-- As a
+- **As a verifier**: GenPRM-7B outperforms all classification-based PRMs of comparable size and even surpasses **Qwen2.5-Math-PRM-72B** via test-time scaling.
+- **As a critic**: GenPRM-7B demonstrates superior critique capabilities, achieving **3.4×** greater performance gains than DeepSeek-R1-Distill-Qwen-7B after 3 refinement iterations.

 [figure]

-- Project Page: [GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning](https://ryanliu112.github.io/GenPRM
+- Project Page: [GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning](https://ryanliu112.github.io/GenPRM)
 - Paper: [https://arxiv.org/abs/2504.00891](https://arxiv.org/abs/2504.00891)
 - Code: [https://github.com/RyanLiu112/GenPRM](https://github.com/RyanLiu112/GenPRM)
+- Awesome Process Reward Models: [Awesome Process Reward Models](https://github.com/RyanLiu112/Awesome-Process-Reward-Models)
+- HF Paper Link: [GenPRM: Scaling Test-Time Compute of Process Reward Models via Generative Reasoning](https://hf.co/papers/2504.00891)
+- HF Collection: [GenPRM](https://hf.co/collections/GenPRM/genprm-67ee4936234ba5dd16bb9943)

 # Model details

-For full training details, please refer to our [paper](https://arxiv.org/abs/2504.00891)
-- Training data: the 23K conversation data are released in [GenPRM-MATH-Data](https://huggingface.co/datasets/GenPRM/GenPRM-MATH-Data).
-- Base model: we select the [DeepSeek-R1-Distill series](https://huggingface.co/deepseek-ai) (1.5B, 7B, 32B) as our base models
+For full training details, please refer to our [paper](https://arxiv.org/abs/2504.00891).
+
+- Training data: 23K SFT data is released in [GenPRM-MATH-Data](https://huggingface.co/datasets/GenPRM/GenPRM-MATH-Data).
+- Base model: we use the [DeepSeek-R1-Distill series](https://huggingface.co/deepseek-ai) (1.5B, 7B, and 32B) as our base models.

 # How to use

-The evaluation and testing code for GenPRM are available in our GitHub repository: [https://github.com/RyanLiu112/GenPRM](https://github.com/RyanLiu112/GenPRM)
-
+The evaluation code of GenPRM is available in our GitHub repository: [https://github.com/RyanLiu112/GenPRM](https://github.com/RyanLiu112/GenPRM).
+
+Here's a minimal example of using GenPRM for rationale generation and process supervision:
+
 ```python
 from transformers import AutoTokenizer
 from vllm import LLM, SamplingParams
```
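The hunk ends inside the usage example, so only the imports appear above. As a rough sketch only (the model id `GenPRM/GenPRM-7B`, the prompt wording, and the `\boxed{...}` answer parsing are assumptions rather than the repository's documented interface), a parallel, majority-voted verification call with vLLM might look like this:

```python
import re
from collections import Counter

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# NOTE: hypothetical checkpoint id; pick the actual model from the GenPRM collection on Hugging Face.
MODEL_PATH = "GenPRM/GenPRM-7B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
llm = LLM(model=MODEL_PATH)

# Illustrative verification prompt; the real template is defined in the GitHub repository.
problem = "Janet has 3 apples and buys 2 more. How many apples does she have?"
step = "3 + 2 = 5, so Janet has 5 apples."
messages = [{
    "role": "user",
    "content": (
        f"Problem:\n{problem}\n\nStep to verify:\n{step}\n\n"
        "Reason step by step (you may write and check code), then answer "
        "\\boxed{1} if the step is correct and \\boxed{0} otherwise."
    ),
}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Sample several rationales in parallel and majority-vote the final judgment,
# a simple form of the parallel test-time scaling described in the introduction.
sampling_params = SamplingParams(n=8, temperature=0.7, max_tokens=2048)
completions = llm.generate([prompt], sampling_params)[0].outputs

votes = []
for completion in completions:
    match = re.search(r"\\boxed\{([01])\}", completion.text)
    if match:
        votes.append(match.group(1))

verdict = Counter(votes).most_common(1)[0][0] if votes else "0"
print(f"votes={votes} -> step judged {'correct' if verdict == '1' else 'incorrect'}")
```

Sampling several rationales and voting over their judgments mirrors the majority-voting test-time scaling described in the introduction; refer to the GitHub repository for the actual prompts and evaluation scripts.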