$$
q_\phi^t(\mathbf{y}_{<t}, y_t) := \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})}.
$$

is the exponential average of \\(r_\phi\\) at step \\(t\\):

$$
q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} \left[ e^{\frac{1}{\beta} r_\phi(\mathbf{y})} \right]
$$

Hence, **\\(q_\phi^t\\)** represents an exact expectation of the outcome reward **\\(r_\phi\\)** at step \\(t\\), i.e., the Q value.
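
As a sanity check, at the final step \\(t = |\mathbf{y}|\\) the per-token log-ratios telescope into the full-sequence ratio, so the Q value reduces to the outcome reward itself:

$$
q_\phi^{|\mathbf{y}|} = \sum_{i=1}^{|\mathbf{y}|} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})} = \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} = r_\phi(\mathbf{y}).
$$
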
The proposition indicates that when modeling

$$
r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}
$$

to train an ORM with the standard pipeline, where \\(\beta\\) is a hyperparameter, \\(\phi\\) implicitly learns a Q function. Hence, the process reward \\(r_\phi^t\\) can be obtained by:

$$
r_\phi^t := q_\phi^t - q_\phi^{t-1} = \beta \log \frac{\pi_\phi(y_{t}|\mathbf{y}_{<t})}{\pi_\text{ref}(y_{t}|\mathbf{y}_{<t})}.
$$

Therefore, we can indeed obtain PRMs simply by collecting response-level data and training an ORM, without any burden of annotating step labels.
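
In code, the step rewards of a response can be read off directly from token-level log-probabilities under \\(\pi_\phi\\) and \\(\pi_\text{ref}\\). The snippet below is a minimal sketch of that computation; it assumes the per-token log-probabilities and step boundaries are already available, and the names (`implicit_process_rewards`, `logp_policy`, `logp_ref`, `step_ends`) are illustrative rather than part of this repository.

```python
import torch

def implicit_process_rewards(
    logp_policy: torch.Tensor,  # (T,) log pi_phi(y_i | y_<i) for each response token
    logp_ref: torch.Tensor,     # (T,) log pi_ref(y_i | y_<i) for each response token
    step_ends: list[int],       # index of the last token of each step (inclusive)
    beta: float,                # the beta hyperparameter used during training
) -> torch.Tensor:
    """Return r_phi^t = q_phi^t - q_phi^{t-1} for every step t."""
    # q_phi^t is the cumulative sum of beta-scaled per-token log-ratios up to token t.
    q = beta * torch.cumsum(logp_policy - logp_ref, dim=0)
    # Q value at the last token of each step.
    q_step = q[torch.tensor(step_ends)]
    # The process reward of a step is the difference of consecutive Q values,
    # with q_phi^0 = 0 before any response token has been generated.
    prev = torch.cat([q_step.new_zeros(1), q_step[:-1]])
    return q_step - prev
```

Scoring a candidate response then amounts to aggregating these per-step rewards, for example by taking their minimum or their sum.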
|
The proposition is **agnostic to specific choices of the training objective of ORMs**. It can be instantiated with the same objectives as vanilla ORM training, the only difference being that \\(r_\phi \left( \mathbf{y} \right)\\) is substituted with \\(\beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}\\).

For example, DPO already meets our assumption and serves as a strong variant, while in this work, we instantiate our implicit PRM with cross entropy (CE) loss due to memory efficiency:

$$
\small \mathcal{L}_{CE} = l \cdot \log \sigma \left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) + (1 - l) \cdot \log \left[ 1 - \sigma \left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) \right]
$$
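
A minimal sketch of this objective, assuming the response-level score \\(\beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}\\) has already been summed over the response tokens and \\(l \in \{0, 1\}\\) is the outcome label (function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

def ce_loss(beta_log_ratio: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """CE objective on response-level labels.

    `beta_log_ratio` holds beta * log(pi_phi(y) / pi_ref(y)) per response and
    `labels` holds the 0/1 outcome labels. BCE-with-logits is exactly the
    negative of the log-likelihood above, so minimizing it maximizes L_CE.
    """
    return F.binary_cross_entropy_with_logits(beta_log_ratio, labels.float())
```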
|
We started the second-stage training on top of [EurusPRM-Stage1](https://huggingface.co/PRIME-RL/EurusPRM-Stage1) with fine-grained step-level labels. To obtain step-level labels, we employed Llama-3.1-70B-Inst and Qwen2.5-72B-Inst to insert nuanced errors into correct solutions. We also mixed response-level data in this stage. The model was continually trained with \\(\mathcal{L}_{CE}\\), a learning rate of 5e-7, and a batch size of 64.

## Usage