EurusPRM-Stage2 is trained using **[Implicit PRM](https://arxiv.org/abs/2412.01981)**, which obtains process rewards at no additional cost: it only requires training an ORM on the cheaper response-level labels. During inference, implicit process rewards are obtained by running a forward pass and calculating the log-likelihood ratio at each step.

<img src="./figures/implicit.png" alt="prm" style="zoom: 33%;" />
The key ingredient of Implicit PRM is the reward representation, as demonstrated below:

<aside>
✨

***Proposition**: Consider an ORM where the reward is parameterized by the log-likelihood ratio of two causal LMs, i.e. $r_\phi(\mathbf{y}):= \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}$. Define $q_\phi^t(\mathbf{y}_{<t}, y_t):= \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})}$. Then $q_\phi^t$ is the exponential average of $r_\phi$ at step $t$:*

$$
q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} e^{\frac{1}{\beta}r_\phi(\mathbf{y})}
$$
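
Subtracting successive values of $q_\phi$ (a one-line consequence of the definition above) recovers the per-step log-likelihood ratio that is computed during inference:

$$
q_\phi^t - q_\phi^{t-1} = \beta \log \frac{\pi_\phi(y_t|\mathbf{y}_{<t})}{\pi_\text{ref}(y_t|\mathbf{y}_{<t})}
$$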