Update README.md
README.md
@@ -18,7 +18,9 @@ EurusPRM-Stage2 is trained using **[Implicit PRM](https://arxiv.org/abs/2412.019
 The key ingredient of Implicit PRM is the reward representation, as demonstrated below:
 
 <aside>
-✨
+✨
+
+***Proposition***: Consider an ORM where the reward is parameterized by the log-likelihood ratio of two causal LMs, i.e. $r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}$. Define $q_\phi^t(\mathbf{y}_{<t}, y_t) := \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})}$. Then $q_\phi^t$ is the exponential average of $r_\phi$ at step $t$:
 
 $$
 q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} e^{\frac{1}{\beta}r_\phi(\mathbf{y})}
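The proposition states that the per-step process reward $q_\phi^t$ is just the running sum of token-level log-likelihood ratios between the trained model and the reference model. Below is a minimal sketch of that computation, assuming Hugging Face `transformers` checkpoints; the reference-model ID, the value of `beta`, and the toy input are placeholders, not the settings actually used for EurusPRM-Stage2.

```python
# Minimal sketch (illustrative, not the official inference code):
# q_phi^t = sum_{i<=t} beta * log( pi_phi(y_i | y_<i) / pi_ref(y_i | y_<i) )
# computed as a cumulative sum of token-level log-likelihood ratios.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

beta = 0.001  # placeholder coefficient, not the trained value
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("PRIME-RL/EurusPRM-Stage2")
prm = AutoModelForCausalLM.from_pretrained("PRIME-RL/EurusPRM-Stage2").to(device)
# Placeholder reference model; substitute the actual pi_ref checkpoint.
ref = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Math-7B-Instruct").to(device)

@torch.no_grad()
def token_logprobs(model, input_ids):
    """Log-probability of each token y_i given its prefix y_<i."""
    logits = model(input_ids).logits[:, :-1]  # prediction for tokens 1..T
    logps = torch.log_softmax(logits, dim=-1)
    return logps.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

text = "Question: 1 + 1 = ?\n\nStep 1: 1 + 1 = 2. The answer is 2."  # toy example
input_ids = tokenizer(text, return_tensors="pt").input_ids.to(device)

# Cumulative sum over tokens gives q_phi^t at every prefix length t.
q = beta * (token_logprobs(prm, input_ids) - token_logprobs(ref, input_ids)).cumsum(dim=-1)
print(q)
```

In practice, per-step scores would typically be read off at step-boundary tokens (e.g. newline separators between reasoning steps) rather than at every token position.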