yuchenFan committed
Commit f823335 · 1 Parent(s): 4ce7ab9

Update README.md

Files changed (1)
  1. README.md +3 -1
README.md CHANGED
@@ -18,7 +18,9 @@ EurusPRM-Stage2 is trained using **[Implicit PRM](https://arxiv.org/abs/2412.019
  The key ingredient of Implicit PRM is the reward representation, as demonstrated below:

  <aside>
- ***Proposition**: Consider an ORM where the reward is parameterized by the log-likelihood ratio of two causal LMs, i.e. $r_\phi(\mathbf{y}):= \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}$. Define $q_\phi^t(\mathbf{y}_{<t}, y_t):= \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})}$. $q_\phi^t$ is the exponential average of $r_\phi$ at step $t$.*
+
+
+ ***Proposition***: Consider an ORM where the reward is parameterized by the log-likelihood ratio of two causal LMs, i.e. $r_\phi(\mathbf{y}):= \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}$. Define $q_\phi^t(\mathbf{y}_{<t}, y_t):= \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})}$. $q_\phi^t$ is the exponential average of $r_\phi$ at step $t$.*

  $$
  q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} e^{\frac{1}{\beta}r_\phi(\mathbf{y})}
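
In practice, the proposition means the process reward at step $t$ can be read off from the accumulated per-token log-likelihood ratio $q_\phi^t$, with no value head or step-level labels needed at inference time. Below is a minimal sketch of that computation using two `transformers` causal LMs; the checkpoint names, the `BETA` value, and the example text are placeholders for illustration, not values taken from this README.

```python
# Minimal sketch (not the official EurusPRM inference script): compute the
# implicit process reward q_phi^t as the cumulative per-token log-likelihood
# ratio between a fine-tuned model pi_phi and a reference model pi_ref.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BETA = 0.001  # placeholder coefficient; use the value the model card specifies

def per_token_logprobs(model, input_ids):
    """log pi(y_i | y_<i) for every token after the first one."""
    with torch.no_grad():
        logits = model(input_ids).logits              # (1, T, vocab)
    logps = torch.log_softmax(logits[:, :-1], -1)     # predict token i from its prefix
    return logps.gather(-1, input_ids[:, 1:, None]).squeeze(-1)  # (1, T-1)

# Placeholder checkpoints: the PRM itself as pi_phi and its base model as pi_ref.
tokenizer = AutoTokenizer.from_pretrained("PRIME-RL/EurusPRM-Stage2")
pi_phi = AutoModelForCausalLM.from_pretrained("PRIME-RL/EurusPRM-Stage2")
pi_ref = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Math-7B-Instruct")

text = "Question: 1 + 1 = ?\n\nStep 1: 1 + 1 = 2.\n\nThe answer is 2."  # placeholder input
ids = tokenizer(text, return_tensors="pt").input_ids

# q_phi^t = sum_{i<=t} beta * (log pi_phi(y_i | y_<i) - log pi_ref(y_i | y_<i))
log_ratio = per_token_logprobs(pi_phi, ids) - per_token_logprobs(pi_ref, ids)
q = BETA * log_ratio.cumsum(dim=-1)

# The reward of a reasoning step is the change in q between the last token of
# that step and the last token of the previous step.
print(q)
```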