yuchenFan committed on
Commit
b099578
·
1 Parent(s): c9f80e2

Upload README.md

Files changed (1)
  1. README.md +1 -1
README.md CHANGED
@@ -20,7 +20,7 @@ The key ingredient of Implicit PRM is the reward representation, as demonstrated
  <aside>
 
 
- ***Proposition**: Consider an ORM where the reward is parameterized by the log-likelihood ratio of two causal LMs, i.e. $r_\phi(\mathbf{y}):= \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}$. Define $q_\phi^t(\mathbf{y}_{<t}, y_t):= \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})}$. $q_\theta^t$ is the exponential average of $r_\theta$ at step $t$.*
+ ***Proposition***: Consider an ORM where the reward is parameterized by the log-likelihood ratio of two causal LMs, i.e. $r_\phi(\mathbf{y}):= \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}$. Define $q_\phi^t(\mathbf{y}_{<t}, y_t):= \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})}$. $q_\theta^t$ is the exponential average of $r_\theta$ at step $t$.*
 
  $$
  q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} e^{\frac{1}{\beta}r_\phi(\mathbf{y})}
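
Reviewer note: the definition of $q_\phi^t$ in the changed line is a running sum of $\beta$-scaled per-token log-likelihood ratios, which may be easier to check with a small numeric sketch. The snippet below is illustrative only and is not taken from the README: the function name `implicit_q_values` and the per-token log-probabilities are hypothetical, and in practice the log-probabilities would come from the two causal LMs.

```python
from itertools import accumulate


def implicit_q_values(logp_model, logp_ref, beta):
    """Return [q^1, ..., q^T], where q^t = sum_{i<=t} beta * (logp_model[i] - logp_ref[i]).

    logp_model[i] and logp_ref[i] stand for log pi_phi(y_i | y_<i) and
    log pi_ref(y_i | y_<i). Both lists are hypothetical inputs here.
    """
    if len(logp_model) != len(logp_ref):
        raise ValueError("both models must score the same number of tokens")
    # Per-token beta-scaled log-likelihood ratios.
    ratios = [beta * (lm - lr) for lm, lr in zip(logp_model, logp_ref)]
    # Running sum over tokens gives q^t for every prefix length t.
    return list(accumulate(ratios))


# Made-up log-probabilities for a three-token response (assumption, not real model output).
logp_model = [-0.5, -1.0, -0.25]
logp_ref = [-1.0, -1.0, -0.75]
beta = 0.5

q = implicit_q_values(logp_model, logp_ref, beta)

# Sequence-level reward r_phi(y) = beta * log(pi_phi(y) / pi_ref(y)),
# using the chain rule: log pi(y) is the sum of per-token log-probabilities.
r = beta * (sum(logp_model) - sum(logp_ref))
```

With these inputs the last entry of `q` coincides with `r`, consistent with the sum in the definition running over the full response when $t$ equals the response length.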