Upload README.md
Browse files
README.md
CHANGED
@@ -20,7 +20,7 @@ The key ingredient of Implicit PRM is the reward representation, as demonstrated
|
|
20 |
<aside>
|
21 |
✨
|
22 |
|
23 |
-
***Proposition
|
24 |
|
25 |
$$
|
26 |
q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} e^{\frac{1}{\beta}r_\phi(\mathbf{y})}
|
|
|
20 |
<aside>
|
21 |
✨
|
22 |
|
23 |
+
***Proposition***: Consider an ORM where the reward is parameterized by the log-likelihood ratio of two causal LMs, i.e. $r_\phi(\mathbf{y}):= \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}$. Define $q_\phi^t(\mathbf{y}_{<t}, y_t):= \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})}$. $q_\theta^t$ is the exponential average of $r_\theta$ at step $t$.*
|
24 |
|
25 |
$$
|
26 |
q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} e^{\frac{1}{\beta}r_\phi(\mathbf{y})}
|