The key ingredient of Implicit PRM is the reward representation, as demonstrated below:

<aside>
✨

***Proposition***: Consider an ORM where the reward is parameterized by the log-likelihood ratio of two causal LMs, i.e.,

$$
r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}.
$$

Define

$$
q_\phi^t(\mathbf{y}_{<t}, y_t) := \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})}.
$$

$q_\phi^t$ is the exponential average of $r_\phi$ at step $t$:

$$
q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} \left[ e^{\frac{1}{\beta} r_\phi(\mathbf{y})} \right]
$$

Hence, **$q_\phi^t$** represents an exact expectation of the outcome reward $r_\phi$ at step $t$, i.e., the Q value.
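
A short sketch of why this identity holds (writing $\mathbf{y} = (\mathbf{y}_{\leq t}, \mathbf{y}_{>t})$ and using $\pi_\text{ref}(\mathbf{y}) = \pi_\text{ref}(\mathbf{y}_{\leq t})\,\pi_\text{ref}(\mathbf{y}_{>t}|\mathbf{y}_{\leq t})$ together with $\sum_{\mathbf{y}_{>t}} \pi_\phi(\mathbf{y}) = \pi_\phi(\mathbf{y}_{\leq t})$): the expectation marginalizes out the future tokens, leaving exactly the prefix log-ratio that defines $q_\phi^t$,

$$
\begin{aligned}
\beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} \left[ e^{\frac{1}{\beta} r_\phi(\mathbf{y})} \right]
&= \beta \log \sum_{\mathbf{y}_{>t}} \pi_\text{ref}(\mathbf{y}_{>t}|\mathbf{y}_{\leq t}) \, \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \\
&= \beta \log \frac{\sum_{\mathbf{y}_{>t}} \pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y}_{\leq t})} \\
&= \beta \log \frac{\pi_\phi(\mathbf{y}_{\leq t})}{\pi_\text{ref}(\mathbf{y}_{\leq t})}
= q_\phi^t(\mathbf{y}_{<t}, y_t).
\end{aligned}
$$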

The proposition indicates that when modeling

$$
r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}
$$

to train an ORM with the standard pipeline, where $\beta$ is a hyperparameter, $\phi$ can implicitly learn a Q function. Hence, the process reward $r_\phi^t$ can be obtained by

$$
r_\phi^t := q_\phi^t - q_\phi^{t-1} = \beta \log \frac{\pi_\phi(y_{t}|\mathbf{y}_{<t})}{\pi_\text{ref}(y_{t}|\mathbf{y}_{<t})}.
$$

Therefore, we can indeed obtain PRMs simply by collecting response-level data and training an ORM, without any burden of annotating step labels.
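
As a concrete sketch of what this buys at inference time, the per-step rewards can be read directly off the token log-probabilities of the two causal LMs. The snippet below is an illustrative sketch rather than this repository's official code: `model` stands for the trained $\pi_\phi$, `ref_model` for $\pi_\text{ref}$, `step_spans` marks which token positions belong to each step, and `beta=0.05` is an assumed value of the hyperparameter.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def implicit_process_rewards(model, ref_model, input_ids, step_spans, beta=0.05):
    """Per-step implicit rewards: r_phi^t sums beta * (log pi_phi - log pi_ref)
    over the tokens of step t. `input_ids` is a (1, seq_len) tensor holding
    prompt + response tokens; `step_spans` is a list of (start, end) indices
    into the shifted log-prob tensor (position j scores token input_ids[j + 1]).
    """
    def token_logprobs(m):
        logits = m(input_ids).logits[:, :-1]  # logits at position j predict token j + 1
        logps = F.log_softmax(logits, dim=-1)
        return logps.gather(-1, input_ids[:, 1:].unsqueeze(-1)).squeeze(-1)

    per_token = beta * (token_logprobs(model) - token_logprobs(ref_model))
    return [per_token[0, start:end].sum().item() for start, end in step_spans]
```

Summing the per-token terms over the whole response recovers the response-level reward $r_\phi(\mathbf{y})$, which is the only quantity the ORM training stage ever needs a label for.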

The proposition is **agnostic to specific choices of the training objective of ORMs**. It can be instantiated with the same objectives as vanilla ORM training, with the only difference being substituting

$$
r_\phi \left( \mathbf{y} \right)
$$

with

$$
\beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}.
$$
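
For instance, when the ORM is trained with a plain binary cross-entropy objective over response-level correctness labels (one common choice; the sketch below is illustrative under that assumption, not necessarily this repository's exact training code), the substitution simply means the reward logit fed to the loss is the summed log-ratio:

```python
import torch.nn.functional as F

def implicit_orm_ce_loss(policy_logprobs, ref_logprobs, labels, beta=0.05):
    """Cross-entropy ORM loss with r_phi(y) parameterized as the implicit reward.

    policy_logprobs, ref_logprobs: (batch, seq_len) per-token log-probs of the
        response tokens under pi_phi and pi_ref (padding positions zeroed out).
    labels: (batch,) response-level correctness labels in {0, 1}.
    beta:   reward-scale hyperparameter (assumed value).
    """
    # r_phi(y) = beta * log(pi_phi(y) / pi_ref(y)) = beta * sum_t (logp_phi - logp_ref)
    rewards = beta * (policy_logprobs - ref_logprobs).sum(dim=-1)
    # Same loss as a vanilla ORM; only the reward parameterization changes.
    return F.binary_cross_entropy_with_logits(rewards, labels.float())
```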