yuchenFan committed on
Commit 958c8e6 · 1 Parent(s): 452f3d6

Upload README.md

Files changed (1)
  1. README.md +11 -7
README.md CHANGED
@@ -20,9 +20,7 @@ The key ingredient of Implicit PRM is the reward representation, as demonstrated
  <aside>
 
 
- ***Proposition***
-
- Consider an ORM where the reward is parameterized by the log-likelihood ratio of two causal LMs, i.e.,
+ ***Proposition***: Consider an ORM where the reward is parameterized by the log-likelihood ratio of two causal LMs, i.e.,
 
  $$
  r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}.
@@ -34,13 +32,13 @@ $$
  q_\phi^t(\mathbf{y}_{<t}, y_t) := \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})}.
  $$
 
- Here,  is the exponential average of at step .
+ $q_\phi^t$ is the exponential average of $r_\phi$ at step $t$:
 
  $$
  q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} \left[ e^{\frac{1}{\beta} r_\phi(\mathbf{y})} \right]
  $$
 
- Hence, represents an exact expectation of outcome reward at step , i.e., the Q value.
+ Hence, **$q_\phi^t$** represents an exact expectation of the outcome reward $r_\phi$ at step $t$, i.e., the Q value.
 
  The proposition indicates that when modeling
 
@@ -48,7 +46,7 @@ $$
  r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}
  $$
 
- to train an ORM with the standard pipeline, where is a hyperparameter, can implicitly learn a Q function. Hence, process reward can be obtained by:
+ to train an ORM with the standard pipeline, where $\beta$ is a hyperparameter, $\phi$ can implicitly learn a Q function. Hence, the process reward $r_\phi^t$ can be obtained by:
 
  $$
  r_\phi^t := q_\phi^t - q_\phi^{t-1} = \beta \log \frac{\pi_\phi(y_{t}|\mathbf{y}_{<t})}{\pi_\text{ref}(y_{t}|\mathbf{y}_{<t})}.
@@ -56,7 +54,13 @@ $$
 
  Therefore, we can indeed obtain PRMs simply by collecting response-level data and training an ORM, without any burden of annotating step labels.
 
- The proposition is agnostic to specific choices of the training objective of ORMs. It can be instantiated with different objectives as vanilla ORM training, with the only difference being substituting the  with
+ The proposition is **agnostic to specific choices of the training objective of ORMs**. It can be instantiated with different objectives, just as in vanilla ORM training, with the only difference being substituting
+
+ $$
+ r_\phi \left( \mathbf{y} \right)
+ $$
+
+ with
 
  $$
  \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}.
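
To make the token-level formula in the diff concrete, here is a minimal sketch of how the implicit process reward $r_\phi^t = \beta \log \frac{\pi_\phi(y_t|\mathbf{y}_{<t})}{\pi_\text{ref}(y_t|\mathbf{y}_{<t})}$ could be computed from two causal LMs with Hugging Face `transformers`. It only illustrates the math above and is not code from this repository: the `gpt2` checkpoints, the `BETA` value, and the `token_log_probs` helper are placeholder assumptions.

```python
# Minimal sketch (illustrative, not from this repo): per-token implicit process reward
#   r_phi^t = beta * [ log pi_phi(y_t | y_<t) - log pi_ref(y_t | y_<t) ]
# computed with two causal LMs loaded via Hugging Face transformers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

BETA = 0.05  # hypothetical value; should match the beta used when training the implicit PRM

tokenizer = AutoTokenizer.from_pretrained("gpt2")              # placeholder checkpoint
pi_phi = AutoModelForCausalLM.from_pretrained("gpt2").eval()   # stands in for the implicit PRM
pi_ref = AutoModelForCausalLM.from_pretrained("gpt2").eval()   # stands in for the reference model


def token_log_probs(model, input_ids):
    """Log-probability of each observed token given its prefix, shape (1, T-1)."""
    with torch.no_grad():
        logits = model(input_ids).logits                      # (1, T, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)     # positions predicting tokens 1..T-1
    targets = input_ids[:, 1:].unsqueeze(-1)                  # tokens actually observed
    return log_probs.gather(-1, targets).squeeze(-1)


text = "Question: What is 2 + 2? Answer: 2 + 2 = 4."
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Per-token process reward: beta * log-likelihood ratio of the two LMs.
r_t = BETA * (token_log_probs(pi_phi, input_ids) - token_log_probs(pi_ref, input_ids))

# Cumulative sums give q_phi^t; the last entry equals the response-level reward r_phi(y).
q_t = r_t.cumsum(dim=-1)
print("per-token rewards:", r_t)
print("response-level reward:", q_t[0, -1].item())
```

In practice one would load the trained implicit PRM as $\pi_\phi$ and its reference model as $\pi_\text{ref}$, compute the ratio only over response tokens (conditioning on the prompt), and sum `r_t` within each reasoning step to obtain a step-level score, as the proposition suggests.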