yuchenFan committed on
Commit be3201e · 1 Parent(s): 94c55ff

Upload README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -38,7 +38,7 @@ $$
 q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} \left[ e^{\frac{1}{\beta} r_\phi(\mathbf{y})} \right]
 $$

- Hence, **\\(q_\theta^t\\)**represents an exact expectation of outcome reward **\\(r_\theta\\)** at step \\(t\\), i.e., the Q value.
+ Hence, **\\(q_\phi^t\\)** represents the exact expectation of the outcome reward **\\(r_\phi\\)** at step \\(t\\), i.e., the Q value.

 The proposition indicates that when modeling

@@ -46,7 +46,7 @@ $$
 r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}
 $$

- to train an ORM with the standard pipeline, where \\(\beta\\) is a hyperparameter, \\(\phi$\\) can implicitly learn a Q function. Hence, process reward \\(r_\phi^t\\) can be obtained by:
+ to train an ORM with the standard pipeline, where \\(\beta\\) is a hyperparameter, \\(\phi\\) can implicitly learn a Q function. Hence, the process reward \\(r_\phi^t\\) can be obtained by:

 $$
 r_\phi^t := q_\phi^t - q_\phi^{t-1} = \beta \log \frac{\pi_\phi(y_{t}|\mathbf{y}_{<t})}{\pi_\text{ref}(y_{t}|\mathbf{y}_{<t})}.
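The final identity makes the process reward computable from token log-probabilities alone. As a minimal sketch (not part of this commit: the function name `implicit_process_rewards`, the tensor shapes, and the \\(\beta\\) value are illustrative assumptions), per-token rewards can be obtained from the log-probs of the trained model \\(\pi_\phi\\) and the reference model \\(\pi_\text{ref}\\):

```python
# Minimal sketch (assumption, not from this repo) of the implicit process
# reward r_phi^t = beta * log[ pi_phi(y_t | y_<t) / pi_ref(y_t | y_<t) ].
import torch

def implicit_process_rewards(
    logp_phi: torch.Tensor,  # (seq_len,) log pi_phi(y_t | y_<t) per token
    logp_ref: torch.Tensor,  # (seq_len,) log pi_ref(y_t | y_<t) per token
    beta: float = 0.05,      # the beta hyperparameter; 0.05 is illustrative
) -> torch.Tensor:
    # The telescoping difference q_phi^t - q_phi^{t-1} reduces to a
    # per-token log-ratio between the two policies.
    return beta * (logp_phi - logp_ref)

# Usage: cumulative sums recover q_phi^t up to the constant q_phi^0, and the
# total sum telescopes to the outcome reward r_phi(y) up to that same constant.
rewards = implicit_process_rewards(torch.randn(8), torch.randn(8))
q_values = rewards.cumsum(dim=0)
```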