Upload README.md
README.md
CHANGED
@@ -29,7 +29,7 @@ $$
 Define

 $$
-q_\phi^t(\mathbf{y}_{<t}, y_t) := \
+q_\phi^t(\mathbf{y}_{<t}, y_t) := \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})}.
 $$

 Then \\(q_\phi^t\\) is the exponential average of \\(r_\phi\\) at step \\(t\\).
@@ -38,7 +38,7 @@ $$
 q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} \left[ e^{\frac{1}{\beta} r_\phi(\mathbf{y})} \right]
 $$

-Hence, \\(
+Hence, \\(q_\phi^t\\) represents an exact expectation of outcome reward \\(r_\phi\\) at step \\(t\\), i.e., the Q value.

 The proposition indicates that when modeling

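The definition added in the first hunk can be illustrated with a minimal Python sketch. It assumes per-token log-probabilities under \\(\pi_\phi\\) and \\(\pi_\text{ref}\\) are already available as plain lists; the function name `implicit_q_values` and the example probabilities are hypothetical and not taken from the README.

```python
import math
from itertools import accumulate


def implicit_q_values(logp_policy, logp_ref, beta):
    """Return [q^1, ..., q^T], where q^t is the running sum over i <= t of
    beta * (log pi_phi(y_i | y_<i) - log pi_ref(y_i | y_<i)).

    The argument names are illustrative assumptions: each list holds one
    log-probability per response token.
    """
    if len(logp_policy) != len(logp_ref):
        raise ValueError("log-prob lists must have equal length")
    # Per-token beta-scaled log-likelihood ratio.
    per_token = [beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    # Cumulative sum gives q^t for every prefix length t.
    return list(accumulate(per_token))


# Hypothetical probabilities for a 3-token response.
logp_policy = [math.log(0.5), math.log(0.4), math.log(0.9)]
logp_ref = [math.log(0.25), math.log(0.8), math.log(0.9)]
q = implicit_q_values(logp_policy, logp_ref, beta=0.5)
```

Under the assumption that \\(r_\phi(\mathbf{y})\\) is the \\(\beta\\)-scaled log-likelihood ratio of the full response, the last entry of `q` coincides with that outcome reward, and the earlier entries are the per-step values described in the proposition.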