yuchenFan committed
Commit 94c55ff · 1 Parent(s): 14ce3ee

Upload README.md

Files changed (1)
  1. README.md +6 -16
README.md CHANGED
@@ -32,13 +32,13 @@ $$
32
  q_\phi^t(\mathbf{y}_{<t}, y_t) := \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})}.
33
  $$
34
 
35
 - is the exponential average of $r_\phi$ at step $t$:
36
 
37
  $$
38
  q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} \left[ e^{\frac{1}{\beta} r_\phi(\mathbf{y})} \right]
39
  $$
40
 
41
 - Hence, **$q_\phi^t$** represents an exact expectation of the outcome reward $r_\phi$ at step $t$, i.e., the Q value.
42
 
43
  The proposition indicates that when modeling
44
 
@@ -46,25 +46,15 @@ $$
46
  r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}
47
  $$
48
 
49
 - to train an ORM with the standard pipeline, where $\beta$ is a hyperparameter, $\phi$ implicitly learns a Q function. Hence, the process reward $r_\phi^t$ can be obtained by taking the difference of Q values at consecutive steps:
50
 
51
  $$
52
- r_\phi^t := q_\phi^t - q_\phi^{t-1} = \beta \log \frac{\pi_\phi(y_{t}|\mathbf{y}{<t})}{\pi_\text{ref}(y_{t}|\mathbf{y}_{<t})}.
53
  $$
54
 
55
  Therefore, we can indeed obtain PRMs simply by collecting response-level data and training an ORM, without any burden of annotating step labels.
56
 
57
 - The proposition is **agnostic to specific choices of the training objective of ORMs**. It can be instantiated with the same objectives as vanilla ORM training, with the only difference being substituting the
58
-
59
- $$
60
- r_\phi \left( \mathbf{y} \right)
61
- $$
62
-
63
- with
64
-
65
- $$
66
- \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}.
67
- $$
68
 
69
  For example, DPO already meets our assumption and serves as a strong variant, while in this work we instantiate our implicit PRM with cross-entropy (CE) loss for its memory efficiency:
70
 
@@ -72,7 +62,7 @@ $$
72
  \small \mathcal{L}_{CE} = l \cdot \log \sigma \left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) + (1 - l) \cdot \log \left[ 1 - \sigma \left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) \right]
73
  $$
74
 
75
 - We started the second-stage training on top of [EurusPRM-Stage1](https://huggingface.co/PRIME-RL/EurusPRM-Stage1) with fine-grained step-level labels. To obtain step-level labels, we employed Llama-3.1-70B-Inst and Qwen2.5-72B-Inst to insert nuanced errors into correct solutions. We also mixed response-level data in this stage. The model was continually trained with $\mathcal{L}_{CE}$ using a learning rate of 5e-7 and a batch size of 64.
76
 
77
  ## Usage
78
 
 
32
  q_\phi^t(\mathbf{y}_{<t}, y_t) := \sum_{i=1}^{t} \beta \log \frac{\pi_\phi(y_{i}|\mathbf{y}_{<i})}{\pi_\text{ref}(y_{i}|\mathbf{y}_{<i})}.
33
  $$
34
 
35
 + is the exponential average of \\(r_\phi\\) at step \\(t\\):
36
 
37
  $$
38
  q_\phi^t(\mathbf{y}_{<t}, y_t) = \beta \log \mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} \left[ e^{\frac{1}{\beta} r_\phi(\mathbf{y})} \right]
39
  $$
40
 
41
 + Hence, **\\(q_\phi^t\\)** represents an exact expectation of the outcome reward \\(r_\phi\\) at step \\(t\\), i.e., the Q value.
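
For intuition, the identity above follows directly from the definition of \\(r_\phi\\); a brief sketch, using only the definitions in this section and leaving the prompt implicit in the conditioning:

$$
\mathbb{E}_{\pi_\text{ref}(\mathbf{y}|\mathbf{y}_{\leq t})} \left[ e^{\frac{1}{\beta} r_\phi(\mathbf{y})} \right] = \mathbb{E}_{\pi_\text{ref}(\mathbf{y}_{>t}|\mathbf{y}_{\leq t})} \left[ \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right] = \frac{\pi_\phi(\mathbf{y}_{\leq t})}{\pi_\text{ref}(\mathbf{y}_{\leq t})} \sum_{\mathbf{y}_{>t}} \pi_\phi(\mathbf{y}_{>t}|\mathbf{y}_{\leq t}) = e^{\frac{1}{\beta} q_\phi^t},
$$

since the \\(\pi_\text{ref}\\) terms of the continuation cancel and the remaining sum over continuations of \\(\pi_\phi\\) equals 1; taking \\(\beta \log\\) of both sides recovers \\(q_\phi^t\\).
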
42
 
43
  The proposition indicates that when modeling
44
 
 
46
  r_\phi(\mathbf{y}) := \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}
47
  $$
48
 
49
 + to train an ORM with the standard pipeline, where \\(\beta\\) is a hyperparameter, \\(\phi\\) implicitly learns a Q function. Hence, the process reward \\(r_\phi^t\\) can be obtained by taking the difference of Q values at consecutive steps:
50
 
51
  $$
52
+ r_\phi^t := q_\phi^t - q_\phi^{t-1} = \beta \log \frac{\pi_\phi(y_{t}|\mathbf{y}_{<t})}{\pi_\text{ref}(y_{t}|\mathbf{y}_{<t})}.
53
  $$
54
 
55
  Therefore, we can indeed obtain PRMs simply by collecting response-level data and training an ORM, without any burden of annotating step labels.
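
To make this concrete, below is a minimal sketch (not the released scoring code) of how per-step implicit process rewards could be computed from per-token log-probabilities of \\(\pi_\phi\\) and \\(\pi_\text{ref}\\); the function name, the `step_ends` segmentation, and how \\(\beta\\) is chosen are assumptions of this example.

```python
import torch

def implicit_process_rewards(policy_logprobs: torch.Tensor,
                             ref_logprobs: torch.Tensor,
                             step_ends: list[int],
                             beta: float) -> list[float]:
    """Illustrative sketch: per-step implicit process rewards.

    policy_logprobs / ref_logprobs: 1-D tensors of log pi(y_i | y_<i) for each
    response token, under pi_phi and pi_ref (same tokenization and length).
    step_ends: index of the last token of each reasoning step (e.g., at newlines).
    beta: the hyperparameter from the reward parameterization above.
    """
    # Per-token increment of q_phi^t: beta * log(pi_phi / pi_ref) for token i.
    token_level = beta * (policy_logprobs - ref_logprobs)

    # q_phi^t is the cumulative sum of the increments up to token t.
    q = torch.cumsum(token_level, dim=0)

    # r_phi^t = q_phi^t - q_phi^{t-1}, evaluated at the last token of each step.
    rewards, prev = [], 0.0
    for end in step_ends:
        rewards.append(q[end].item() - prev)
        prev = q[end].item()
    return rewards
```

In practice, the two log-probability tensors would come from one forward pass each of the trained PRM and the frozen reference model on the same prompt-response pair.
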
56
 
57
 + The proposition is **agnostic to specific choices of the training objective of ORMs**. It can be instantiated with the same objectives as vanilla ORM training, with the only difference being substituting \\(r_\phi \left( \mathbf{y} \right)\\) with \\(\beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})}\\).
58
 
59
  For example, DPO already meets our assumption and serves as a strong variant, while in this work we instantiate our implicit PRM with cross-entropy (CE) loss for its memory efficiency:
60
 
 
62
  \small \mathcal{L}_{CE} = l \cdot \log \sigma \left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) + (1 - l) \cdot \log \left[ 1 - \sigma \left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) \right]
63
  $$
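
For concreteness, here is a minimal sketch of this objective on a batch of response-level examples, written as a standard binary cross-entropy to minimize (i.e., the negative of the expression above); the sequence-level log-probabilities are assumed to be precomputed, and all names are illustrative rather than taken from the released training code.

```python
import torch
import torch.nn.functional as F

def implicit_prm_ce_loss(policy_logprob_sum: torch.Tensor,
                         ref_logprob_sum: torch.Tensor,
                         labels: torch.Tensor,
                         beta: float) -> torch.Tensor:
    """Illustrative CE loss on the implicit outcome reward.

    policy_logprob_sum / ref_logprob_sum: shape (batch,), log pi(y) summed over
    response tokens under pi_phi and the frozen reference model pi_ref.
    labels: response-level correctness labels l in {0, 1}, shape (batch,).
    """
    # Implicit outcome reward r_phi(y) = beta * log(pi_phi(y) / pi_ref(y)).
    reward = beta * (policy_logprob_sum - ref_logprob_sum)
    # BCE-with-logits computes -[l*log(sigmoid(r)) + (1-l)*log(1-sigmoid(r))].
    return F.binary_cross_entropy_with_logits(reward, labels.float())
```
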
64
 
65
 + We started the second-stage training on top of [EurusPRM-Stage1](https://huggingface.co/PRIME-RL/EurusPRM-Stage1) with fine-grained step-level labels. To obtain step-level labels, we employed Llama-3.1-70B-Inst and Qwen2.5-72B-Inst to insert nuanced errors into correct solutions. We also mixed response-level data in this stage. The model was continually trained with \\(\mathcal{L}_{CE}\\) using a learning rate of 5e-7 and a batch size of 64.
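
Summarizing the stated setup as a hypothetical configuration sketch; only the starting checkpoint, objective, learning rate, batch size, and data mixture come from the description above, and the field names themselves are illustrative.

```python
# Hypothetical Stage-2 continual-training configuration (field names illustrative).
stage2_config = {
    "init_from": "PRIME-RL/EurusPRM-Stage1",       # continue from the Stage-1 PRM
    "objective": "cross_entropy_implicit_reward",  # the CE loss defined above
    "learning_rate": 5e-7,
    "batch_size": 64,
    "data": ["step_level_labels", "response_level_labels"],  # mixed in this stage
}
```
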
66
 
67
  ## Usage
68