Update README.md
README.md
@@ -59,7 +59,7 @@ The proposition is **agnostic to specific choices of the training objective of O
 For example, DPO already meets our assumption and serves as a strong variant, while in this work, we instantiate our implicit PRM with cross entropy (CE) loss due to memory efficiency:
 
 $$
-\small \mathcal{L}_{CE} = l \cdot \log \sigma \left( \beta \log \frac{\
+\small \mathcal{L}_{CE} = l \cdot \log \sigma \left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) + (1 - l) \cdot \log \left[ 1 - \sigma \left( \beta \log \frac{\pi_\phi(\mathbf{y})}{\pi_\text{ref}(\mathbf{y})} \right) \right]
 $$
 
 We applied \\(\mathcal{L}_{CE}\\) to train the implicit PRM, using a learning rate of 5e-7 and a batch size of 64.
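For readers implementing this objective, the snippet below is a minimal PyTorch sketch of the CE loss above, not the repository's training code: the function name, the default `beta`, and the assumption that \\(\log \pi(\mathbf{y})\\) is the sum of token log-probabilities over the whole response are illustrative choices.

```python
import torch
import torch.nn.functional as F

def implicit_prm_ce_loss(policy_logprob_sum: torch.Tensor,
                         ref_logprob_sum: torch.Tensor,
                         label: torch.Tensor,
                         beta: float = 0.05) -> torch.Tensor:
    """Binary cross-entropy on the implicit reward of a full response y.

    policy_logprob_sum, ref_logprob_sum: summed token log-probabilities
        log pi_phi(y) and log pi_ref(y), one entry per response in the batch.
    label: outcome label l in {0, 1} for each response.
    beta: illustrative default; use whatever coefficient the setup prescribes.
    """
    # Implicit reward: beta * log( pi_phi(y) / pi_ref(y) )
    implicit_reward = beta * (policy_logprob_sum - ref_logprob_sum)
    # BCE-with-logits is the negative of the log-likelihood written in the
    # formula above, so minimizing this loss maximizes that expression.
    return F.binary_cross_entropy_with_logits(implicit_reward, label.float())
```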