hendrydong committed
Commit 8a610d8
1 Parent(s): 8efa05a

Update README.md

Files changed (1)
1. README.md +2 -4
README.md CHANGED
@@ -2,13 +2,11 @@
 This reward function can be used for RLHF, including PPO, iterative SFT, iterative DPO.
 
 ## Training
-The base model is meta-llama/Meta-Llama-3-8B-Instruct.
+The base model is `meta-llama/Meta-Llama-3-8B-Instruct`.
 
 We use the training script at `https://github.com/WeiXiongUST/RLHF-Reward-Modeling`.
 
 
-We train the model for one epoch with a learning rate of 2e-6, batch size 512, cosine learning rate decay with a warmup ratio 0.03.
-
 ## Uses
 
 ```python
@@ -45,7 +43,7 @@ We train the model for one epoch with a learning rate of 2e-6, batch size 512, c
 ## Results
 
 
-This Reward model is the SOTA open-source RM (Apr 20, 2024).
+This Reward model is the SOTA open-source RM (Apr 20, 2024) on Reward-Bench.
 
 | Metric | Score |
 |--------------|--------|
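
The training recipe referenced in this diff (one epoch, learning rate 2e-6, global batch size 512, cosine learning rate decay, warmup ratio 0.03) corresponds to a fairly standard Hugging Face configuration. Below is a minimal sketch of those hyperparameters expressed as `TrainingArguments`; the output path, per-device batch size, accumulation split, and bf16 setting are assumptions, not values taken from the linked RLHF-Reward-Modeling script.

```python
# Hypothetical sketch of the hyperparameters described in the README diff,
# expressed as Hugging Face TrainingArguments. The actual training script in
# https://github.com/WeiXiongUST/RLHF-Reward-Modeling may name or split these
# options differently.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./rm-llama3-8b",        # hypothetical output path
    num_train_epochs=1,                 # one epoch
    learning_rate=2e-6,                 # learning rate from the README
    per_device_train_batch_size=8,      # assumed per-device split
    gradient_accumulation_steps=8,      # e.g. 8 GPUs x 8 x 8 = 512 global (assumption)
    lr_scheduler_type="cosine",         # cosine learning rate decay
    warmup_ratio=0.03,                  # warmup ratio 0.03
    bf16=True,                          # common choice for Llama-3-8B; an assumption
)
```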