Commit 8a610d8 by hendrydong (parent: 8efa05a): Update README.md

This reward model can be used for RLHF, including PPO, iterative SFT, and iterative DPO.
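One common way to plug a reward model into these pipelines is to rank sampled responses and keep the extremes as a preference pair. The sketch below is illustrative only, not something this repository prescribes; `score_fn` is a hypothetical helper (for example, the model queried as in the Uses section below).

```python
# Hypothetical helper: score_fn(prompt, response) -> float scalar reward.
def build_dpo_pair(prompt, candidates, score_fn):
    """Rank sampled responses by reward; return (chosen, rejected) for
    iterative DPO. The top response alone also serves best-of-n selection."""
    ranked = sorted(candidates, key=lambda resp: score_fn(prompt, resp))
    return ranked[-1], ranked[0]  # highest-reward vs. lowest-reward response
```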
## Training

The base model is `meta-llama/Meta-Llama-3-8B-Instruct`.

We use the training script at `https://github.com/WeiXiongUST/RLHF-Reward-Modeling`.

We train the model for one epoch with a learning rate of 2e-6, a batch size of 512, and cosine learning-rate decay with a warmup ratio of 0.03.
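Reward models in this line of work are typically trained with a pairwise Bradley-Terry objective on (chosen, rejected) response pairs. The snippet below is a minimal PyTorch sketch of that loss for illustration, not the repository's actual training code.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected),
    averaged over the batch. Inputs are (batch,) scalar rewards."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of scalar rewards from the model's value head:
chosen = torch.tensor([1.2, 0.3, 2.1])
rejected = torch.tensor([0.4, 0.9, 1.0])
loss = bradley_terry_loss(chosen, rejected)  # shrinks as chosen outscores rejected
```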
## Uses
```python
# ...
```
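The usage snippet above is elided in this view. As a stand-in, here is a minimal sketch of how a sequence-classification reward model is commonly queried with `transformers`; the model id is a placeholder, and the exact prompt formatting may differ from the original snippet.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "path/to/this-reward-model"  # placeholder, not the actual repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
)

messages = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]
# Score the whole conversation using the Llama-3-Instruct chat template.
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt")
with torch.no_grad():
    reward = model(input_ids).logits[0].item()  # scalar reward for the response
print(reward)
```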
## Results

This reward model is the state-of-the-art open-source reward model on Reward-Bench (as of Apr 20, 2024).

| Metric | Score |
|--------|-------|