hendrydong committed
Commit 1c58e48
Parent: 096b3be

Update README.md

Files changed (1):
README.md (+3 -2)
README.md CHANGED
@@ -6,8 +6,6 @@ The base model is `meta-llama/Meta-Llama-3-8B-Instruct`.
 
 We use the training script at `https://github.com/WeiXiongUST/RLHF-Reward-Modeling`.
 
-You can also refer to a short blog for RM training details: https://www.notion.so/Reward-Modeling-for-RLHF-abe03f9afdac42b9a5bee746844518d0.
-
 
 ## Uses
 
@@ -54,6 +52,9 @@ This Reward model is the SOTA open-source RM (Apr 20, 2024) on Reward-Bench.
 | Safety | 88.76 |
 | Reasoning | 88.3 |
 
+## See also
+
+You can also refer to our short blog for RM training details: https://www.notion.so/Reward-Modeling-for-RLHF-abe03f9afdac42b9a5bee746844518d0.
 
 ## Reference
 The repo was part of the iterative rejection sampling fine-tuning and iterative DPO pipeline. If you find the content of this repo useful in your work, please consider citing it as follows:
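
Beyond the diff itself, the README describes a reward model trained on top of `meta-llama/Meta-Llama-3-8B-Instruct` with the `RLHF-Reward-Modeling` scripts. A minimal scoring sketch follows, assuming the checkpoint exposes a single-logit sequence-classification head (a common setup for such reward models) and using a hypothetical repo id, since this diff does not name the model repository:

```python
# Minimal sketch: score a chat response with a Llama-3-based reward model.
# Assumptions: the checkpoint loads via AutoModelForSequenceClassification
# with one scalar logit, and the tokenizer ships a Llama-3 chat template.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "your-org/your-llama3-reward-model"  # hypothetical placeholder id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

chat = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
]

# Render the conversation with the chat template, then read the scalar reward.
input_ids = tokenizer.apply_chat_template(chat, return_tensors="pt").to(model.device)
with torch.no_grad():
    reward = model(input_ids=input_ids).logits[0][0].item()
print(f"reward: {reward:.4f}")
```

Under these assumptions, higher scores indicate preferred responses; in an iterative rejection-sampling or DPO loop, candidate responses to the same prompt would be ranked by this scalar.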