hendrydong committed on
Commit 8efa05a
1 Parent(s): 1b2d0b0

Update README.md

Files changed (1)
  1. README.md +3 -0
README.md CHANGED
@@ -1,3 +1,6 @@
+
+ This reward function can be used for RLHF, including PPO, iterative SFT, iterative DPO.
+
  ## Training
  The base model is meta-llama/Meta-Llama-3-8B-Instruct.
 
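
As a rough illustration of the README line added in this commit, the sketch below shows one way a scalar reward function could drive the data-selection step of iterative DPO: score several sampled responses for a prompt and keep the highest- and lowest-scoring ones as a (chosen, rejected) pair. This is a minimal sketch under stated assumptions, not the author's pipeline. `toy_reward` is a hypothetical stand-in for the actual reward model (its loading and scoring code is not part of this commit), and `build_preference_pair` is a name introduced here for illustration only.

```python
from typing import Callable, List, Tuple

# Signature assumed for illustration: (prompt, response) -> scalar score.
RewardFn = Callable[[str, str], float]


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical placeholder reward: number of distinct words in the response.

    A real setup would call the trained reward model here instead.
    """
    return float(len(set(response.lower().split())))


def build_preference_pair(
    prompt: str, responses: List[str], reward_fn: RewardFn
) -> Tuple[str, str]:
    """Return (chosen, rejected): the highest- and lowest-reward responses."""
    if len(responses) < 2:
        raise ValueError("need at least two responses to form a pair")
    # Sort ascending by reward; the last element is best, the first is worst.
    ranked = sorted(responses, key=lambda r: reward_fn(prompt, r))
    return ranked[-1], ranked[0]


if __name__ == "__main__":
    candidates = [
        "Yes.",
        "Yes, it does.",
        "Yes, it does, because the base model was instruction tuned.",
    ]
    chosen, rejected = build_preference_pair("Does it?", candidates, toy_reward)
    print("chosen:  ", chosen)
    print("rejected:", rejected)
```

The same scoring call could serve the other settings the README mentions: in iterative SFT only the top-ranked response would be kept for fine-tuning, while in PPO the scalar would be used directly as the reward signal for each sampled response.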