SFT Models
We train a series of SFT models on the high-quality SFT dataset of RLHFlow for research purposes.
This is the SFT checkpoint used for the project RLHFlow/Online-RLHF.
The model is trained from meta-llama/Meta-Llama-3-8B on RLHFlow/RLHFlow-SFT-Dataset-ver2 for 2 epochs. We use a global batch size of 128 and a learning rate of 2e-5, where we pack the samples and split them into chunks of 8192 tokens. See more training details at https://github.com/RLHFlow/Online-RLHF/blob/main/sft/llama3-8b-it.yaml.
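As a rough illustration of the packing step described above, the sketch below concatenates tokenized samples and splits them into fixed 8192-token chunks. The function name and the drop-the-remainder behavior are illustrative assumptions, not the exact code from the linked training config.

```python
from itertools import chain

def pack_into_chunks(tokenized_samples, chunk_len=8192):
    """Concatenate tokenized samples, then split into fixed-length chunks.

    `tokenized_samples` is a list of token-id lists. The trailing
    remainder that does not fill a full chunk is dropped here; whether
    the actual training script drops or pads it is an assumption.
    """
    flat = list(chain.from_iterable(tokenized_samples))
    usable = (len(flat) // chunk_len) * chunk_len
    return [flat[i:i + chunk_len] for i in range(0, usable, chunk_len)]
```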
We use the ToRA script to evaluate GSM8K and MATH, EvalPlus for HumanEval, and lm-evaluation-harness for the other benchmarks. The model is evaluated in a zero-shot setting.
| Model | Size | Method | LC AlpacaEval | MT-Bench | GSM-8K | MATH | MMLU | HumanEval | TruthfulQA | ARC |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-3-8B-it | 8B | RS+DPO+PPO | 22.9 | 8.16 | 79.6 | 26.3 | 66.0 | 61.6 | 43.9 | 59.5 |
| RLHFlow/LLaMA3-SFT | 8B | SFT | 10.2 | 7.69 | 74.2 | 30.0 | 64.6 | 63.4 | 53.5 | 58.6 |
| RLHFlow/LLaMA3-SFT-v2 | 8B | SFT | 12.66 | - | 83.4 | 41.1 | 64.8 | 66.5 | 53.9 | 60.0 |
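For the lm-evaluation-harness numbers, a zero-shot run can be reproduced roughly as sketched below. The exact harness version, task names, and model-loading arguments used for the table are not specified above, so treat them as assumptions.

```python
import lm_eval

# Zero-shot evaluation of the SFT checkpoint with lm-evaluation-harness.
# Task names and dtype are assumptions; GSM8K/MATH and HumanEval were
# scored with the ToRA script and EvalPlus instead, as noted above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=RLHFlow/LLaMA3-SFT-v2,dtype=bfloat16",
    tasks=["mmlu", "truthfulqa_mc2", "arc_challenge"],
    num_fewshot=0,
)
print(results["results"])
```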
Please cite our technical report if you find our model useful for your research or product.
```bibtex
@misc{dong2024rlhf,
    title={RLHF Workflow: From Reward Modeling to Online RLHF},
    author={Hanze Dong and Wei Xiong and Bo Pang and Haoxiang Wang and Han Zhao and Yingbo Zhou and Nan Jiang and Doyen Sahoo and Caiming Xiong and Tong Zhang},
    year={2024},
    eprint={2405.07863},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```