SFT Models
We train a series of SFT models on the high-quality SFT dataset of RLHFlow for research purposes.
This is the SFT checkpoint used for the project RLHFlow/Online-RLHF.
The model is trained from meta-llama/Meta-Llama-3-8B on RLHFlow/RLHFlow-SFT-Dataset-ver2 for 2 epochs. We use a global batch size of 128 and a learning rate of 2e-5, where we pack the samples and split them into chunks of 8192 tokens. See more training details at https://github.com/RLHFlow/Online-RLHF/blob/main/sft/llama3-8b-it.yaml.
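As a rough illustration of the packing step described above, the sketch below concatenates tokenized samples and splits them into fixed 8192-token chunks. The function name and the drop-the-remainder behavior are illustrative assumptions, not the exact code from the linked training config.

```python
from itertools import chain

def pack_into_chunks(tokenized_samples, chunk_len=8192):
    """Concatenate tokenized samples, then split into fixed-length chunks.

    `tokenized_samples` is a list of token-id lists. The trailing
    remainder that does not fill a full chunk is dropped here; whether
    the actual training script drops or pads it is an assumption.
    """
    flat = list(chain.from_iterable(tokenized_samples))
    usable = (len(flat) // chunk_len) * chunk_len
    return [flat[i:i + chunk_len] for i in range(0, usable, chunk_len)]
```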
We use the ToRA script to evaluate GSM8K and MATH, EvalPlus for HumanEval, and lm-evaluation-harness for the other benchmarks. The model is evaluated in a zero-shot setting.
| Model | Size | Method | LC AlpacaEval | MT-Bench | GSM-8K | MATH | MMLU | HumanEval | TruthfulQA | ARC |
|---|---|---|---|---|---|---|---|---|---|---|
| LLaMA-3-8B-it | 8B | RS+DPO+PPO | 22.9 | 8.16 | 79.6 | 26.3 | 66.0 | 61.6 | 43.9 | 59.5 |
| RLHFlow/LLaMA3-SFT | 8B | SFT | 10.2 | 7.69 | 74.2 | 30.0 | 64.6 | 63.4 | 53.5 | 58.6 |
| RLHFlow/LLaMA3-SFT-v2 | 8B | SFT | 12.66 | - | 83.4 | 41.1 | 64.8 | 66.5 | 53.9 | 60.0 |
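For the lm-evaluation-harness numbers, a zero-shot run can be reproduced roughly as sketched below. The exact harness version, task names, and model-loading arguments used for the table are not specified above, so treat them as assumptions.

```python
import lm_eval

# Zero-shot evaluation of the SFT checkpoint with lm-evaluation-harness.
# Task names and dtype are assumptions; GSM8K/MATH and HumanEval were
# scored with the ToRA script and EvalPlus instead, as noted above.
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=RLHFlow/LLaMA3-SFT-v2,dtype=bfloat16",
    tasks=["mmlu", "truthfulqa_mc2", "arc_challenge"],
    num_fewshot=0,
)
print(results["results"])
```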
Please cite our technical report if you find our model useful for your research or product.
```bibtex
@misc{dong2024rlhf,
    title={RLHF Workflow: From Reward Modeling to Online RLHF},
    author={Hanze Dong and Wei Xiong and Bo Pang and Haoxiang Wang and Han Zhao and Yingbo Zhou and Nan Jiang and Doyen Sahoo and Caiming Xiong and Tong Zhang},
    year={2024},
    eprint={2405.07863},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```