Weak-to-Strong Collection: weak-to-strong trained models (2 items).
This model was trained with WSPO (Weak-to-Strong Preference Optimization) from DeepSeek-R1-Qwen-Distill-7B on the Openthought-220 long-thought dataset, with the help of DeepSeek-R1-Qwen-Distill-1.5B and DeepScaleR-1.5B-Preview.
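As a rough illustration of the weak-to-strong idea named in the citation below ("stealing reward" from a weak aligned model), here is a toy sketch: the log-probability shift that alignment induced in the weak model pair is used as an implicit reward margin, and a DPO-style logistic loss pushes the strong model's own margin toward it. This is a minimal sketch under my own assumptions; the function name is mine, and the exact WSPO objective in the paper may differ.

```python
import math

def wspo_loss(strong_logp_w, strong_logp_l,   # strong policy log-probs (chosen, rejected)
              ref_logp_w, ref_logp_l,         # strong reference model log-probs
              weak_aligned_logp_w, weak_aligned_logp_l,  # weak *aligned* model log-probs
              weak_logp_w, weak_logp_l,       # weak *unaligned* model log-probs
              beta=0.1):
    """Toy DPO-style loss using the weak pair's alignment shift as the target margin."""
    # Implicit reward margin "stolen" from the weak pair: how much alignment
    # shifted the weak model's preference between chosen and rejected.
    weak_margin = ((weak_aligned_logp_w - weak_logp_w)
                   - (weak_aligned_logp_l - weak_logp_l))
    # The strong model's own log-ratio margin against its reference.
    strong_margin = ((strong_logp_w - ref_logp_w)
                     - (strong_logp_l - ref_logp_l))
    # Logistic (-log sigmoid) loss pushing the strong margin past the weak one.
    logits = beta * (strong_margin - weak_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the strong model's margin exactly matches the weak pair's margin, the loss sits at log 2; it decreases as the strong model prefers the chosen response more strongly than the weak shift implies.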
@inproceedings{zhu2025weaktostrong,
  title={Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model},
  author={Wenhong Zhu and Zhiwei He and Xiaofeng Wang and Pengfei Liu and Rui Wang},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=f7KxfUrRSb}
}