Weak-to-Strong Collection: weak-to-strong trained models (2 items).
This model was trained with WSPO (Weak-to-Strong Preference Optimization) from DeepSeek-R1-Qwen-Distill-7B on the Openthought-220 long-thought dataset, with the help of DeepSeek-R1-Qwen-Distill-1.5B and DeepScaleR-1.5B-Preview.
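As a rough illustration of the weak-to-strong idea named in the citation below ("stealing reward" from a weak aligned model), here is a toy sketch: the log-probability shift that alignment induced in the weak model pair is used as an implicit reward margin, and a DPO-style logistic loss pushes the strong model's own margin toward it. This is a minimal sketch under my own assumptions; the function name is mine, and the exact WSPO objective in the paper may differ.

```python
import math

def wspo_loss(strong_logp_w, strong_logp_l,   # strong policy log-probs (chosen, rejected)
              ref_logp_w, ref_logp_l,         # strong reference model log-probs
              weak_aligned_logp_w, weak_aligned_logp_l,  # weak *aligned* model log-probs
              weak_logp_w, weak_logp_l,       # weak *unaligned* model log-probs
              beta=0.1):
    """Toy DPO-style loss using the weak pair's alignment shift as the target margin."""
    # Implicit reward margin "stolen" from the weak pair: how much alignment
    # shifted the weak model's preference between chosen and rejected.
    weak_margin = ((weak_aligned_logp_w - weak_logp_w)
                   - (weak_aligned_logp_l - weak_logp_l))
    # The strong model's own log-ratio margin against its reference.
    strong_margin = ((strong_logp_w - ref_logp_w)
                     - (strong_logp_l - ref_logp_l))
    # Logistic (-log sigmoid) loss pushing the strong margin past the weak one.
    logits = beta * (strong_margin - weak_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the strong model's margin exactly matches the weak pair's margin, the loss sits at log 2; it decreases as the strong model prefers the chosen response more strongly than the weak shift implies.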
@inproceedings{zhu2025weaktostrong,
  title={Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model},
  author={Wenhong Zhu and Zhiwei He and Xiaofeng Wang and Pengfei Liu and Rui Wang},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=f7KxfUrRSb}
}