The base Qwen2.5-Math-7B model used by LUFFY. We change rope_theta from 10000 to 40000 and extend the context window to 16k. We also modify the chat_template for the system prompt and add .
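The configuration change described above can be sketched as follows. This is only an illustration on a plain dictionary, not the authors' actual conversion script. The helper name `extend_context` is hypothetical, the field names `rope_theta` and `max_position_embeddings` are assumed to follow the usual Qwen2-style `config.json` layout, "16k" is assumed to mean 16384 tokens, and the starting context length in the example is a placeholder.

```python
import json


def extend_context(config: dict,
                   rope_theta: float = 40000.0,
                   max_positions: int = 16384) -> dict:
    """Return a copy of `config` with a larger RoPE base and context window.

    Hypothetical helper: the key names are assumed to match a Qwen2-style
    config.json, and 16384 is an assumed reading of "16k".
    """
    updated = dict(config)  # shallow copy so the input dict is left untouched
    updated["rope_theta"] = rope_theta
    updated["max_position_embeddings"] = max_positions
    return updated


# Placeholder starting values; only rope_theta=10000 is stated in the text.
base = {"rope_theta": 10000.0, "max_position_embeddings": 4096}
patched = extend_context(base)
print(json.dumps(patched, indent=2))
```

In practice the same two fields would be edited in the checkpoint's `config.json`; the chat_template change lives in the tokenizer configuration and is not covered by this sketch.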
If you find our model, data, or evaluation code useful, please cite our paper:
@misc{luffy,
title={Learning to Reason under Off-Policy Guidance},
author={Jianhao Yan and Yafu Li and Zican Hu and Zhi Wang and Ganqu Cui and Xiaoye Qu and Yu Cheng and Yue Zhang},
year={2025},
eprint={2504.14945},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2504.14945},
}