📝 Trust Region Preference Approximation: A simple and stable reinforcement learning algorithm for LLM reasoning
Checkpoints of Trust Region Preference Approximation (TRPA) on DeepScaleR based on DeepSeek-R1-Distill-Qwen-7B using the 0326 version of verl0326
📚 Citation
@article{su2025trust,
title={Trust region preference approximation: A simple and stable reinforcement learning algorithm for llm reasoning},
author={Su, Xuerui and Xie, Shufang and Liu, Guoqing and Xia, Yingce and Luo, Renqian and Jin, Peiran and Ma, Zhiming and Wang, Yue and Wang, Zun and Liu, Yuting},
journal={arXiv preprint arXiv:2504.04524},
year={2025}
}
- Downloads last month
- 9
Inference Providers
NEW
This model isn't deployed by any Inference Provider.
🙋
Ask for provider support
Model tree for Xuerui2312/DeepSeek-R1-Distill-Qwen-7B-TRPA-DeepScaleR-verl0326
Base model
deepseek-ai/DeepSeek-R1-Distill-Qwen-7B