---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
base_model:
- Satori-reasoning/Satori-SFT-7B
---

**Satori-RM-7B** is the outcome reward model used to train our RL model [Satori-7B-Round2](https://huggingface.co/Satori-reasoning/Satori-7B-Round2). The usage of **Satori-RM-7B** can be found in our released [RL training code](https://github.com/satori-reasoning/Satori).

# **Resources**

We provide our training datasets:
- [Full format tuning dataset](https://huggingface.co/datasets/Satori-reasoning/Satori_FT_data) with 300K unique questions.
- [RL dataset](https://huggingface.co/datasets/Satori-reasoning/Satori_RL_data) with 550K unique questions.

Please refer to our blog and research paper for more technical details of Satori.
- [Blog](https://satori-reasoning.github.io/blog/satori/)
- [Paper](https://arxiv.org/pdf/2502.02508)

For code, see https://github.com/Satori-reasoning/Satori

# **Citation**

If you find our model and data helpful, please cite our paper:
```
@misc{shen2025satorireinforcementlearningchainofactionthought,
      title={Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search},
      author={Maohao Shen and Guangtao Zeng and Zhenting Qi and Zhang-Wei Hong and Zhenfang Chen and Wei Lu and Gregory Wornell and Subhro Das and David Cox and Chuang Gan},
      year={2025},
      eprint={2502.02508},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.02508},
}
```
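# **Loading the Checkpoint**

Below is a minimal sketch of loading this checkpoint with `transformers`, inferred from the `library_name` and `pipeline_tag` in the metadata above. The repo id `Satori-reasoning/Satori-RM-7B` and the example (question, solution) pair are assumptions for illustration; the actual reward-scoring interface (prompt format and how a scalar outcome reward is read out) is defined in the released RL training code linked above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Satori-reasoning/Satori-RM-7B"  # assumed repo id for this card
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
model.eval()

# Example (question, solution) pair -- purely illustrative; the RL training
# code defines the actual prompt format expected by the reward model.
text = "Question: What is 12 * 13?\nSolution: 12 * 13 = 156."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# With a causal-LM head, `outputs.logits` has shape (batch, seq_len, vocab_size).
# How a scalar outcome reward is extracted from the model outputs is an
# assumption left to the released RL training code, not shown here.
print(outputs.logits.shape)
```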