---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
base_model:
- Satori-reasoning/Satori-SFT-7B
---
Satori-RM-7B is the outcome reward model (ORM) used to train our RL model Satori-7B-Round2. Usage of Satori-RM-7B can be found in our released RL training code.
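The authoritative scoring interface lives in the released RL training code; purely as an illustrative sketch, the snippet below assumes the checkpoint loads as a standard Hugging Face causal LM (consistent with the `text-generation` pipeline tag above). The prompt template and the way a scalar reward is read from the outputs are placeholders, not our actual recipe.

```python
# Illustrative sketch only: the authoritative usage of Satori-RM-7B is in the
# released RL training code (https://github.com/Satori-reasoning/Satori).
# Assumption: the checkpoint loads as a standard causal LM; how a scalar
# outcome reward is extracted from the outputs is a placeholder here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Satori-reasoning/Satori-RM-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

question = "What is 12 * 13?"
solution = "12 * 13 = 156. The answer is 156."

# Placeholder prompt format: the real (question, solution) template used for
# RM scoring is defined in the Satori RL training code.
inputs = tokenizer(question + "\n" + solution, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

# Placeholder reward readout, e.g. a score derived from the logits at the final
# position; replace with the extraction logic from the released training code.
last_logits = outputs.logits[0, -1]
print(last_logits.shape)
```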
# Resources
We provide our training datasets:
- Full format tuning dataset with 300K unique questions.
- RL dataset with 550K unique questions.
Please refer to our blog and research paper for more technical details about Satori.
For code, see https://github.com/Satori-reasoning/Satori
# Citation
If you find our model and data helpful, please cite our paper:
```bibtex
@misc{shen2025satorireinforcementlearningchainofactionthought,
      title={Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search},
      author={Maohao Shen and Guangtao Zeng and Zhenting Qi and Zhang-Wei Hong and Zhenfang Chen and Wei Lu and Gregory Wornell and Subhro Das and David Cox and Chuang Gan},
      year={2025},
      eprint={2502.02508},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.02508},
}
```