---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
base_model:
- Satori-reasoning/Satori-SFT-7B
---
Satori-RM-7B is the outcome reward model (ORM) used to train our RL model Satori-7B-Round2. Usage of Satori-RM-7B can be found in our released RL training code.
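The authoritative scoring interface lives in the released RL training code; purely as an illustrative sketch, the snippet below assumes the checkpoint loads as a standard Hugging Face causal LM (consistent with the `text-generation` pipeline tag above). The prompt template and the way a scalar reward is read from the outputs are placeholders, not our actual recipe.

```python
# Illustrative sketch only: the authoritative usage of Satori-RM-7B is in the
# released RL training code (https://github.com/Satori-reasoning/Satori).
# Assumption: the checkpoint loads as a standard causal LM; how a scalar
# outcome reward is extracted from the outputs is a placeholder here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Satori-reasoning/Satori-RM-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

question = "What is 12 * 13?"
solution = "12 * 13 = 156. The answer is 156."

# Placeholder prompt format: the real (question, solution) template used for
# RM scoring is defined in the Satori RL training code.
inputs = tokenizer(question + "\n" + solution, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

# Placeholder reward readout, e.g. a score derived from the logits at the final
# position; replace with the extraction logic from the released training code.
last_logits = outputs.logits[0, -1]
print(last_logits.shape)
```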
# Resources
We provide our training datasets:
- Full format tuning dataset with 300K unique questions.
- RL dataset with 550K unique questions.
Please refer to our blog and research paper for more technical details about Satori.
For code, see https://github.com/Satori-reasoning/Satori
# Citation
If you find our model and data helpful, please cite our paper:
```bibtex
@misc{shen2025satorireinforcementlearningchainofactionthought,
      title={Satori: Reinforcement Learning with Chain-of-Action-Thought Enhances LLM Reasoning via Autoregressive Search},
      author={Maohao Shen and Guangtao Zeng and Zhenting Qi and Zhang-Wei Hong and Zhenfang Chen and Wei Lu and Gregory Wornell and Subhro Das and David Cox and Chuang Gan},
      year={2025},
      eprint={2502.02508},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.02508},
}
```