OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling
Llama_32_3B_megamath_web_pro_megamath_synth_qa_91_bs4M_seq8k_20B
What makes a base language model suitable for RL? Understanding the factors behind divergent RL scaling remains vital.
We investigate the impact of several mid-training factors on RL performance through head-to-head experiments, as shown in the figure below. Specifically, we examine the effects of the data quality of math web corpora, the inclusion or exclusion of QA-format data, the nature of the QA data itself, the presence of general instruction-following data during mid-training, and the pre-training token budget. These systematic analyses deepen our understanding of the connection between pre-training and RL dynamics and help us identify suitable recipes for scaled-up mid-training.

Figure: mid-training configurations and the corresponding RL training dynamics.

Takeaway: QA data can aid RL scaling, but the gains depend on its distributional gap from downstream tasks. Long-CoT patterns often induce overly long responses and sudden performance drops in RL-tuned models.
More about OctoThinker

Citation
Check out our paper for more details. If you use our models or datasets, or find our work useful, please cite:
@article{wang2025octothinker,
  title={OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling},
  author={Wang, Zengzhi and Zhou, Fan and Li, Xuefeng and Liu, Pengfei},
  journal={arXiv preprint arXiv:2506.20512},
  year={2025},
  note={Preprint}
}
Model tree for OctoThinker/Llama_32_3B_megamath_web_pro_open_r1_longcot_91_bs4M_seq8k_20B
Base model: meta-llama/Llama-3.2-3B
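
If you want to try the checkpoint, below is a minimal loading sketch using the Hugging Face transformers library. The repository id is taken from the model tree above; the prompt and generation settings are illustrative assumptions, not the authors' recommended setup.

```python
# Minimal sketch: load an OctoThinker checkpoint with Hugging Face transformers.
# Assumptions: the repository id below (from the model tree above) is available
# on the Hub, and the checkpoint is a base (non-chat) language model, so a plain
# completion-style prompt is used.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OctoThinker/Llama_32_3B_megamath_web_pro_open_r1_longcot_91_bs4M_seq8k_20B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's stored precision
    device_map="auto",    # requires `accelerate`; spreads weights across available devices
)

prompt = "Question: What is 17 * 24?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Swap in a different repository id to try other OctoThinker mid-training variants, such as the one named in this card's title.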