OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling
Llama_32_3B_megamath_web_pro_megamath_synth_qa_91_bs4M_seq8k_20B
What makes a base language model suitable for RL? Understanding the factors behind divergent RL scaling remains vital.
We investigate the impact of several mid-training factors on RL performance through head-to-head experiments, as shown in the figure below. Specifically, we examine the effects of the data quality of math web corpora, the inclusion or exclusion of QA-format data, the nature of the QA data itself, the presence of general instruction-following data during mid-training, and the pre-training token budget. These systematic analyses deepen our understanding of the connection between pre-training and RL dynamics and help us identify suitable recipes for scaled-up mid-training.

Figure: mid-training configurations and the corresponding RL training dynamics.

Takeaway: QA data can aid RL scaling, but the gains depend on its distributional gap from downstream tasks. Long-CoT patterns often induce overly long responses and sudden performance drops in RL-tuned models.
More about OctoThinker

Citation
Check out our paper for more details. If you use our models or datasets, or find our work useful, please cite:
@article{wang2025octothinker,
  title={OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling},
  author={Wang, Zengzhi and Zhou, Fan and Li, Xuefeng and Liu, Pengfei},
  journal={arXiv preprint arXiv:2506.20512},
  year={2025},
  note={Preprint}
}
Model tree for OctoThinker/Llama_32_3B_megamath_web_pro_open_r1_longcot_91_bs4M_seq8k_20B
Base model: meta-llama/Llama-3.2-3B
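
If you want to try the checkpoint, below is a minimal loading sketch using the Hugging Face transformers library. The repository id is taken from the model tree above; the prompt and generation settings are illustrative assumptions, not the authors' recommended setup.

```python
# Minimal sketch: load an OctoThinker checkpoint with Hugging Face transformers.
# Assumptions: the repository id below (from the model tree above) is available
# on the Hub, and the checkpoint is a base (non-chat) language model, so a plain
# completion-style prompt is used.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OctoThinker/Llama_32_3B_megamath_web_pro_open_r1_longcot_91_bs4M_seq8k_20B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # use the checkpoint's stored precision
    device_map="auto",    # requires `accelerate`; spreads weights across available devices
)

prompt = "Question: What is 17 * 24?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Swap in a different repository id to try other OctoThinker mid-training variants, such as the one named in this card's title.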