arXiv:2506.08007

Reinforcement Pre-Training

Published on Jun 9 · Submitted by unilm on Jun 10
#1 Paper of the day

AI-generated summary

Reinforcement Pre-Training (RPT) improves language model accuracy through reinforcement learning and offers a scalable method for leveraging text data for general-purpose RL.

Abstract

In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where the model receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves the language modeling accuracy of predicting the next tokens. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves the next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.
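
To make the reward concrete, here is a minimal sketch of the verifiable next-token reward described above. This is an illustration only, not the authors' code: the paper's prompt template, byte-level matching, and on-policy RL algorithm are not reproduced, and `policy` is a hypothetical stand-in for a reasoning LLM plus an answer parser.

```python
# Minimal sketch of the verifiable next-token reward (illustration only).

def next_token_reward(predicted: str, ground_truth: str) -> float:
    """Binary verifiable reward: 1.0 iff the committed guess equals the true next token."""
    return 1.0 if predicted == ground_truth else 0.0

def rpt_rollout(context: str, true_next_token: str, policy) -> float:
    """One rollout: the policy reasons about `context`, then commits to a guess."""
    prompt = (
        "Predict the next token of the following text. "
        "Think step by step, then state your final answer.\n\n" + context
    )
    guess = policy(prompt)  # hypothetical: returns the model's final guessed token
    return next_token_reward(guess, true_next_token)

# Toy usage with a trivial stand-in policy:
dummy_policy = lambda prompt: " the"
print(rpt_rollout("The cat sat on", " the", dummy_policy))  # -> 1.0
```

The point of the construction is that the reward needs no human annotation: the corpus itself supplies the ground-truth next token.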

Community


What is the performance on other math reasoning benchmarks? (aime24, aime25, math500, etc.)

Paper author

Thanks for the question. According to the 'Before RL' column of Table 2, RPT already achieves stronger performance on math problems before reinforcement fine-tuning.
We've also obtained positive results on the math datasets you mentioned. We're continuing to scale up and organize our work, and we'll release evaluation results from larger-scale experiments in the coming period, including the math datasets you're interested in.

I never thought RL could be used for pre-training

Seems cool. You could say this is 'NTR' [next token reasoning].


Excellent paper, but I wonder about the training cost. The causal mask in the original GPT makes pre-training efficient, but in this work it seems hard to bring the causal mask into RPT, so won't that increase the cost of RPT?


I wonder the same. In my interpretation it's pre-training in the sense that it's self-supervised training on a curated dataset, but it's not the same as standard pre-training in terms of compute efficiency.
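
For intuition, here is a rough sketch of the compute gap being discussed, not the paper's implementation: it assumes a Hugging Face-style causal LM whose output exposes `.logits`, and `generate_fn` is a hypothetical rollout function returning the policy's committed next-token guess for each prefix.

```python
# Rough sketch of the compute contrast (not the paper's implementation).
import torch
import torch.nn.functional as F

def standard_lm_loss(model, input_ids):
    """Teacher forcing: one forward pass supervises every position at once,
    because the causal mask lets all next-token losses share activations."""
    logits = model(input_ids).logits                     # [B, T, V], HF-style interface
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),
        input_ids[:, 1:].reshape(-1),
    )

def rpt_style_rewards(generate_fn, input_ids):
    """RPT-style training: each supervised position needs its own reasoning
    rollout, so cost grows with (positions x generation length)."""
    rewards = []
    for t in range(1, input_ids.size(1)):
        prefix = input_ids[:, :t]
        guess = generate_fn(prefix)                      # hypothetical: [B] guessed token ids
        rewards.append((guess == input_ids[:, t]).float())
    return torch.stack(rewards)                          # [T-1, B] binary rewards
```

With the causal mask, every position of a sequence is supervised in a single forward pass; the rollout loop instead pays a full chain-of-thought generation per supervised position.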

What would happen if you applied RPT recursively - having the model reason about each token within its own reasoning chain? Would meta-reasoning about the reasoning process itself lead to even better performance, or would the computational overhead outweigh the benefits? :)

I see the paper says RPT is initialized from a reasoning model and mentions investigating RPT from a standard base LLM under Future Work. I wonder how, or whether, the training and thought process would differ when initialized from a base LLM instead of a reasoning model.

Paper author

We're working on it. Stay tuned!


If I am not mistaken, your approach doesn't allow the massively parallel scaling of standard pre-training, so you shouldn't be constrained to just next-token prediction.

Have you considered other RL objectives inspired by pre-training besides next-token prediction, such as masked token prediction or next sentence prediction from BERT?
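
For example, a masked-token variant of the same verifiable-reward recipe could look like the sketch below. This is purely hypothetical; nothing like it is reported in the paper.

```python
# Hypothetical masked-token variant of the verifiable-reward recipe.
import random

def make_masked_task(tokens, mask_token="[MASK]"):
    """Hide one random token; the verifiable answer is the hidden token."""
    i = random.randrange(len(tokens))
    masked = tokens[:i] + [mask_token] + tokens[i + 1:]
    return " ".join(masked), tokens[i]

def masked_token_reward(guess: str, answer: str) -> float:
    return 1.0 if guess == answer else 0.0

# Toy usage:
task, answer = make_masked_task("the cat sat on the mat".split())
print(task, "| answer:", answer)
```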

Can you provide your fine-tuning code? I'm interested in applying the same approach to the reasoning model MiMo-7B, using the same proxy model for entropy, pre-processing the same dataset first, and then training with PPO and binary rewards. Do you think this is achievable on a single H100, using vLLM for generation and splitting vLLM/training by a 30%/70% ratio with a shorter sequence length, since the MiMo model doesn't tend to be very verbose? Also, when using the dataset, do you combine the question with the answer and do next-token prediction on the whole text, or only on the answer? I have written my own training code, but I'm really interested in seeing your implementation, since this needs memory efficiency and speed.
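
As a reference point for the entropy pre-processing step mentioned above, a proxy-model entropy filter might look like the sketch below. This is my own guess at the setup: the proxy model name, threshold, and single-sequence batching are assumptions, not the authors' released code.

```python
# Sketch of entropy-based token filtering with a small proxy model (assumptions only).
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

PROXY_NAME = "gpt2"  # stand-in proxy; swap in whichever small model you use
tok = AutoTokenizer.from_pretrained(PROXY_NAME)
proxy = AutoModelForCausalLM.from_pretrained(PROXY_NAME).eval()

@torch.no_grad()
def high_entropy_positions(text: str, threshold: float = 2.0):
    """Return target positions whose next-token entropy under the proxy model
    exceeds `threshold`; low-entropy ("easy") positions would be skipped."""
    ids = tok(text, return_tensors="pt").input_ids       # [1, T]
    logits = proxy(input_ids=ids).logits[0, :-1]         # prediction for token t+1 from each prefix
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1)
    return [t + 1 for t, h in enumerate(entropy.tolist()) if h > threshold]
```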

Is it even possible to do RPT-zero, i.e., purely RPT a model from scratch?

