Harnessing Negative Signals: Reinforcement Distillation from Teacher Data for LLM Reasoning
Abstract
Reinforcement Distillation (REDI) leverages both positive and negative traces to enhance large language model reasoning performance offline, outperforming traditional methods and achieving state-of-the-art results with limited open data.
Recent advances in model distillation demonstrate that data from advanced reasoning models (e.g., DeepSeek-R1, OpenAI's o1) can effectively transfer complex reasoning abilities to smaller, efficient student models. However, standard practice relies on rejection sampling, discarding incorrect reasoning examples, which are valuable yet often underutilized data. This paper addresses the critical question: How can both positive and negative distilled reasoning traces be effectively leveraged to maximize LLM reasoning performance in an offline setting? To this end, we propose Reinforcement Distillation (REDI), a two-stage framework. Stage 1 learns from positive traces via Supervised Fine-Tuning (SFT). Stage 2 further refines the model using both positive and negative traces through our proposed REDI objective. This novel objective is a simple, reference-free loss function that outperforms established methods like DPO and SimPO in this distillation context. Our empirical evaluations demonstrate REDI's superiority over baseline Rejection Sampling SFT as well as SFT combined with DPO/SimPO on mathematical reasoning tasks. Notably, the Qwen-REDI-1.5B model, post-trained on just 131k positive and negative examples from the open Open-R1 dataset, achieves an 83.1% score on MATH-500 (pass@1). Its performance matches or surpasses that of DeepSeek-R1-Distill-Qwen-1.5B (a model post-trained on 800k proprietary examples) across various mathematical reasoning benchmarks, establishing a new state-of-the-art for 1.5B models post-trained offline with openly available data.
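The abstract specifies only that Stage 2 optimizes a simple, reference-free loss over paired positive and negative traces; the exact REDI objective is given in the paper itself. The sketch below is a minimal PyTorch illustration of what such a reference-free objective could look like, not the paper's formulation: the length normalization, the asymmetric weight `alpha`, and the helper names are all assumptions made for illustration.

```python
# Hypothetical sketch of a reference-free loss over paired positive/negative
# reasoning traces. The length normalization and the asymmetric weight
# `alpha` below are illustrative assumptions, not the paper's exact objective.
import torch


def sequence_logprob(logits: torch.Tensor, labels: torch.Tensor, pad_id: int) -> torch.Tensor:
    """Length-normalized log-likelihood of each sequence under the model."""
    # Shift so that tokens < t predict token t.
    logits, labels = logits[:, :-1], labels[:, 1:]
    logp = torch.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)
    mask = (labels != pad_id).float()
    return (token_logp * mask).sum(-1) / mask.sum(-1).clamp(min=1)


def redi_style_loss(pos_logits, pos_labels, neg_logits, neg_labels,
                    pad_id: int, alpha: float = 0.8) -> torch.Tensor:
    """Reward positive traces and penalize negative ones, with no reference model.

    alpha < 1 down-weights the negative term; this asymmetry is a guess at
    how such an objective might avoid degenerate un-learning.
    """
    logp_pos = sequence_logprob(pos_logits, pos_labels, pad_id)
    logp_neg = sequence_logprob(neg_logits, neg_labels, pad_id)
    return (-logp_pos + alpha * logp_neg).mean()
```

In this sketch, `pos_logits`/`neg_logits` are the student model's logits on teacher-generated correct and incorrect traces for the same prompt; being reference-free means no frozen reference model is needed, unlike DPO.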
Community
Interesting discussions on incorporating negative reasoning traces into distillation.
We introduce Reinforcement Distillation (REDI), a method proving that incorrect (negative) reasoning examples are valuable for distillation. Our Qwen-REDI-1.5B
model, trained with REDI, sets a new state-of-the-art for math reasoning among models post-trained offline with open data.
The following related papers were recommended by the Semantic Scholar API:
- Advancing Multimodal Reasoning via Reinforcement Learning with Cold Start (2025)
- AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning (2025)
- Towards Revealing the Effectiveness of Small-Scale Fine-tuning in R1-style Reinforcement Learning (2025)
- Unearthing Gems from Stones: Policy Optimization with Negative Sample Augmentation for LLM Reasoning (2025)
- Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math (2025)
- GRPO-LEAD: A Difficulty-Aware Reinforcement Learning Approach for Concise Mathematical Reasoning in Language Models (2025)
- A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce (2025)