R1-Code-Interpreter: Training LLMs to Reason with Code via Supervised and Reinforcement Learning
Abstract
R1-Code-Interpreter extends text-only LLMs with the ability to generate and execute code during step-by-step reasoning, trained via supervised fine-tuning and reinforcement learning, improving performance on diverse reasoning and planning tasks.
Despite the reasoning and planning advances of R1-like models, Large Language Models (LLMs) still struggle with tasks requiring precise computation, symbolic manipulation, optimization, and algorithmic reasoning, where textual reasoning lacks the rigor of code execution. A key challenge is enabling LLMs to decide when to use textual reasoning versus code generation. While OpenAI trains models to invoke a Code Interpreter as needed, public research lacks guidance on aligning pre-trained LLMs to effectively leverage code and generalize across diverse tasks. We present R1-Code-Interpreter, an extension of a text-only LLM trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL) to autonomously generate multiple code queries during step-by-step reasoning. We curate 144 reasoning and planning tasks (107 for training, 37 for testing), each with over 200 diverse questions. We fine-tune Qwen-2.5 models (3B/7B/14B) using various SFT and RL strategies, investigating different answer formats, reasoning vs. non-reasoning models, cold vs. warm starts, GRPO vs. PPO, and masked vs. unmasked code outputs. Unlike prior RL work on narrow domains, we find that Code Interpreter training is significantly harder due to high task diversity and expensive code execution, highlighting the critical role of the SFT stage. Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.0% to 64.1%, outperforming GPT-4o (text-only: 58.6%) and approaching GPT-4o with Code Interpreter (70.9%), while exhibiting emergent self-checking behavior via code generation. Datasets, code, and models are available at https://github.com/yongchao98/R1-Code-Interpreter and https://huggingface.co/yongchao98.
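To make the multi-turn Code Interpreter loop described in the abstract concrete, below is a minimal sketch (not the authors' implementation) of how a model can interleave textual reasoning with code execution: generation pauses when a Python block is emitted, the block is run in a sandbox, and the execution output is appended to the context before the next turn. The `generate` callable and `run_in_sandbox` helper are illustrative assumptions.

```python
import re
import subprocess
import sys

# Build the fence string programmatically to avoid nesting backticks in this example.
FENCE = "`" * 3
CODE_RE = re.compile(FENCE + r"python\n(.*?)" + FENCE, re.DOTALL)

def run_in_sandbox(code: str, timeout: int = 10) -> str:
    """Execute generated Python in a subprocess and capture its output.
    A real system would use a properly isolated sandbox with resource limits."""
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, timeout=timeout)
    return proc.stdout + proc.stderr

def code_interpreter_loop(generate, prompt: str, max_turns: int = 5) -> str:
    """Interleave textual reasoning with code execution for up to `max_turns` turns.

    `generate(context)` is a hypothetical helper returning the model's next
    completion, which may contain a fenced Python block or a final answer.
    """
    context = prompt
    for _ in range(max_turns):
        completion = generate(context)
        context += completion
        match = CODE_RE.search(completion)
        if match is None:
            break  # no code requested: the completion holds the final textual answer
        # Execute the generated code and feed the result back for the next turn.
        result = run_in_sandbox(match.group(1))
        context += "\n" + FENCE + "output\n" + result + "\n" + FENCE + "\n"
    return context
```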
Community
We present a framework that integrates a Code Interpreter into LLM reasoning and planning using supervised and reinforcement learning. Our fine-tuned model, R1-CI-14B, outperforms GPT-4o without a Code Interpreter and nears its performance with one. While adapting GRPO improves capabilities, its impact is limited by task diversity, highlighting the importance of a strong base model and supervised fine-tuning. To our knowledge, this is the first open-source, general-purpose Code Interpreter trained with these methods.
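As a concrete illustration of the "masked vs. unmasked code outputs" comparison mentioned in the abstract, the sketch below shows one common way to mask interpreter-output tokens when computing a token-level training loss, so that gradients flow only through tokens the model itself generated. The tensor names and the generic per-token loss are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def masked_policy_loss(per_token_loss: torch.Tensor,
                       interpreter_mask: torch.Tensor) -> torch.Tensor:
    """Average an unreduced per-token loss (e.g., SFT cross-entropy or a
    PPO/GRPO policy-gradient term) over model-generated tokens only.

    per_token_loss:   (batch, seq_len) loss contribution of each token
    interpreter_mask: (batch, seq_len) 1 where the token came from the code
                      interpreter (execution output), 0 for model tokens
    """
    keep = 1.0 - interpreter_mask.float()  # drop interpreter-output tokens from the loss
    return (per_token_loss * keep).sum() / keep.sum().clamp(min=1.0)

# Toy usage: one sequence of 6 tokens, where tokens 3-4 were injected
# interpreter output and should not contribute to the gradient.
loss = torch.tensor([[0.5, 0.2, 0.9, 1.5, 1.2, 0.3]])
mask = torch.tensor([[0, 0, 0, 1, 1, 0]])
print(masked_policy_loss(loss, mask))  # averages over the 4 model-generated tokens
```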
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- ReTool: Reinforcement Learning for Strategic Tool Use in LLMs (2025)
- Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning (2025)
- Learning to Reason without External Rewards (2025)
- ToolRL: Reward is All Tool Learning Needs (2025)
- Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning (2025)
- Enigmata: Scaling Logical Reasoning in Large Language Models with Synthetic Verifiable Puzzles (2025)
- X-Reasoner: Towards Generalizable Reasoning Across Modalities and Domains (2025)