Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
Abstract
Research on applications of Reinforcement Learning (RL) to Large Language Models (LLMs) has mostly focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn MDPs, this view corresponds to a degenerate case of multi-turn interaction in which the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation. To bridge this gap, we demonstrate the successful application of RL to this general regime. Using a modified Decoupled Advantage Policy Optimization (DAPO) algorithm, we train an agent based on Qwen2.5-72B-Instruct to solve real-world software engineering tasks. Our approach increases the agent's success rate on the SWE-bench Verified benchmark from 20% for a rejection fine-tuned baseline to 39%, without relying on any teacher models. On SWE-rebench, our agent matches or outperforms leading open-weight models such as DeepSeek-V3-0324 and Qwen3-235B-A22B under identical scaffolding, offering a viable path toward building more capable autonomous agents for complex real-world problems on top of open models.
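For readers unfamiliar with DAPO-style training, the sketch below illustrates the general idea of group-relative advantages combined with a dynamic-sampling filter that discards prompts whose rollouts all receive the same reward. This is only a minimal illustration under those assumptions: the function name, group layout, and exact filtering rule are ours, not the authors' implementation, and the paper's specific modifications to DAPO are not reflected here.

```python
import numpy as np

def group_advantages(rewards_per_group):
    """Group-relative advantages with a dynamic-sampling-style filter.

    rewards_per_group: list of reward lists, one list per prompt, where
    each inner list holds the final rewards of that prompt's rollouts.
    Returns one advantage array per group, or None for filtered groups.
    """
    advantages = []
    for rewards in rewards_per_group:
        rewards = np.asarray(rewards, dtype=np.float64)
        # If every rollout got the same reward (all solved or all failed),
        # the group carries no learning signal and is dropped.
        if np.all(rewards == rewards[0]):
            advantages.append(None)
            continue
        # Standardize rewards within the group; each rollout's tokens
        # then share this scalar advantage in the policy-gradient loss.
        advantages.append((rewards - rewards.mean()) / (rewards.std() + 1e-8))
    return advantages

# Two prompts with 4 rollouts each; the second group is degenerate.
print(group_advantages([[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0]]))
```

In the multi-turn SWE setting, each reward would come from the outcome of an entire episode, for example whether the repository's tests pass after the agent's patch is applied.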
Community
Most RL for LLMs focuses on single-turn tasks with no real environment feedback — unlike real-world problems like software engineering (SWE), which require multi-turn interaction.
We apply RL to this harder setting using a modified DAPO algorithm and train a Qwen2.5-72B-Instruct agent — no teacher models, just interaction.
Our agent doubles the success rate of a rejection-tuned baseline (20% → 39%) on SWE-bench Verified and matches or beats top open models on SWE-rebench.
This shows that RL can unlock stronger autonomous agents in stateful, real-world environments — beyond static prompts and toy tasks.
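To make "multi-turn interaction with a stateful environment" concrete, here is a minimal rollout sketch. The `agent` and `env` objects and their methods (`act`, `step`, `task_description`, `final_reward`) are hypothetical interfaces chosen for illustration, not the authors' scaffold; the key point is that the environment answers every action with an observation, and the scalar reward arrives only at the end of the episode.

```python
def run_episode(agent, env, max_turns=50):
    """Roll out one multi-turn SWE episode (illustrative interfaces only)."""
    # The history starts from the task statement, e.g. an issue description.
    history = [{"role": "system", "content": env.task_description()}]
    for _ in range(max_turns):
        action = agent.act(history)            # next shell command / file edit
        observation, done = env.step(action)   # executed in a stateful sandbox
        history.append({"role": "assistant", "content": action})
        history.append({"role": "user", "content": observation})
        if done:
            break
    # Sparse outcome reward, e.g. 1.0 if the repository's tests pass.
    return history, env.final_reward()
```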
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards (2025)
- A Simple"Try Again"Can Elicit Multi-Turn LLM Reasoning (2025)
- L0: Reinforcement Learning to Become General Agents (2025)
- SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling (2025)
- How to Train Your LLM Web Agent: A Statistical Diagnosis (2025)
- Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning (2025)
- ReVeal: Self-Evolving Code Agents via Iterative Generation-Verification (2025)
This work explores RL for SWE agents without any external teacher model (pure RL), and we achieve a 2x boost in resolved rate.