Training Long-Context, Multi-Turn Software Engineering Agents with Reinforcement Learning
Abstract
Research on applications of Reinforcement Learning (RL) to Large Language Models (LLMs) has mostly focused on single-turn problems, such as mathematical reasoning or single-shot code generation. While these problems can be viewed as token-level multi-turn MDPs, this view corresponds to a degenerate case of multi-turn interaction in which the environment provides no feedback. This contrasts with many real-world domains, such as software engineering (SWE), which require rich multi-turn interactions with a stateful environment that responds to each action with a non-trivial observation. To bridge this gap, we demonstrate the successful application of RL to this general regime. Using a modified Decoupled Advantage Policy Optimization (DAPO) algorithm, we train an agent based on Qwen2.5-72B-Instruct to solve real-world software engineering tasks. Our approach increases the agent's success rate on the SWE-bench Verified benchmark from 20% for a rejection fine-tuned baseline to 39%, without relying on any teacher models. On SWE-rebench, our agent matches or outperforms leading open-weight models such as DeepSeek-V3-0324 and Qwen3-235B-A22B under identical scaffolding, offering a viable path toward building more capable autonomous agents for complex real-world problems on top of open models.
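For readers unfamiliar with DAPO-style training, the sketch below illustrates the general idea of group-relative advantages combined with a dynamic-sampling filter that discards prompts whose rollouts all receive the same reward. This is only a minimal illustration under those assumptions: the function name, group layout, and exact filtering rule are ours, not the authors' implementation, and the paper's specific modifications to DAPO are not reflected here.

```python
import numpy as np

def group_advantages(rewards_per_group):
    """Group-relative advantages with a dynamic-sampling-style filter.

    rewards_per_group: list of reward lists, one list per prompt, where
    each inner list holds the final rewards of that prompt's rollouts.
    Returns one advantage array per group, or None for filtered groups.
    """
    advantages = []
    for rewards in rewards_per_group:
        rewards = np.asarray(rewards, dtype=np.float64)
        # If every rollout got the same reward (all solved or all failed),
        # the group carries no learning signal and is dropped.
        if np.all(rewards == rewards[0]):
            advantages.append(None)
            continue
        # Standardize rewards within the group; each rollout's tokens
        # then share this scalar advantage in the policy-gradient loss.
        advantages.append((rewards - rewards.mean()) / (rewards.std() + 1e-8))
    return advantages

# Two prompts with 4 rollouts each; the second group is degenerate.
print(group_advantages([[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0]]))
```

In the multi-turn SWE setting, each reward would come from the outcome of an entire episode, for example whether the repository's tests pass after the agent's patch is applied.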
Community
Most RL for LLMs focuses on single-turn tasks with no real environment feedback — unlike real-world problems like software engineering (SWE), which require multi-turn interaction.
We apply RL to this harder setting using a modified DAPO algorithm and train a Qwen2.5-72B-Instruct agent — no teacher models, just interaction.
Our agent doubles the success rate of a rejection-tuned baseline (20% → 39%) on SWE-bench Verified and matches or beats top open models on SWE-rebench.
This shows that RL can unlock stronger autonomous agents in stateful, real-world environments — beyond static prompts and toy tasks.
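To make "multi-turn interaction with a stateful environment" concrete, here is a minimal rollout sketch. The `agent` and `env` objects and their methods (`act`, `step`, `task_description`, `final_reward`) are hypothetical interfaces chosen for illustration, not the authors' scaffold; the key point is that the environment answers every action with an observation, and the scalar reward arrives only at the end of the episode.

```python
def run_episode(agent, env, max_turns=50):
    """Roll out one multi-turn SWE episode (illustrative interfaces only)."""
    # The history starts from the task statement, e.g. an issue description.
    history = [{"role": "system", "content": env.task_description()}]
    for _ in range(max_turns):
        action = agent.act(history)            # next shell command / file edit
        observation, done = env.step(action)   # executed in a stateful sandbox
        history.append({"role": "assistant", "content": action})
        history.append({"role": "user", "content": observation})
        if done:
            break
    # Sparse outcome reward, e.g. 1.0 if the repository's tests pass.
    return history, env.final_reward()
```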
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Agent-RLVR: Training Software Engineering Agents via Guidance and Environment Rewards (2025)
- A Simple"Try Again"Can Elicit Multi-Turn LLM Reasoning (2025)
- L0: Reinforcement Learning to Become General Agents (2025)
- SWE-Dev: Building Software Engineering Agents with Training and Inference Scaling (2025)
- How to Train Your LLM Web Agent: A Statistical Diagnosis (2025)
- Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning (2025)
- ReVeal: Self-Evolving Code Agents via Iterative Generation-Verification (2025)
This work explores RL for SWE agents without any external teacher model (pure RL), and we achieve a 2x boost in resolved rate.