arXiv:2504.21798

SWE-smith: Scaling Data for Software Engineering Agents

Published on Apr 30 · Submitted by john-b-yang on May 7

Abstract

Despite recent progress in Language Models (LMs) for software engineering, collecting training data remains a significant pain point. Existing datasets are small, with at most 1,000s of training instances from 11 or fewer GitHub repositories. The procedures to curate such datasets are often complex, necessitating hundreds of hours of human labor; companion execution environments also take up several terabytes of storage, severely limiting their scalability and usability. To address this pain point, we introduce SWE-smith, a novel pipeline for generating software engineering training data at scale. Given any Python codebase, SWE-smith constructs a corresponding execution environment, then automatically synthesizes 100s to 1,000s of task instances that break existing test(s) in the codebase. Using SWE-smith, we create a dataset of 50k instances sourced from 128 GitHub repositories, an order of magnitude larger than all previous works. We train SWE-agent-LM-32B, achieving 40.2% Pass@1 resolve rate on the SWE-bench Verified benchmark, state of the art among open source models. We open source SWE-smith (collection procedure, task instances, trajectories, models) to lower the barrier of entry for research in LM systems for automated software engineering. All assets available at https://swesmith.com.
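As a rough illustration of the validation step the abstract describes (a synthesized change only becomes a task instance if it breaks at least one previously passing test), the sketch below shows the core loop. It is a minimal mock-up, not the actual SWE-smith implementation: the `failing_tests` and `is_valid_task` helpers, the plain pytest invocation, and the git-based patch handling are all assumptions here; the real pipeline runs tests inside the constructed execution environments.

```python
import subprocess
from pathlib import Path

def failing_tests(repo: Path) -> set[str]:
    """Run the repo's pytest suite and return the IDs of failing tests.
    (Hypothetical helper; assumes the suite runs with plain pytest.)"""
    result = subprocess.run(
        ["python", "-m", "pytest", "-q", "--tb=no"],
        cwd=repo, capture_output=True, text=True,
    )
    # pytest's short test summary prints lines like:
    #   FAILED tests/test_io.py::test_roundtrip - AssertionError
    return {
        line.split()[1]
        for line in result.stdout.splitlines()
        if line.startswith("FAILED")
    }

def is_valid_task(repo: Path, bug_patch: str) -> bool:
    """Keep a synthesized change only if it flips at least one
    previously passing test from passing to failing."""
    before = failing_tests(repo)
    subprocess.run(["git", "apply", "-"], cwd=repo,
                   input=bug_patch, text=True, check=True)
    try:
        after = failing_tests(repo)
    finally:
        # Revert the patch so the repo is clean for the next candidate.
        subprocess.run(["git", "apply", "--reverse", "-"], cwd=repo,
                       input=bug_patch, text=True, check=True)
    return bool(after - before)
```

The set difference `after - before` is what enforces "break existing test(s)": pre-existing failures are ignored, so only tests that the synthesized change newly breaks qualify the candidate as a task instance.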

Community

Paper author and submitter:

Cracked 40% on SWE-bench Verified (single attempt, no verifiers) with 100% open-source models, agent, and data.

We know that finetuning & RL are very promising for training LMs as coding agents -- for SWE tasks like SWE-bench, the bottleneck is where to get the training data.

We've filled this gap with SWE-smith, a toolkit for synthesizing 100s to 1000s of task instances for any Python repo.

We generated 50k+ task instances for 128 popular GitHub repositories, then trained our own LM for SWE-agent. SWE-agent-LM-32B achieves 40.2% pass@1 on SWE-bench Verified -- a new SoTA among open-source models.
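For scale on that number: SWE-bench Verified contains 500 instances, and single-attempt pass@1 is simply the fraction of instances resolved, so 40.2% corresponds to 201 resolved tasks. A trivial sketch of the computation (illustrative only):

```python
def pass_at_1(resolved: list[bool]) -> float:
    """Single-attempt pass@1: fraction of task instances resolved."""
    return sum(resolved) / len(resolved)

# 201 resolved out of SWE-bench Verified's 500 instances -> 40.2%
assert abs(pass_at_1([True] * 201 + [False] * 299) - 0.402) < 1e-9
```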

Check out:

  • Paper: arxiv.org/abs/2504.21798 (the arXiv version is largely the same, with some minor corrections; a v2 is coming soon)
  • Code
  • Website: https://swesmith.com

Models citing this paper: 3

Datasets citing this paper: 2

Collections including this paper: 3