SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents
Abstract
A novel automated pipeline extracts real-world, interactive software engineering tasks from GitHub to create SWE-rebench, a dataset for reinforcement learning of SWE agents and a continuously refreshed, contamination-free benchmark for their evaluation.
LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code, and adapt their behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the scarcity of fresh interactive SWE tasks complicates the evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination. To address these limitations, we introduce a novel, automated, and scalable pipeline to continuously extract real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use a continuous supply of fresh tasks collected with the SWE-rebench methodology to build a contamination-free benchmark for agentic software engineering. We compare the results of various LLMs on this benchmark to their results on SWE-bench Verified and show that the performance of some language models might be inflated due to contamination.
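To make the notion of an "interactive" SWE task above concrete, here is a minimal sketch of the agent-environment loop such tasks imply: the agent issues commands against a sandboxed repository, observes execution results, and adapts until the task's tests pass. The `Environment`, `Agent`, and `Observation` interfaces are hypothetical illustrations, not the SWE-rebench API; the actual harness, fields, and termination conditions may differ.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """Result of executing one command in the task environment (illustrative fields)."""
    stdout: str
    tests_passed: bool


class Environment:
    """Hypothetical sandboxed repository checkout that executes shell commands."""
    def step(self, command: str) -> Observation:
        raise NotImplementedError


class Agent:
    """Hypothetical LLM-backed policy mapping observations to the next shell command."""
    def act(self, observation: Observation) -> str:
        raise NotImplementedError


def rollout(agent: Agent, env: Environment, max_steps: int = 30) -> bool:
    """Run one episode: the agent edits/executes code and adapts to feedback."""
    obs = env.step("cat README.md")     # initial look at the repository
    for _ in range(max_steps):
        obs = env.step(agent.act(obs))  # execute the agent's next command
        if obs.tests_passed:            # task is solved when the hidden tests pass
            return True
    return False
```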
Community
Dataset: https://huggingface.co/datasets/nebius/SWE-rebench
SWE-rebench home page: https://swe-rebench.com
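As a quick orientation, the dataset linked above can be inspected with the Hugging Face `datasets` library. This is a minimal sketch: split names and field semantics are whatever the dataset card defines, so the code only prints what it finds.

```python
from datasets import load_dataset

# Load every available split of the public SWE-rebench dataset and print
# its size and column names; field descriptions are on the dataset card at
# https://huggingface.co/datasets/nebius/SWE-rebench.
dataset = load_dataset("nebius/SWE-rebench")
for split_name, split in dataset.items():
    print(split_name, len(split), split.column_names)
```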
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving (2025)
- SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs (2025)
- SWE-PolyBench: A multi-language benchmark for repository level evaluation of coding agents (2025)
- OmniGIRL: A Multilingual and Multimodal Benchmark for GitHub Issue Resolution (2025)
- SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development (2025)
- SWE-smith: Scaling Data for Software Engineering Agents (2025)
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute (2025)