SweRank: Software Issue Localization with Code Ranking
Abstract
Software issue localization, the task of identifying the precise code locations (files, classes, or functions) relevant to a natural language issue description (e.g., bug report, feature request), is a critical yet time-consuming aspect of software development. While recent LLM-based agentic approaches demonstrate promise, they often incur significant latency and cost due to complex multi-step reasoning and reliance on closed-source LLMs. Alternatively, traditional code ranking models, typically optimized for query-to-code or code-to-code retrieval, struggle with the verbose and failure-descriptive nature of issue localization queries. To bridge this gap, we introduce SweRank, an efficient and effective retrieve-and-rerank framework for software issue localization. To facilitate training, we construct SweLoc, a large-scale dataset curated from public GitHub repositories, featuring real-world issue descriptions paired with corresponding code modifications. Empirical results on SWE-Bench-Lite and LocBench show that SweRank achieves state-of-the-art performance, outperforming both prior ranking models and costly agent-based systems using closed-source LLMs like Claude-3.5. Further, we demonstrate SweLoc's utility in enhancing various existing retriever and reranker models for issue localization, establishing the dataset as a valuable resource for the community.
Community
Excited to announce SweRank, our code ranking framework for software issue localization.
Paper: https://bit.ly/3S0x1fV
GitHub Project Page: https://bit.ly/42SESm3
AI-Generated Podcast: https://bit.ly/3GMF51H
Code, Data and Models: Coming soon!
Pinpointing the exact location of a software issue in code is a critical but often time-consuming part of software development. Current agentic approaches to localization can be slow and expensive, relying on complex steps and often closed-source models.
We introduce SweRank, a retrieve-and-rerank framework that comprises SweRankEmbed, a bi-encoder code retriever, and SweRankLLM, a listwise LLM code reranker.
SweRank is substantially more cost-effective than agent-based approaches while also delivering stronger localization performance. Our 7B SweRankEmbed retriever even outperforms LocAgent running with Claude-3.5!
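To illustrate the two-stage design described above, here is a minimal, self-contained sketch of a retrieve-and-rerank pipeline. Note the assumptions: the bag-of-words "embedding" and the rerank scoring function are toy stand-ins, not the actual SweRankEmbed or SweRankLLM models, and the function names (`retrieve`, `rerank`, etc.) are hypothetical.

```python
# Toy retrieve-and-rerank sketch: stage 1 shortlists candidate functions
# with a cheap bi-encoder-style similarity; stage 2 reranks the shortlist.
# The embedding below is a bag-of-words Counter, NOT a trained model.
from collections import Counter
import math

def embed(text):
    """Toy bag-of-words 'embedding' standing in for a bi-encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(issue, functions, k=2):
    """Stage 1: score every function against the issue text, keep top-k."""
    q = embed(issue)
    scored = sorted(functions, key=lambda f: cosine(q, embed(f["code"])), reverse=True)
    return scored[:k]

def rerank(issue, candidates):
    """Stage 2: rescore only the shortlist (stand-in for a listwise LLM reranker,
    which would see all candidates jointly and is too costly to run corpus-wide)."""
    q = embed(issue)
    return sorted(candidates, key=lambda f: cosine(q, embed(f["name"] + " " + f["code"])), reverse=True)

functions = [
    {"name": "parse_config", "code": "def parse_config(path): ..."},
    {"name": "fix_null_pointer", "code": "def handle_none(value): return value or default"},
    {"name": "render_page", "code": "def render_page(ctx): ..."},
]
issue = "crash when value is None in config handling"
shortlist = retrieve(issue, functions, k=2)
ranked = rerank(issue, shortlist)
```

The split matters for cost: the cheap retriever scores the whole repository, while the expensive reranker only sees the small shortlist.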
This is an automated message from the Librarian Bot. I found the following similar papers, recommended by the Semantic Scholar API:
- Enhancing Repository-Level Software Repair via Repository-Aware Knowledge Graphs (2025)
- CoSIL: Software Issue Localization via LLM-Driven Code Repository Graph Searching (2025)
- SWE-Synth: Synthesizing Verifiable Bug-Fix Data to Enable Large Language Models in Resolving Real-World Bugs (2025)
- OpenCodeInstruct: A Large-scale Instruction Tuning Dataset for Code LLMs (2025)
- Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving (2025)
- RTLRepoCoder: Repository-Level RTL Code Completion through the Combination of Fine-Tuning and Retrieval Augmentation (2025)
- What to Retrieve for Effective Retrieval-Augmented Code Generation? An Empirical Study and Beyond (2025)