UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench • Paper • 2506.09289 • Published 28 days ago • 2
Dynamic Benchmarking of Reasoning Capabilities in Code Large Language Models Under Data Contamination • Paper • 2503.04149 • Published Mar 6 • 6
CPRet: A Dataset, Benchmark, and Model for Retrieval in Competitive Programming • Paper • 2505.12925 • Published May 19 • 2