GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities
Abstract
GitChameleon is a dataset for evaluating version-conditioned code generation by large language models, LLM-powered agents, code assistants, and RAG systems using execution-based tests.
The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent version updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems encounter significant challenges with this task, with enterprise models achieving baseline success rates in the 48-51% range, underscoring the intricacy of the problem. By offering an execution-based benchmark emphasizing the dynamic nature of code libraries, GitChameleon enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods. We make the dataset and evaluation code publicly available at https://github.com/mrcabbage972/GitChameleonBenchmark.
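To make the task format concrete, below is a hypothetical, GitChameleon-style problem sketch. The function, the pinned version, and the test are illustrative assumptions, not drawn from the dataset: the model must complete a function against a pinned library version and pass an executable unit test, here exploiting the fact that NumPy 2.0 removed the `np.product` alias in favor of `np.prod`.

```python
# Hypothetical version-conditioned problem (illustrative; not from the dataset).
# Constraint: the completion must run under numpy==2.0, where the legacy
# np.product alias was removed in favor of np.prod. A model trained mostly
# on older code may emit the removed alias and fail the executable test.

import numpy as np


def array_product(arr):
    """Return the product of all elements; valid under numpy>=2.0."""
    return np.prod(arr)  # np.product would raise AttributeError on numpy 2.0


# Executable unit test of the kind the benchmark runs against each completion
def test_array_product():
    assert array_product(np.array([1, 2, 3, 4])) == 24


if __name__ == "__main__":
    test_array_product()
    print("ok")
```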
Community
GitChameleon provides a novel version-conditioned code generation evaluation harness for LLMs. We demonstrate that all evaluated LLMs and AI code-assistance frameworks (agents, RAG systems, and CLI/IDE assistants) fail to reliably generate correct, simple, functional version-specific code for popular Python libraries, even though all of the targeted library versions lie within the models' training distribution.
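As a rough illustration of what execution-based evaluation involves, here is a minimal sketch of a pass/fail check; the function name and layout are assumptions, not the actual benchmark harness. The candidate completion is concatenated with its unit tests and run in a fresh interpreter inside an environment pinned to the required library version, with a zero exit code counting as a pass.

```python
# Minimal sketch of an execution-based pass/fail check (assumed structure,
# not the actual GitChameleon harness code).

import subprocess
import sys
import tempfile


def passes_tests(candidate_code: str, test_code: str, timeout: int = 30) -> bool:
    """Execute candidate + tests in a fresh interpreter; True if they pass."""
    # Write the completion and its unit tests to a temporary script.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        # Run in the current environment, assumed to have the pinned
        # library version installed; nonzero exit or timeout is a failure.
        result = subprocess.run(
            [sys.executable, path], capture_output=True, timeout=timeout
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```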