
Evaluating and Aligning CodeLLMs on Human Preference

Published on Dec 6 · Submitted by CSJianYang on Dec 11
#2 Paper of the day

Abstract

Code large language models (code LLMs) have made significant strides in code generation. Most previous code-related benchmarks, consisting of programming exercises with corresponding test cases, are used as a common measure of the performance and capabilities of code LLMs. However, current code LLMs focus on synthesizing correct code snippets while ignoring alignment with human preferences, where queries should be sampled from practical application scenarios and model-generated responses should satisfy human preference. To bridge the gap between model-generated responses and human preference, we present CodeArena, a rigorous human-curated benchmark that emulates the complexity and diversity of real-world coding tasks, with 397 high-quality samples spanning 40 categories and 44 programming languages, carefully curated from user queries. Further, we propose SynCode-Instruct, a diverse synthetic instruction corpus (nearly 20B tokens) built by scaling instructions from the web, to verify the effectiveness of large-scale synthetic instruction fine-tuning: Qwen2.5-SynCoder, trained entirely on synthetic instruction data, achieves top-tier performance among open-source code LLMs. The results reveal performance differences between execution-based benchmarks and CodeArena. Our systematic experiments with CodeArena on 40+ LLMs show a notable performance gap between open SOTA code LLMs (e.g., Qwen2.5-Coder) and proprietary LLMs (e.g., OpenAI o1), underscoring the importance of human preference alignment. https://codearenaeval.github.io/
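
This kind of preference-oriented evaluation is usually run with an LLM judge rather than unit tests. Below is a minimal, hedged sketch of such a pairwise comparison; the judge model, prompt wording, and scoring rule are illustrative assumptions, not the paper's exact protocol.

```python
# Hedged sketch of LLM-as-judge pairwise comparison for human-preference-style
# evaluation. The judge model, prompt, and scoring rule are assumptions, not the
# protocol used in the paper.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge_pair(question: str, answer_a: str, answer_b: str) -> str:
    """Ask a judge model which answer better satisfies the user's request."""
    prompt = (
        "You are judging two answers to a programming question.\n\n"
        f"Question:\n{question}\n\nAnswer A:\n{answer_a}\n\nAnswer B:\n{answer_b}\n\n"
        "Considering correctness, helpfulness, and how well each answer matches "
        "what the user actually asked for, reply with exactly 'A', 'B', or 'tie'."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# Example: compare two candidate answers to one query.
verdict = judge_pair(
    "Write a Python function that checks whether a string is a palindrome.",
    "def is_pal(s): return s == s[::-1]",
    "Use a loop to compare characters from both ends until they meet.",
)
print(verdict)
```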

Community

Paper submitter

🌟 CodeArena: A Benchmark for Optimizing Code Generation and Enhancing User Experience 🚀
As developers' reliable assistants, CodeLLMs must generate code that not only meets technical requirements but also delivers an intuitive developer experience.
To this end, this paper introduces CodeArena, currently the most comprehensive benchmark for evaluating CodeLLMs' alignment with human preferences, and SynCode-Instruct, a high-quality, large-scale synthetic code-text instruction corpus, marking a major step forward in user-experience-oriented code generation.

CodeArena's Major Achievements: 🏆

Real-world Challenges ⚔: CodeArena contains 397 high-quality samples carefully selected from actual user queries through rigorous manual annotation and quality control, covering 40 task scenarios and 44 common programming languages. Compared with other benchmarks, it features a more diverse problem distribution and more complex real-world scenarios. 39 LLMs have been systematically evaluated on it.

Large-scale Corpus 📘: Highly relevant code-text pairs were collected from code-related websites, and Qwen2.5-72B was used to rewrite them into better code or text snippets; these were then filtered with a code sandbox and screened by large-model scoring, ultimately yielding SynCode-Instruct, roughly 20B tokens of learning material (a sketch of such a filtering step follows after this list).

User Preference Oriented 💡: SynCoder is obtained by fine-tuning Qwen2.5-Coder-32B on the SynCode-Instruct corpus. The two-stage training process highlights the significant improvement that high-quality data brings to models, ultimately narrowing the large performance gap between open-source and closed-source models on both traditional programming tasks and CodeArena (see the data-formatting sketch below).
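
A minimal sketch of what the sandbox-filtering step mentioned above might look like; the paper's actual sandbox and scoring pipeline are not described here, so this is only an illustration:

```python
# Hedged sketch of sandbox filtering: execute each candidate snippet in a
# subprocess with a timeout and keep only those that run cleanly. The real
# pipeline also involves LLM-based scoring, which is omitted here.
import subprocess
import sys

def runs_cleanly(snippet: str, timeout: float = 5.0) -> bool:
    """Return True if the Python snippet exits with status 0 within the timeout."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", snippet],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

candidates = ["print(sum(range(10)))", "import nonexistent_module"]
kept = [c for c in candidates if runs_cleanly(c)]
print(kept)  # only the first snippet survives the filter
```

And a hedged sketch of preparing instruction/response pairs for supervised fine-tuning with a chat template; the field names and formatting details are assumptions, not the paper's recipe:

```python
# Hedged sketch: render an instruction/response pair into training text with the
# model's chat template before SFT. Field names are assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")

example = {
    "instruction": "Write a Python function that reverses a string.",
    "response": "def reverse(s: str) -> str:\n    return s[::-1]",
}
text = tokenizer.apply_chat_template(
    [
        {"role": "user", "content": example["instruction"]},
        {"role": "assistant", "content": example["response"]},
    ],
    tokenize=False,
)
print(text)  # the rendered string that would be fed to the SFT trainer
```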

Github: https://github.com/QwenLM/Qwen2.5-Coder/tree/main/qwencoder-eval/instruct/CodeArena
arxiv: https://arxiv.org/abs/2412.05210
hf: https://huggingface.co/datasets/CSJianYang/CodeArena
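
For reference, a minimal sketch of pulling the benchmark from the Hugging Face dataset linked above; the split and column names are not documented here, so inspect the returned object before relying on specific fields:

```python
# Hedged sketch: load the CodeArena benchmark from the Hub and inspect it.
from datasets import load_dataset

code_arena = load_dataset("CSJianYang/CodeArena")
print(code_arena)                        # shows the available splits and columns
first_split = next(iter(code_arena.values()))
print(len(first_split), first_split[0])  # the paper reports 397 curated samples
```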
