CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following
Abstract
CMI-Bench is a comprehensive instruction-following benchmark that evaluates audio-text LLMs on a diverse range of music information retrieval tasks.
Recent advances in audio-text large language models (LLMs) have opened new possibilities for music understanding and generation. However, existing benchmarks are limited in scope, often relying on simplified tasks or multiple-choice evaluations that fail to reflect the complexity of real-world music analysis. We reinterpret a broad range of traditional MIR annotations as instruction-following formats and introduce CMI-Bench, a comprehensive music instruction-following benchmark designed to evaluate audio-text LLMs on a diverse set of music information retrieval (MIR) tasks. These include genre classification, emotion regression, emotion tagging, instrument classification, pitch estimation, key detection, lyrics transcription, melody extraction, vocal technique recognition, instrument performance technique detection, music tagging, music captioning, and (down)beat tracking, reflecting core challenges in MIR research. Unlike previous benchmarks, CMI-Bench adopts standardized evaluation metrics consistent with previous state-of-the-art MIR models, ensuring direct comparability with supervised approaches. We provide an evaluation toolkit supporting all open-source audio-textual LLMs, including LTU, Qwen-Audio, SALMONN, MusiLingo, etc. Experimental results reveal significant performance gaps between LLMs and supervised models, along with cultural, chronological, and gender biases, highlighting the potential and limitations of current models in addressing MIR tasks. CMI-Bench establishes a unified foundation for evaluating music instruction following, driving progress in music-aware LLMs.
Community
📄 arXiv, accepted at the 26th International Society for Music Information Retrieval Conference (ISMIR 2025) 🎉
🖥️GitHub
🤗 Dataset, test-set audio released under the CC BY-NC-SA 4.0 license
- Comprehensive Task Coverage: CMI-Bench includes 14 diverse music information retrieval (MIR) tasks, moving beyond simple classification to include regression, captioning, and complex sequential tasks.
- Standardised Evaluation: Unlike previous benchmarks that rely on multiple-choice questions, CMI-Bench employs open-ended, task-specific metrics aligned with the MIR literature (e.g., using mir_eval), allowing for direct comparison with traditional supervised models.
- Evaluation Toolkit: We provide a full evaluation toolkit that supports all major open-source audio-textual LLMs, enabling standardised and reproducible benchmarking.
- In-depth Analysis: The benchmark facilitates a deeper analysis of model capabilities, including generalisation, prompt sensitivity, and biases related to culture and gender.
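To make the "standardised evaluation" point concrete, here is a minimal, pure-Python sketch of one such task-specific metric: the beat-tracking F-measure with a ±70 ms tolerance window, which toolkits like mir_eval compute. The beat times below are hypothetical; this is an illustration of the metric, not CMI-Bench's actual implementation (the benchmark uses mir_eval itself).

```python
def beat_f_measure(reference, estimated, window=0.07):
    """F-measure between reference and estimated beat times (in seconds).

    Each estimated beat greedily matches at most one unmatched reference
    beat within +/- `window` seconds (70 ms is the standard tolerance).
    """
    if not reference or not estimated:
        return 0.0
    matched = 0
    used = set()  # indices of reference beats already matched
    for est in estimated:
        for i, ref in enumerate(reference):
            if i not in used and abs(est - ref) <= window:
                matched += 1
                used.add(i)
                break
    precision = matched / len(estimated)
    recall = matched / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical ground-truth vs. model-predicted beat times (seconds)
ref = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0]
est = [0.51, 1.02, 1.49, 2.0, 2.52]  # last beat missed

print(round(beat_f_measure(ref, est), 3))  # → 0.909
```

Because an LLM emits beat times as free text, the toolkit must first parse the model's output into a numeric sequence before scoring it this way, which is exactly what an open-ended (rather than multiple-choice) protocol requires.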
Yinghao Ma is a research student at the UKRI Centre for Doctoral Training in Artificial Intelligence and Music, supported by UK Research and Innovation [grant number EP/S022694/1]. Emmanouil Benetos is supported by a RAEng/Leverhulme Trust Research Fellowship [grant number LTRF2223-19-106].
Yinghao Ma would also like to express heartfelt gratitude to the Student Philharmonic Chinese Orchestra at the Chinese Music Institute, Peking University (abbreviated as CMI, unrelated to the paper title), and warmly celebrates the orchestra's 20th anniversary.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix (2025)
- X-ARES: A Comprehensive Framework for Assessing Audio Encoder Performance (2025)
- From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data (2025)
- IFEval-Audio: Benchmarking Instruction-Following Capability in Audio-based Large Language Models (2025)
- TESU-LLM: Training Speech-LLMs Without Speech via Unified Encoder Alignment (2025)
- U-SAM: An audio language Model for Unified Speech, Audio, and Music Understanding (2025)
- Kimi-Audio Technical Report (2025)