---
license: mit
---

# Audio-Reasoner

We implemented inference scaling on **Audio-Reasoner**, a large audio language model, enabling **deepthink** and **structured chain-of-thought (CoT) reasoning** for multimodal understanding and reasoning. To achieve this, we constructed CoTA, a high-quality dataset of **1.2M reasoning-rich samples** built with structured CoT techniques. Audio-Reasoner achieves state-of-the-art results on the **MMAU-mini (+25.42%)** and **AIR-Bench-Chat (+14.57%)** benchmarks.
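Concretely, the model is prompted to answer in a structured format: a thinking block with four stages, followed by the final reply. The sketch below is illustrative only, not verbatim model output; the four stage names come from the FAQ, and the tag layout follows the system prompt shown in Quick Start:

```text
<THINK>
  <PLANNING>  decide how to approach the question      </PLANNING>
  <CAPTION>   describe what is audible in the clip     </CAPTION>
  <REASONING> reason over the caption toward an answer </REASONING>
  <SUMMARY>   condense the reasoning into a conclusion </SUMMARY>
</THINK>
<RESPONSE> final answer returned to the user </RESPONSE>
```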

Audio-Reasoner-7B 🤗 | CoTA Dataset 🤗 (coming soon)
Paper 📑 | Wechat 💭 | Code ⚙️
Demo | Install | Quick Start | FAQ | Contact us

If you like our work, please give us a star ⭐!

## Main Results

## News and Updates
- **2025.03.05:** ✅ **The Audio-Reasoner-7B checkpoint is released on HuggingFace 🤗!**
- **2025.03.05:** ✅ **The Audio-Reasoner paper is uploaded to arXiv 📑.**
- **2025.03.04:** ✅ **Demos, inference code, and evaluation results have been released.**
- **2025.03.04:** ✅ **Created this repo.**

## Roadmap
- **2025.03:** 🔜 **Upload the CoTA dataset to HuggingFace 🤗.**
- **2025.04:** 🔜 **Open-source the data synthesis pipeline and training code.**

## Demo

## Features
✅ Audio-Reasoner enables **deep reasoning and inference scaling** in audio-based tasks, built on Qwen2-Audio-Instruct with structured CoT training.

✅ CoTA offers **1.2M** high-quality captions and QA pairs across domains for structured reasoning and enhanced pretraining.

✅ The pretrained model and dataset cover various types of audio, including sound, music, and speech, and achieve state-of-the-art results across multiple benchmarks. Refer to our paper for details.

## Install
**Clone and install**

- Clone the repo

```sh
git clone https://github.com/xzf-thu/Audio-Reasoner.git
cd Audio-Reasoner
```

- Install the required packages

```sh
conda create -n Audio-Reasoner python=3.10
conda activate Audio-Reasoner
pip install -r requirements.txt
pip install transformers==4.49.1
```

## Quick Start
**Chat using ms-swift**

```python
from swift.llm import InferEngine, InferRequest, PtEngine, RequestConfig
from swift.plugin import InferStats


def infer_stream(engine: 'InferEngine', infer_request: 'InferRequest'):
    request_config = RequestConfig(max_tokens=2048, temperature=0, stream=True)
    metric = InferStats()
    gen = engine.infer([infer_request], request_config, metrics=[metric])
    # The last message is the user turn; messages[0] holds the system prompt.
    query = infer_request.messages[-1]['content']
    output = ""
    print(f'query: {query}\nresponse: ', end='')
    for resp_list in gen:
        if resp_list[0] is None:
            continue
        print(resp_list[0].choices[0].delta.content, end='', flush=True)
        output += resp_list[0].choices[0].delta.content
    print()
    print(f'metric: {metric.compute()}')
    return output


def get_message(audiopath, prompt):
    messages = [
        {'role': 'system', 'content': system},
        {
            'role': 'user',
            'content': [{
                'type': 'audio',
                'audio': audiopath
            }, {
                'type': 'text',
                'text': prompt
            }]
        }]
    return messages


system = ('You are an audio deep-thinking model. Upon receiving a question, '
          'please respond in two parts: <THINK> and <RESPONSE>. The <THINK> '
          'section should be further divided into four parts: <PLANNING>, '
          '<CAPTION>, <REASONING>, and <SUMMARY>.')

infer_backend = 'pt'
model = 'qwen2_audio'
last_model_checkpoint = ""  # Please replace with the path to the checkpoint
engine = PtEngine(last_model_checkpoint, max_batch_size=64, model_type=model)


def audioreasoner_gen(audiopath, prompt):
    return infer_stream(engine, InferRequest(messages=get_message(audiopath, prompt)))


def main():
    # Please replace with your test audio
    audiopath = "assets/test.wav"
    # Please replace with your question about the test audio
    prompt = "Which of the following best describes the rhythmic feel and time signature of the song?"
    audioreasoner_gen(audiopath, prompt)


if __name__ == '__main__':
    main()
```

**Local test**

```sh
conda activate Audio-Reasoner
cd Audio-Reasoner
# Run the preset audio samples and questions
python inference.py
```

## FAQ
**1. What kinds of audio can Audio-Reasoner understand, and what kind of thinking does it perform?**

Audio-Reasoner can understand various types of audio, including sound, music, and speech. It conducts in-depth thinking in four parts: **planning, caption, reasoning, and summary**.

**2. Why is `transformers` installed after `ms-swift` in the environment configuration?**

The version of transformers has a significant impact on model performance; in our tests, `transformers==4.49.1` is one of the suitable versions. Installing ms-swift first helps provide a stable environment for the subsequent installation of transformers, avoiding potential version conflicts that could affect the model's performance.
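If you want to post-process a reply programmatically, the four thinking stages described in the FAQ can be split out of the raw output string. The snippet below is a minimal sketch, not part of the released code: it assumes each section is wrapped in matching `<TAG>...</TAG>` pairs as requested by the system prompt in Quick Start, and the helper name `parse_cot` is hypothetical.

```python
import re

# Hypothetical helper: splits an Audio-Reasoner reply into its structured
# sections, assuming matching <TAG>...</TAG> pairs as asked for by the
# system prompt above. Adjust the tag names if your checkpoint differs.
def parse_cot(response: str) -> dict:
    sections = {}
    for tag in ("PLANNING", "CAPTION", "REASONING", "SUMMARY", "RESPONSE"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", response, re.DOTALL)
        if match:
            sections[tag.lower()] = match.group(1).strip()
    return sections

# Example with a mock reply; in practice, pass the string returned by
# audioreasoner_gen() from the Quick Start script.
mock = ("<THINK><PLANNING>p</PLANNING><CAPTION>c</CAPTION>"
        "<REASONING>r</REASONING><SUMMARY>s</SUMMARY></THINK>"
        "<RESPONSE>The song is in 3/4 time.</RESPONSE>")
print(parse_cot(mock)["response"])  # -> The song is in 3/4 time.
```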
## Contact
If you have any questions, please feel free to contact us via `zhifei001@e.ntu.edu.sg`.

## Citation
Please cite our paper if you find our model and dataset useful. Thanks!

```
@misc{xie2025audioreasonerimprovingreasoningcapability,
      title={Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models},
      author={Zhifei Xie and Mingbao Lin and Zihang Liu and Pengcheng Wu and Shuicheng Yan and Chunyan Miao},
      year={2025},
      eprint={2503.02318},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2503.02318},
}
```