|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
- zh |
|
- th |
|
- id |
|
- vi |
|
pipeline_tag: audio-text-to-text |
|
tags: |
|
- multimodal |
|
- audio-language-model |
|
- audio |
|
base_model: |
|
- mispeech/dasheng-0.6B |
|
- Qwen/Qwen2.5-Omni-7B |
|
base_model_relation: finetune |
|
--- |
|
|
|
<div align="center"> |
|
<h1> |
|
MiDashengLM |
|
</h1> |
|
<b><em>Efficient audio understanding with general audio captions</em></b>
|
<p> |
|
</p> |
|
<a href="https://arxiv.org/abs/2508.03983"><img src="https://img.shields.io/badge/arXiv-2508.03983-b31b1b" alt="version"></a> |
|
<a href="https://github.com/xiaomi-research/dasheng-lm"><img src="https://img.shields.io/badge/Homepage-GitHub-0366d6" alt="version"></a> |
|
<a href="https://modelscope.cn/models/midasheng/midashenglm-7b"><img src="https://img.shields.io/badge/ModelScope-7B-7448ce" alt="version"></a> |
|
<a href="https://modelscope.cn/studios/midasheng/MiDashengLM-7B"><img src="https://img.shields.io/badge/Demo-Gradio-ffcc66" alt="version"></a> |
|
<a href="https://xiaomi-research.github.io/dasheng-lm/"><img src="https://img.shields.io/badge/Demo-Page-0366d6" alt="version"></a> |
|
</div> |
|
|
|
## 🔥 Key Highlights
|
|
|
**State-of-the-Art Performance** |
|
- Outperforms Qwen2.5-Omni-7B and Kimi-Audio-Instruct-7B on **multiple key audio understanding tasks**.
|
|
|
**High Efficiency** |
|
- **3.2×** throughput speedup at comparable batch sizes compared to Qwen2.5-Omni-7B.

- **20×** throughput speedup by further increasing the batch size: we tested up to a **batch size of 512** for 30s audio inputs on 80GB GPUs, while the baseline only supports a batch size of 8.

- Time-to-first-token (TTFT) speedup of up to **4×** compared to Qwen2.5-Omni-7B.
|
|
|
**Caption-based Alignment** |
|
- Trained with **general audio captions** (instead of ASR transcripts) to achieve holistic audio understanding. |
|
|
|
**Full Transparency** |
|
- **Publicly available** training data and a reproducible training pipeline.
|
- Apache License 2.0 for **both research and commercial use**. |
|
|
|
<div align="center"> |
|
<img src="fig/capabilities_plot_7b-1.png" width="600"> |
|
</div> |
|
|
|
## Acknowledgment and Model Foundation |
|
|
|
Although MiDashengLM demonstrates superior audio understanding performance and efficiency compared to Qwen2.5-Omni models, |
|
we acknowledge **Qwen2.5-Omni as a remarkable and respected foundational work** in the field. |
|
Our model specifically uses [Qwen2.5-Omni-7B Thinker](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) as the initialization for decoder training, building upon its robust architecture and weight initialization. |
|
|
|
The audio encoder is built upon [Dasheng](https://github.com/XiaoMi/dasheng), an open-source audio encoder for general audio understanding with state-of-the-art performance. |
|
**Dasheng serves as the core foundation enabling MiDashengLM's exceptional performance**. |
|
|
|
## Framework |
|
|
|
MiDashengLM integrates the powerful Dasheng audio encoder with |
|
the Qwen2.5-Omni-7B Thinker decoder through a unique caption-based alignment strategy. |
|
Unlike conventional ASR-driven approaches, |
|
our model leverages general audio captions to capture comprehensive audio representations encompassing speech, environmental sounds, and musical elements |
|
in a unified textual format. This design enables holistic audio understanding while maintaining exceptional computational efficiency. |
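
The data flow can be pictured as an encoder-projector-decoder stack. The sketch below is schematic only; the module stand-ins and dimensions are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class EncoderProjectorDecoder(nn.Module):
    """Schematic only: audio encoder -> projector -> autoregressive decoder."""

    def __init__(self, d_audio: int = 768, d_text: int = 3584):
        super().__init__()
        self.audio_encoder = nn.Identity()           # stand-in for Dasheng
        self.projector = nn.Linear(d_audio, d_text)  # audio features -> decoder embedding space
        self.decoder = nn.Identity()                 # stand-in for the Qwen2.5-Omni Thinker

    def forward(self, audio_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # Projected audio tokens are consumed alongside the text prompt embeddings.
        audio_tokens = self.projector(self.audio_encoder(audio_feats))
        return self.decoder(torch.cat([audio_tokens, text_embeds], dim=1))
```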
|
|
|
<img src="fig/Framework-1.png" width="800"> |
|
|
|
### Why Captions Instead of ASR? |
|
|
|
ASR Limitations:

- Discards a huge amount of non-speech audio (music, environmental sounds).

- Misses paralinguistic information (speaker emotion, acoustic properties).

- Monotonic alignment provides only a trivial learning signal.



Caption Advantages:

- Utilizes all audio content.

- Captures global audio context.

- Non-monotonic alignment provides a harder, more informative learning signal.
|
|
|
### Novel Open Source Dataset for Training: ACAVCaps |
|
|
|
ACAVCaps is a meticulously curated 38,662-hour collection of general audio captions derived from the open-source [ACAV100M audio repository](https://acav100m.github.io/). |
|
While leveraging ACAV100M's extensive raw audio materials, we completely re-engineered the annotation process to create a dataset for holistic audio understanding. |
|
We divide the dataset into six categories:
|
|
|
| Category | Example Caption |
|----------|-----------------|
| Pure Speech | "A female voice narrates historical competition with synthetic modulation" |
| Pure Sound | "Outdoor scene with wind, birds, duck quacking and background noise" |
| Pure Music | "Crowd cheering with electronic synthesizer-driven soundscape" |
| Mixed Music | "The audio features a crowd cheering and clapping alongside electronic music with a synthesizer-driven, dark, and energetic soundscape." |
| Mixed Speech | "A Russian voice demonstrates a synthesizer's capabilities over an experimental electronic backdrop, explaining its sound design and value in a gritty, vocal-fry tone." |
| Mixed Sound | "A man speaks in English about entering a city and village, accompanied by the sounds of a running vehicle." |
|
|
|
The figure below illustrates our data curation pipeline for ACAVCaps: |
|
|
|
<img src="fig/acavcaps-1.png" width="800"> |
|
|
|
Each caption is generated through a three-step process: |
|
|
|
1. **Multi-expert analysis** (speech, vocal, music, acoustics) |
|
2. **LLM reasoning** synthesizing metadata with [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1) |
|
3. **Filtering** for audio-text consistency with [Dasheng-GLAP](https://github.com/xiaomi-research/dasheng-glap) |
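
The consistency filter in step 3 can be pictured as a cosine-similarity threshold over paired audio and caption embeddings. A minimal sketch of that idea (the embedding inputs and the 0.5 threshold are illustrative assumptions, not the actual Dasheng-GLAP API):

```python
import numpy as np

def filter_consistent(audio_embs: np.ndarray, text_embs: np.ndarray,
                      threshold: float = 0.5) -> np.ndarray:
    """Boolean mask over (N, D) embedding pairs: keep captions whose
    cosine similarity with their audio clears the (illustrative) threshold."""
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return (a * t).sum(axis=1) >= threshold
```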
|
|
|
We will **release the ACAVCaps dataset** after the ICASSP 2026 review process. |
|
|
|
## Usage |
|
|
|
### Load Model |
|
|
|
```python |
|
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer |
|
|
|
model_id = "mispeech/midashenglm-7b" |
|
|
|
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True) |
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True) |
|
``` |
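
If a GPU is available, the weights can also be loaded in half precision for faster inference. A minimal sketch (`bfloat16` assumes a bf16-capable device; the tokenized inputs must then be moved to the same device, as shown in the generation step below):

```python
import torch

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision; assumes a bf16-capable GPU
    trust_remote_code=True,
).to("cuda").eval()
```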
|
|
|
### Construct Prompt |
|
|
|
```python |
|
user_prompt = "Caption the audio." # You may try any other prompt |
|
|
|
messages = [ |
|
{ |
|
"role": "system", |
|
"content": [ |
|
{"type": "text", "text": "You are a helpful language and speech assistant."} |
|
], |
|
}, |
|
{ |
|
"role": "user", |
|
"content": [ |
|
{"type": "text", "text": user_prompt}, |
|
{ |
|
"type": "audio", |
|
"path": "/path/to/example.wav", |
|
# or "url": "https://example.com/example.wav" |
|
# or "audio": np.random.randn(16000) |
|
}, |
|
], |
|
}, |
|
] |
|
``` |
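
To pass a raw array instead of a path or URL, load the waveform yourself first. A minimal sketch with `soundfile` (one loader among many; the 16 kHz expectation is an assumption based on the 16000-sample example above):

```python
import soundfile as sf

waveform, sample_rate = sf.read("/path/to/example.wav")  # float waveform + sample rate
# Pass it via {"type": "audio", "audio": waveform} in the user message above.
```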
|
|
|
### Generate Output |
|
|
|
```python |
|
import torch |
|
|
|
with torch.no_grad(): |
|
model_inputs = processor.apply_chat_template( |
|
messages, |
|
tokenize=True, |
|
add_generation_prompt=True, |
|
add_special_tokens=True, |
|
return_dict=True, |
|
) |
|
generation = model.generate(**model_inputs) |
|
output = tokenizer.batch_decode(generation, skip_special_tokens=True) # ["An engine is idling."] |
|
``` |
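
If the model was moved to a GPU (see the loading sketch above), send the tokenized inputs to the same device before generating. A small variation on the snippet above (`max_new_tokens=128` is an arbitrary cap, not a required setting):

```python
with torch.no_grad():
    model_inputs = {k: (v.to(model.device) if hasattr(v, "to") else v)
                    for k, v in model_inputs.items()}
    generation = model.generate(**model_inputs, max_new_tokens=128)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)
```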
|
|
|
## Results |
|
|
|
MiDashengLM delivers solid performance across diverse audio understanding tasks. |
|
|
|
### Audio Captioning Results |
|
|
|
| Domain | Dataset | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:--------:|:--------------:|:--------------:|:----------------:|:-------------------:|
| Music | MusicCaps | **59.71** | 43.71 | 35.43 |
| Music | Songdescriber | **45.39** | 45.31 | 44.63 |
| Sound | AudioCaps | **62.18** | 60.79 | 49.00 |
| Sound | ClothoV2 | **49.20** | 47.55 | 48.01 |
| Sound | AutoACD | **66.52** | 55.93 | 44.76 |
|
|
|
*Metrics: FENSE (higher is better).* |
|
|
|
### Audio and Paralinguistic Classification |
|
|
|
| Dataset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:----------------:|:------:|:--------------:|:----------------:|:------------------:|
| VoxCeleb1 | ACC↑ | **92.36** | 59.71 | 82.72 |
| VoxLingua107 | ACC↑ | **93.41** | 51.03 | 73.65 |
| VoxCeleb-Gender | ACC↑ | 96.12 | **99.82** | 99.69 |
| VGGSound | ACC↑ | **52.11** | 0.97 | 2.20 |
| CochlScene | ACC↑ | **74.06** | 23.88 | 18.34 |
| NSynth | ACC↑ | **80.52** | 60.45 | 38.09 |
| FMA | ACC↑ | 63.73 | **66.77** | 27.91 |
| FSDKaggle2018 | ACC↑ | **75.25** | 31.38 | 24.75 |
| AudioSet | mAP↑ | **8.86** | 6.48 | 3.47 |
| FSD50K | mAP↑ | **37.58** | 23.87 | 27.23 |
|
|
|
### ASR Performance |
|
|
|
| Dataset | Language | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:------------------:|:-----------:|:--------------:|:------------:|:-------------------:|
| LibriSpeech test-clean | English | 3.7 | 1.7 | **1.3** |
| LibriSpeech test-other | English | 6.2 | 3.4 | **2.4** |
| People's Speech | English | 27.8 | 28.6 | **22.3** |
| AISHELL2 Mic | Chinese | 3.2 | **2.5** | 2.7 |
| AISHELL2 iOS | Chinese | 2.9 | **2.6** | **2.6** |
| AISHELL2 Android | Chinese | 3.1 | 2.7 | **2.6** |
| GigaSpeech2 | Indonesian | **20.8** | 21.2 | >100 |
| GigaSpeech2 | Thai | **36.9** | 53.8 | >100 |
| GigaSpeech2 | Vietnamese | **18.1** | 18.6 | >100 |
|
|
|
*Metrics: WER/CER (lower is better).* |
|
|
|
### Question Answering Results |
|
|
|
| Dataset | Subset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:------------:|:-------:|:------:|:--------------:|:----------------:|:-------------------:|
| MuChoMusic | | ACC↑ | **71.35** | 64.79 | 67.40 |
| MMAU | Sound | ACC↑ | 68.47 | 67.87 | **74.17** |
| MMAU | Music | ACC↑ | 66.77 | **69.16** | 61.08 |
| MMAU | Speech | ACC↑ | **63.66** | 59.76 | 57.66 |
| MMAU | Average | ACC↑ | **66.30** | 65.60 | 64.30 |
| MusicQA | | FENSE↑ | **62.35** | 60.60 | 40.00 |
| AudioCaps-QA | | FENSE↑ | **54.31** | 53.28 | 47.34 |
|
|
|
*Metrics: Higher is better.* |
|
|
|
### Reproduction Instructions |
|
|
|
To reproduce our results, we provide: |
|
|
|
- Prompts ([prompt.csv](evaluate/prompt.csv)) |
|
- Evaluation scripts |
|
- Example JSONL files |
|
|
|
#### 1. Install Dependencies for Evaluation (not needed for inference)
|
|
|
```bash |
|
pip install -r requirements.txt |
|
``` |
|
|
|
#### 2. Generate Model Outputs |
|
|
|
Generate responses using the model's official framework with prompts from [prompt.csv](evaluate/prompt.csv). |
|
|
|
#### 3. Convert Outputs to JSONL Format |
|
|
|
Format model outputs using the [example JSONL](evaluate/jsonl) files: |
|
|
|
| Task | Example File |
|------|--------------|
| Automatic Speech Recognition | [MiDashengLM_LibriSpeech_test-clean.jsonl](evaluate/jsonl/MiDashengLM_LibriSpeech_test-clean.jsonl) |
| Single-target Audio Tagging | [MiDashengLM_NSynth.jsonl](evaluate/jsonl/MiDashengLM_NSynth.jsonl) |
| Gender Recognition | [MiDashengLM_VoxCeleb-Gender.jsonl](evaluate/jsonl/MiDashengLM_VoxCeleb-Gender.jsonl) |
| Multi-target Audio Tagging | [MiDashengLM_FSD50K.jsonl](evaluate/jsonl/MiDashengLM_FSD50K.jsonl) |
| Audio Captioning | [MiDashengLM_AutoACD.jsonl](evaluate/jsonl/MiDashengLM_AutoACD.jsonl) |
| Open Audio Question Answering | [MiDashengLM_MusicQA.jsonl](evaluate/jsonl/MiDashengLM_MusicQA.jsonl) |
| Audio QA with Options | [MiDashengLM_MuChoMusic.jsonl](evaluate/jsonl/MiDashengLM_MuChoMusic.jsonl) |
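
Each JSONL line is one JSON object whose keys match the fields listed in the `Uses:` comments of step 4. A minimal writing sketch for the ASR case (field values here are placeholders):

```python
import json

# One record per test utterance; keys follow the WER script's "Uses:" list.
record = {
    "lang": "en",                           # language code
    "text": "reference transcription",      # ground-truth text
    "model_output": "model transcription",  # response generated in step 2
}

with open("MiDashengLM_LibriSpeech_test-clean.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```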
|
|
|
#### 4. Evaluate Results |
|
|
|
Execute the corresponding evaluation scripts: |
|
|
|
```bash |
|
# Automatic Speech Recognition (WER) |
|
# Uses: lang, text, model_output |
|
python evaluate/wer/compute_wer.py -i evaluate/jsonl/MiDashengLM_LibriSpeech_test-clean.jsonl |
|
|
|
# Single-target Audio Tagging (ACC) |
|
# Uses: label, model_output |
|
python evaluate/compute_at_acc.py -i evaluate/jsonl/MiDashengLM_NSynth.jsonl |
|
|
|
# Gender Recognition (ACC) |
|
# Uses: label, model_output |
|
python evaluate/compute_gender_acc.py -i evaluate/jsonl/MiDashengLM_VoxCeleb-Gender.jsonl |
|
|
|
# Multi-target Audio Tagging (mAP) |
|
# Uses: dataset_name, label, model_output, model_name |
|
python evaluate/compute_map.py -i evaluate/jsonl/MiDashengLM_FSD50K.jsonl |
|
|
|
# Audio Captioning (FENSE) |
|
# Uses: audio, text, model_output |
|
python evaluate/compute_fense.py -i evaluate/jsonl/MiDashengLM_AutoACD.jsonl |
|
|
|
# Open Audio QA (FENSE) |
|
# Uses: audio, answer, model_output |
|
python evaluate/compute_fense.py -i evaluate/jsonl/MiDashengLM_MusicQA.jsonl |
|
|
|
# Audio QA with Options (ACC) |
|
# Uses: answer, model_output |
|
python evaluate/compute_qa_acc.py -i evaluate/jsonl/MiDashengLM_MuChoMusic.jsonl |
|
``` |
|
|
|
#### 5. Evaluate on MECAT and MMAU benchmarks |
|
|
|
Please refer to the official repositories for evaluation on the [MECAT](https://github.com/xiaomi-research/mecat) |
|
and [MMAU](https://github.com/Sakshi113/mmau) benchmarks. |
|
|
|
## Efficiency |
|
|
|
MiDashengLM demonstrates superior inference efficiency compared to Qwen2.5-Omni-7B,

achieving a 3.2× speedup at comparable batch sizes and an overall potential speedup of 20.2× with larger batches.
|
|
|
<img src="fig/batchsize_1_comparison_7b-1.png" width="800"> |
|
|
|
| Batch Size | MiDashengLM (samples/s) | Qwen2.5-Omni-7B (samples/s) | Speedup |
|:----------:|:-----------------------:|:----------------------------:|:-------:|
| 1 | 0.45 | 0.36 | 1.25× |
| 4 | 1.40 | 0.91 | 1.53× |
| 8 | 2.72 | 1.15 | 2.36× |
| 16 | 5.18 | OOM | - |
| 32 | 9.78 | OOM | - |
| 64 | 17.07 | OOM | - |
| 128 | 22.73 | OOM | - |
| 200 | 25.15 | OOM | - |
|
|
|
*Tested on 80GB GPU with 30s audio, 100-token output.* |
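
Throughput here is samples per second for one batched `generate()` call. A minimal timing sketch of how such a number can be measured (not our benchmark harness; it assumes the model and `model_inputs` already sit on a CUDA device and that `model_inputs` holds a batch of `batch_size` samples):

```python
import time

import torch

def throughput(model, model_inputs, batch_size: int, max_new_tokens: int = 100) -> float:
    """Rough samples/s estimate from a single timed generate() call."""
    torch.cuda.synchronize()  # make sure pending GPU work does not skew the timer
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**model_inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()  # wait for generation to actually finish
    return batch_size / (time.perf_counter() - start)
```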
|
|
|
## Training Data |
|
|
|
MiDashengLM is trained exclusively on publicly available datasets across five categories: Speech, Sound and General Audio, Speech and Paralinguistic, Music, and Question Answering. All datasets are listed below with their respective tasks, lengths, and supervised fine-tuning (SFT) usage. |
|
|
|
<img src="fig/pretraining_sampling_rates-1.png" width="1200"> |
|
|
|
### Speech Training Data |
|
|
|
This table lists speech-related datasets used for tasks like Automatic Speech Recognition (ASR), keyword spotting (KWS), and speech-to-text translation (S2TT). |
|
The column "SFT?" indicates whether the dataset is used for supervised fine-tuning.
|
|
|
| Data | Task | Length(h) | SFT? |
|:----------------------:|:---------:|:---------:|:----:|
| LibriSpeech | ASR | 960 | ✓ |
| LibriHeavy | ASR | 50,000 | ✗ |
| GigaSpeech | ASR | 10,000 | ✓ |
| GigaSpeech2 | ASR | 30,000 | ✓ |
| WeNetSpeech | ASR | 10,000 | ✓ |
| Yodas | ASR | 320,000 | ✗ |
| CommonVoice-17.0 | ASR | 5,000 | ✓ |
| AISHELL-1 | ASR | 100 | ✓ |
| AISHELL-2 | ASR | 1,000 | ✓ |
| AISHELL-3 | ASR | 70 | ✓ |
| LJSpeech-1.1 | ASR | 37 | ✗ |
| LibriTTS | ASR | 585 | ✗ |
| MultiLingualSpokenWords | KWS | 5,000 | ✗ |
| Emilia | ASR | 101,000 | ✓ |
| CovoST-v2 | S2TT | 2,880 | ✓ |
| Fleurs | S2TT | 1,224 | ✗ |
| MSR-86K | ASR, LangID | 86,000 | ✓ |
| ACAV100M-Speech | ASR | 55,754 | ✗ |
| Must-C | ASR, S2TT | 1,000 | ✓ |
| MLS | ASR | 50,000 | ✗ |
| SpgiSpeech | ASR | 5,000 | ✗ |
| PeoplesSpeech | ASR | 30,000 | ✗ |
| KeSpeech | ASR | 1,400 | ✓ |
| LAION-300M | Caption | 230,000 | ✗ |
| **Total** | | **997,010** | **258,410** |
|
|
|
### Sound and General Audio Datasets |
|
|
|
| Dataset | Task | Length(h) | SFT? |
|:--------------:|:------------------------:|:---------:|:----:|
| FSD50k | Sound Event | 77 | ✓ |
| AudioSet | Sound Event | 5,200 | ✓ |
| AudioSet-strong | Sound Event | 220 | ✗ |
| VGGSound | Sound Event | 540 | ✓ |
| FSDKaggle2018 | Sound Event | 20 | ✓ |
| FSDKaggle2019 | Sound Event | 100 | ✓ |
| ARCA23k | Sound Event | 120 | ✗ |
| AutoACD | Audio(Sound) Caption | 5,200 | ✓ |
| AudioSetCaps | Audio(Sound) Caption | 6,000 | ✓ |
| SoundVECaps | Audio(Sound) Caption | 5,000 | ✓ |
| WavCaps | Audio(Sound) Caption | 7,567 | ✓ |
| Audiocaps | Audio(Sound) Caption | 100 | ✓ |
| Clothov2 | Audio(Sound) Caption | 17 | ✓ |
| TACOS | Audio(Sound) Caption | 98 | ✓ |
| CochlScene | SoundScape | 500 | ✓ |
| BirdSet | SoundScape | 7,000 | ✗ |
| ACAVCaps | General Caption | 38,662 | ✓ |
| **Total** | | **76,421** | **69,081** |
|
|
|
### Speech and Paralinguistic Datasets |
|
|
|
| Dataset | Task | Length(h) | SFT? |
|:------------------:|:-----------------------------:|:-------------:|:----:|
| IEMOCAP | Emotion | 8 | ✓ |
| Meld | Emotion | 12 | ✓ |
| SUBESCO | Emotion | 9 | ✗ |
| RAVDESS-Speech | Emotion | 2 | ✗ |
| RAVDESS-Song | Emotion | 1 | ✗ |
| CREMA-D | Emotion | 4 | ✗ |
| ESD | Emotion | 29 | ✗ |
| VocalSound | Vocal sound classification | 20 | ✓ |
| NonSpeech7k | Vocal sound classification | 3 | ✓ |
| VoxLingua107 | Language identification | 7,200 | ✓ |
| CommonLanguage | Language identification | 45 | ✓ |
| YLACombe | Language identification | 5 | ✗ |
| VoxCeleb1 | Speaker verification | 76 | ✓ |
| CNCeleb | Speaker verification & age | 2,100 | ✓ |
| VoxCeleb2 | Speaker verification | 1,000 | ✓ |
| VoxBlink1 | Speaker verification | 1,300 | ✓ |
| VoxBlink2 | Speaker verification | 2,600 | ✓ |
| VoxTube | Language identification | 5,200 | ✓ |
| LibriCount | Speaker counting | 8 | ✓ |
| FluentSpeechCommands | Intent classification & gender | 17 | ✗ |
| SpeechOcean762 | Speaker age | 5 | ✗ |
| ASVSpoof5 | Spoof detection | 603 | ✗ |
| **Total** | | **20,247** | **19,572** |
|
|
|
### Music-Related Datasets |
|
|
|
Covers music captioning, genre recognition, instrument classification, and singing style identification. |
|
|
|
| Dataset | Task | Length(h) | SFT? |
|:---------------:|:---------------------------------:|:---------:|:----:|
| MusicCaps | Music Caption | 15 | ✓ |
| Songdescriber | Music Caption | 23 | ✓ |
| LPMusicCaps-MTT | Music Caption | 18 | ✓ |
| LPMusicCaps-MSD | Music Caption | 1,000 | ✓ |
| VocalSet | Singing style identification | 10 | ✗ |
| FreeMusicArchive | Genre recognition | 610 | ✓ |
| MTG-Jamendo | Instrument classification, Genre recognition | 3,768 | ✓ |
| NSynth | Instrument classification | 360 | ✓ |
| GoodSounds | Instrument classification | 28 | ✓ |
| chMusic | Instrument classification | 1 | ✓ |
| CTIS | Instrument classification | 1 | ✓ |
| **Total** | | **5,824** | **5,814** |
|
|
|
### Question Answering Datasets |
|
|
|
Used for training on audio-visual QA, environment QA, and music QA tasks. Most support SFT. |
|
|
|
| Dataset | Task | # QA | SFT? |
|:---------:|:---------------:|:--------:|:----:|
| AVQA | Environment QA | 36,114 | ✓ |
| ClothoAQA | Environment QA | 6,175 | ✓ |
| TACOS+ | Environment QA | 40,019 | ✓ |
| MusicQA | Music QA | 112,878 | ✓ |
| SIFT-50M | Speech QA | 21,430,000 | ✓ |
| ACAV-QA | General QA | 24,371 | ✓ |
|
|
|
## Citation |
|
|
|
MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and commercial applications**.
|
|
|
If you find MiDashengLM useful in your research, please consider citing our work: |
|
|
|
```bibtex |
|
@techreport{midashenglm7b, |
|
title = {MiDashengLM: Efficient Audio Understanding with General Audio Captions}, |
|
author = {{Horizon Team, MiLM Plus}}, |
|
  institution = {Xiaomi Inc.},
|
year = {2025}, |
|
note = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)}, |
|
url = {https://arxiv.org/abs/2508.03983}, |
|
eprint = {2508.03983}, |
|
} |
|
``` |
|
|