|
--- |
|
license: apache-2.0 |
|
language: |
|
- en |
|
- zh |
|
- th |
|
- id |
|
- vi |
|
pipeline_tag: audio-text-to-text |
|
tags: |
|
- multimodal |
|
- audio-language-model |
|
- audio |
|
base_model: |
|
- mispeech/dasheng-0.6B |
|
- Qwen/Qwen2.5-Omni-7B |
|
base_model_relation: finetune |
|
--- |
|
|
|
<div align="center"> |
|
<h1> |
|
MiDashengLM |
|
</h1> |
|
<b><em>Efficient audio understanding with general audio captions</em></b>
|
<p> |
|
</p> |
|
<a href="https://arxiv.org/abs/2508.03983"><img src="https://img.shields.io/badge/arXiv-2508.03983-b31b1b" alt="version"></a> |
|
<a href="https://github.com/xiaomi-research/dasheng-lm"><img src="https://img.shields.io/badge/Homepage-GitHub-0366d6" alt="version"></a> |
|
<a href="https://modelscope.cn/models/midasheng/midashenglm-7b"><img src="https://img.shields.io/badge/ModelScope-7B-7448ce" alt="version"></a> |
|
<a href="https://modelscope.cn/studios/midasheng/MiDashengLM-7B"><img src="https://img.shields.io/badge/Demo-Gradio-ffcc66" alt="version"></a> |
|
<a href="https://xiaomi-research.github.io/dasheng-lm/"><img src="https://img.shields.io/badge/Demo-Page-0366d6" alt="version"></a> |
|
</div> |
|
|
|
## 🔥 Key Highlights
|
|
|
**State-of-the-Art Performance** |
|
- Outperforms Qwen2.5-Omni-7B and Kimi-Audio-Instruct-7B on **multiple key audio understanding tasks**.
|
|
|
**High Efficiency** |
|
- **3.2×** throughput speedup at comparable batch sizes compared to Qwen2.5-Omni-7B.

- **20×** throughput speedup by further increasing the batch size: we tested up to a **batch size of 512** for 30s audio inputs on 80GB GPUs, while the baseline only supports a batch size of 8.

- Time-to-first-token (TTFT) speedup of up to **4×** compared to Qwen2.5-Omni-7B.
|
|
|
**Caption-based Alignment** |
|
- Trained with **general audio captions** (instead of ASR transcripts) to achieve holistic audio understanding. |
|
|
|
**Full Transparency** |
|
- **Publicly available** training data and a reproducible training pipeline.
|
- Apache License 2.0 for **both research and commercial use**. |
|
|
|
<div align="center"> |
|
<img src="fig/capabilities_plot_7b-1.png" width="600"> |
|
</div> |
|
|
|
## Acknowledgment and Model Foundation |
|
|
|
Although MiDashengLM demonstrates superior audio understanding performance and efficiency compared to Qwen2.5-Omni models, |
|
we acknowledge **Qwen2.5-Omni as a remarkable and respected foundational work** in the field. |
|
Our model specifically uses [Qwen2.5-Omni-7B Thinker](https://huggingface.co/Qwen/Qwen2.5-Omni-7B) as the initialization for decoder training, building upon its robust architecture and weight initialization. |
|
|
|
The audio encoder is built upon [Dasheng](https://github.com/XiaoMi/dasheng), an open-source audio encoder for general audio understanding with state-of-the-art performance. |
|
**Dasheng serves as the core foundation enabling MiDashengLM's exceptional performance**. |
|
|
|
## Framework |
|
|
|
MiDashengLM integrates the powerful Dasheng audio encoder with |
|
the Qwen2.5-Omni-7B Thinker decoder through a unique caption-based alignment strategy. |
|
Unlike conventional ASR-driven approaches, |
|
our model leverages general audio captions to capture comprehensive audio representations encompassing speech, environmental sounds, and musical elements |
|
in a unified textual format. This design enables holistic audio understanding while maintaining exceptional computational efficiency. |
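
The data flow can be pictured as an encoder-projector-decoder stack. The sketch below is schematic only; the module stand-ins and dimensions are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class EncoderProjectorDecoder(nn.Module):
    """Schematic only: audio encoder -> projector -> autoregressive decoder."""

    def __init__(self, d_audio: int = 768, d_text: int = 3584):
        super().__init__()
        self.audio_encoder = nn.Identity()           # stand-in for Dasheng
        self.projector = nn.Linear(d_audio, d_text)  # audio features -> decoder embedding space
        self.decoder = nn.Identity()                 # stand-in for the Qwen2.5-Omni Thinker

    def forward(self, audio_feats: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # Projected audio tokens are consumed alongside the text prompt embeddings.
        audio_tokens = self.projector(self.audio_encoder(audio_feats))
        return self.decoder(torch.cat([audio_tokens, text_embeds], dim=1))
```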
|
|
|
<img src="fig/Framework-1.png" width="800"> |
|
|
|
### Why Captions Instead of ASR? |
|
|
|
ASR Limitations:

- Discards a huge amount of non-speech audio (music, environmental sounds).

- Misses paralinguistic information (speaker emotion, acoustic properties).

- Monotonic alignment provides only a trivial learning signal.



Caption Advantages:

- Utilizes all audio content.

- Captures global audio context.

- Non-monotonic alignment provides a harder, more informative learning signal.
|
|
|
### Novel Open Source Dataset for Training: ACAVCaps |
|
|
|
ACAVCaps is a meticulously curated 38,662-hour collection of general audio captions derived from the open-source [ACAV100M audio repository](https://acav100m.github.io/). |
|
While leveraging ACAV100M's extensive raw audio materials, we completely re-engineered the annotation process to create a dataset for holistic audio understanding. |
|
We divide the dataset into six categories:
|
|
|
| Category | Example Caption |
|----------|-----------------|
| Pure Speech | "A female voice narrates historical competition with synthetic modulation" |
| Pure Sound | "Outdoor scene with wind, birds, duck quacking and background noise" |
| Pure Music | "Crowd cheering with electronic synthesizer-driven soundscape" |
| Mixed Music | "The audio features a crowd cheering and clapping alongside electronic music with a synthesizer-driven, dark, and energetic soundscape." |
| Mixed Speech | "A Russian voice demonstrates a synthesizer's capabilities over an experimental electronic backdrop, explaining its sound design and value in a gritty, vocal-fry tone." |
| Mixed Sound | "A man speaks in English about entering a city and village, accompanied by the sounds of a running vehicle." |
|
|
|
The figure below illustrates our data curation pipeline for ACAVCaps: |
|
|
|
<img src="fig/acavcaps-1.png" width="800"> |
|
|
|
Each caption is generated through a three-step process: |
|
|
|
1. **Multi-expert analysis** (speech, vocal, music, acoustics) |
|
2. **LLM reasoning** synthesizing metadata with [DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1) |
|
3. **Filtering** for audio-text consistency with [Dasheng-GLAP](https://github.com/xiaomi-research/dasheng-glap) |
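
The consistency filter in step 3 can be pictured as a cosine-similarity threshold over paired audio and caption embeddings. A minimal sketch of that idea (the embedding inputs and the 0.5 threshold are illustrative assumptions, not the actual Dasheng-GLAP API):

```python
import numpy as np

def filter_consistent(audio_embs: np.ndarray, text_embs: np.ndarray,
                      threshold: float = 0.5) -> np.ndarray:
    """Boolean mask over (N, D) embedding pairs: keep captions whose
    cosine similarity with their audio clears the (illustrative) threshold."""
    a = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return (a * t).sum(axis=1) >= threshold
```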
|
|
|
We will **release the ACAVCaps dataset** after the ICASSP 2026 review process. |
|
|
|
## Usage |
|
|
|
### Load Model |
|
|
|
```python |
|
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer |
|
|
|
model_id = "mispeech/midashenglm-7b" |
|
|
|
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True) |
|
tokenizer = AutoTokenizer.from_pretrained(model_id) |
|
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True) |
|
``` |
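
If a GPU is available, the weights can also be loaded in half precision for faster inference. A minimal sketch (`bfloat16` assumes a bf16-capable device; the tokenized inputs must then be moved to the same device, as shown in the generation step below):

```python
import torch

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # half precision; assumes a bf16-capable GPU
    trust_remote_code=True,
).to("cuda").eval()
```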
|
|
|
### Construct Prompt |
|
|
|
```python |
|
user_prompt = "Caption the audio." # You may try any other prompt |
|
|
|
messages = [ |
|
{ |
|
"role": "system", |
|
"content": [ |
|
{"type": "text", "text": "You are a helpful language and speech assistant."} |
|
], |
|
}, |
|
{ |
|
"role": "user", |
|
"content": [ |
|
{"type": "text", "text": user_prompt}, |
|
{ |
|
"type": "audio", |
|
"path": "/path/to/example.wav", |
|
# or "url": "https://example.com/example.wav" |
|
# or "audio": np.random.randn(16000) |
|
}, |
|
], |
|
}, |
|
] |
|
``` |
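
To pass a raw array instead of a path or URL, load the waveform yourself first. A minimal sketch with `soundfile` (one loader among many; the 16 kHz expectation is an assumption based on the 16000-sample example above):

```python
import soundfile as sf

waveform, sample_rate = sf.read("/path/to/example.wav")  # float waveform + sample rate
# Pass it via {"type": "audio", "audio": waveform} in the user message above.
```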
|
|
|
### Generate Output |
|
|
|
```python |
|
import torch |
|
|
|
with torch.no_grad(): |
|
model_inputs = processor.apply_chat_template( |
|
messages, |
|
tokenize=True, |
|
add_generation_prompt=True, |
|
add_special_tokens=True, |
|
return_dict=True, |
|
) |
|
generation = model.generate(**model_inputs) |
|
output = tokenizer.batch_decode(generation, skip_special_tokens=True) # ["An engine is idling."] |
|
``` |
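
If the model was moved to a GPU (see the loading sketch above), send the tokenized inputs to the same device before generating. A small variation on the snippet above (`max_new_tokens=128` is an arbitrary cap, not a required setting):

```python
with torch.no_grad():
    model_inputs = {k: (v.to(model.device) if hasattr(v, "to") else v)
                    for k, v in model_inputs.items()}
    generation = model.generate(**model_inputs, max_new_tokens=128)
    output = tokenizer.batch_decode(generation, skip_special_tokens=True)
```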
|
|
|
## Results |
|
|
|
MiDashengLM delivers solid performance across diverse audio understanding tasks. |
|
|
|
### Audio Captioning Results |
|
|
|
| Domain | Dataset | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:--------:|:--------------:|:--------------:|:----------------:|:-------------------:|
| Music | MusicCaps | **59.71** | 43.71 | 35.43 |
| Music | Songdescriber | **45.39** | 45.31 | 44.63 |
| Sound | AudioCaps | **62.18** | 60.79 | 49.00 |
| Sound | ClothoV2 | **49.20** | 47.55 | 48.01 |
| Sound | AutoACD | **66.52** | 55.93 | 44.76 |
|
|
|
*Metrics: FENSE (higher is better).* |
|
|
|
### Audio and Paralinguistic Classification |
|
|
|
| Dataset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:----------------:|:------:|:--------------:|:----------------:|:------------------:|
| VoxCeleb1 | ACC↑ | **92.36** | 59.71 | 82.72 |
| VoxLingua107 | ACC↑ | **93.41** | 51.03 | 73.65 |
| VoxCeleb-Gender | ACC↑ | 96.12 | **99.82** | 99.69 |
| VGGSound | ACC↑ | **52.11** | 0.97 | 2.20 |
| CochlScene | ACC↑ | **74.06** | 23.88 | 18.34 |
| NSynth | ACC↑ | **80.52** | 60.45 | 38.09 |
| FMA | ACC↑ | 63.73 | **66.77** | 27.91 |
| FSDKaggle2018 | ACC↑ | **75.25** | 31.38 | 24.75 |
| AudioSet | mAP↑ | **8.86** | 6.48 | 3.47 |
| FSD50K | mAP↑ | **37.58** | 23.87 | 27.23 |
|
|
|
### ASR Performance |
|
|
|
| Dataset | Language | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:------------------:|:-----------:|:--------------:|:------------:|:-------------------:|
| LibriSpeech test-clean | English | 3.7 | 1.7 | **1.3** |
| LibriSpeech test-other | English | 6.2 | 3.4 | **2.4** |
| People's Speech | English | 27.8 | 28.6 | **22.3** |
| AISHELL2 Mic | Chinese | 3.2 | **2.5** | 2.7 |
| AISHELL2 iOS | Chinese | 2.9 | **2.6** | **2.6** |
| AISHELL2 Android | Chinese | 3.1 | 2.7 | **2.6** |
| GigaSpeech2 | Indonesian | **20.8** | 21.2 | >100 |
| GigaSpeech2 | Thai | **36.9** | 53.8 | >100 |
| GigaSpeech2 | Vietnamese | **18.1** | 18.6 | >100 |
|
|
|
*Metrics: WER/CER (lower is better).* |
|
|
|
### Question Answering Results |
|
|
|
| Dataset | Subset | Metric | MiDashengLM | Qwen2.5-Omni-7B | Kimi-Audio-Instruct |
|:------------:|:-------:|:------:|:--------------:|:----------------:|:-------------------:|
| MuChoMusic | | ACC↑ | **71.35** | 64.79 | 67.40 |
| MMAU | Sound | ACC↑ | 68.47 | 67.87 | **74.17** |
| MMAU | Music | ACC↑ | 66.77 | **69.16** | 61.08 |
| MMAU | Speech | ACC↑ | **63.66** | 59.76 | 57.66 |
| MMAU | Average | ACC↑ | **66.30** | 65.60 | 64.30 |
| MusicQA | | FENSE↑ | **62.35** | 60.60 | 40.00 |
| AudioCaps-QA | | FENSE↑ | **54.31** | 53.28 | 47.34 |
|
|
|
*Metrics: Higher is better.* |
|
|
|
### Reproduction Instructions |
|
|
|
To reproduce our results, we provide: |
|
|
|
- Prompts ([prompt.csv](evaluate/prompt.csv)) |
|
- Evaluation scripts |
|
- Example JSONL files |
|
|
|
#### 1. Install Dependencies for Evaluation (not needed for inference)
|
|
|
```bash |
|
pip install -r requirements.txt |
|
``` |
|
|
|
#### 2. Generate Model Outputs |
|
|
|
Generate responses using the model's official framework with prompts from [prompt.csv](evaluate/prompt.csv). |
|
|
|
#### 3. Convert Outputs to JSONL Format |
|
|
|
Format model outputs using the [example JSONL](evaluate/jsonl) files: |
|
|
|
| Task | Example File |
|------|--------------|
| Automatic Speech Recognition | [MiDashengLM_LibriSpeech_test-clean.jsonl](evaluate/jsonl/MiDashengLM_LibriSpeech_test-clean.jsonl) |
| Single-target Audio Tagging | [MiDashengLM_NSynth.jsonl](evaluate/jsonl/MiDashengLM_NSynth.jsonl) |
| Gender Recognition | [MiDashengLM_VoxCeleb-Gender.jsonl](evaluate/jsonl/MiDashengLM_VoxCeleb-Gender.jsonl) |
| Multi-target Audio Tagging | [MiDashengLM_FSD50K.jsonl](evaluate/jsonl/MiDashengLM_FSD50K.jsonl) |
| Audio Captioning | [MiDashengLM_AutoACD.jsonl](evaluate/jsonl/MiDashengLM_AutoACD.jsonl) |
| Open Audio Question Answering | [MiDashengLM_MusicQA.jsonl](evaluate/jsonl/MiDashengLM_MusicQA.jsonl) |
| Audio QA with Options | [MiDashengLM_MuChoMusic.jsonl](evaluate/jsonl/MiDashengLM_MuChoMusic.jsonl) |
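
Each JSONL line is one JSON object whose keys match the fields listed in the `Uses:` comments of step 4. A minimal writing sketch for the ASR case (field values here are placeholders):

```python
import json

# One record per test utterance; keys follow the WER script's "Uses:" list.
record = {
    "lang": "en",                           # language code
    "text": "reference transcription",      # ground-truth text
    "model_output": "model transcription",  # response generated in step 2
}

with open("MiDashengLM_LibriSpeech_test-clean.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```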
|
|
|
#### 4. Evaluate Results |
|
|
|
Execute the corresponding evaluation scripts: |
|
|
|
```bash |
|
# Automatic Speech Recognition (WER) |
|
# Uses: lang, text, model_output |
|
python evaluate/wer/compute_wer.py -i evaluate/jsonl/MiDashengLM_LibriSpeech_test-clean.jsonl |
|
|
|
# Single-target Audio Tagging (ACC) |
|
# Uses: label, model_output |
|
python evaluate/compute_at_acc.py -i evaluate/jsonl/MiDashengLM_NSynth.jsonl |
|
|
|
# Gender Recognition (ACC) |
|
# Uses: label, model_output |
|
python evaluate/compute_gender_acc.py -i evaluate/jsonl/MiDashengLM_VoxCeleb-Gender.jsonl |
|
|
|
# Multi-target Audio Tagging (mAP) |
|
# Uses: dataset_name, label, model_output, model_name |
|
python evaluate/compute_map.py -i evaluate/jsonl/MiDashengLM_FSD50K.jsonl |
|
|
|
# Audio Captioning (FENSE) |
|
# Uses: audio, text, model_output |
|
python evaluate/compute_fense.py -i evaluate/jsonl/MiDashengLM_AutoACD.jsonl |
|
|
|
# Open Audio QA (FENSE) |
|
# Uses: audio, answer, model_output |
|
python evaluate/compute_fense.py -i evaluate/jsonl/MiDashengLM_MusicQA.jsonl |
|
|
|
# Audio QA with Options (ACC) |
|
# Uses: answer, model_output |
|
python evaluate/compute_qa_acc.py -i evaluate/jsonl/MiDashengLM_MuChoMusic.jsonl |
|
``` |
|
|
|
#### 5. Evaluate on MECAT and MMAU benchmarks |
|
|
|
Please refer to the official repositories for evaluation on the [MECAT](https://github.com/xiaomi-research/mecat) |
|
and [MMAU](https://github.com/Sakshi113/mmau) benchmarks. |
|
|
|
## Efficiency |
|
|
|
MiDashengLM demonstrates superior inference efficiency compared to Qwen2.5-Omni-7B,

achieving a 3.2× speedup at comparable batch sizes and an overall potential speedup of 20.2× with larger batches.
|
|
|
<img src="fig/batchsize_1_comparison_7b-1.png" width="800"> |
|
|
|
| Batch Size | MiDashengLM (samples/s) | Qwen2.5-Omni-7B (samples/s) | Speedup |
|:----------:|:-----------------------:|:----------------------------:|:-------:|
| 1 | 0.45 | 0.36 | 1.25× |
| 4 | 1.40 | 0.91 | 1.53× |
| 8 | 2.72 | 1.15 | 2.36× |
| 16 | 5.18 | OOM | - |
| 32 | 9.78 | OOM | - |
| 64 | 17.07 | OOM | - |
| 128 | 22.73 | OOM | - |
| 200 | 25.15 | OOM | - |
|
|
|
*Tested on 80GB GPU with 30s audio, 100-token output.* |
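
Throughput here is samples per second for one batched `generate()` call. A minimal timing sketch of how such a number can be measured (not our benchmark harness; it assumes the model and `model_inputs` already sit on a CUDA device and that `model_inputs` holds a batch of `batch_size` samples):

```python
import time

import torch

def throughput(model, model_inputs, batch_size: int, max_new_tokens: int = 100) -> float:
    """Rough samples/s estimate from a single timed generate() call."""
    torch.cuda.synchronize()  # make sure pending GPU work does not skew the timer
    start = time.perf_counter()
    with torch.no_grad():
        model.generate(**model_inputs, max_new_tokens=max_new_tokens)
    torch.cuda.synchronize()  # wait for generation to actually finish
    return batch_size / (time.perf_counter() - start)
```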
|
|
|
## Training Data |
|
|
|
MiDashengLM is trained exclusively on publicly available datasets across five categories: Speech, Sound and General Audio, Speech and Paralinguistic, Music, and Question Answering. All datasets are listed below with their respective tasks, lengths, and supervised fine-tuning (SFT) usage. |
|
|
|
<img src="fig/pretraining_sampling_rates-1.png" width="1200"> |
|
|
|
### Speech Training Data |
|
|
|
This table lists speech-related datasets used for tasks like Automatic Speech Recognition (ASR), keyword spotting (KWS), and speech-to-text translation (S2TT). |
|
The column "SFT?" indicates whether the dataset is used for supervised fine-tuning.
|
|
|
| Data | Task | Length(h) | SFT? |
|:----------------------:|:---------:|:---------:|:----:|
| LibriSpeech | ASR | 960 | ✓ |
| LibriHeavy | ASR | 50,000 | ✗ |
| GigaSpeech | ASR | 10,000 | ✓ |
| GigaSpeech2 | ASR | 30,000 | ✓ |
| WeNetSpeech | ASR | 10,000 | ✓ |
| Yodas | ASR | 320,000 | ✗ |
| CommonVoice-17.0 | ASR | 5,000 | ✓ |
| AISHELL-1 | ASR | 100 | ✓ |
| AISHELL-2 | ASR | 1,000 | ✓ |
| AISHELL-3 | ASR | 70 | ✓ |
| LJSpeech-1.1 | ASR | 37 | ✗ |
| LibriTTS | ASR | 585 | ✗ |
| MultiLingualSpokenWords | KWS | 5,000 | ✗ |
| Emilia | ASR | 101,000 | ✓ |
| CovoST-v2 | S2TT | 2,880 | ✓ |
| Fleurs | S2TT | 1,224 | ✗ |
| MSR-86K | ASR, LangID | 86,000 | ✓ |
| ACAV100M-Speech | ASR | 55,754 | ✗ |
| Must-C | ASR, S2TT | 1,000 | ✓ |
| MLS | ASR | 50,000 | ✗ |
| SpgiSpeech | ASR | 5,000 | ✗ |
| PeoplesSpeech | ASR | 30,000 | ✗ |
| KeSpeech | ASR | 1,400 | ✓ |
| LAION-300M | Caption | 230,000 | ✗ |
| **Total** | | **997,010** | **258,410** |
|
|
|
### Sound and General Audio Datasets |
|
|
|
| Dataset | Task | Length(h) | SFT? |
|:--------------:|:------------------------:|:---------:|:----:|
| FSD50k | Sound Event | 77 | ✓ |
| AudioSet | Sound Event | 5,200 | ✓ |
| AudioSet-strong | Sound Event | 220 | ✗ |
| VGGSound | Sound Event | 540 | ✓ |
| FSDKaggle2018 | Sound Event | 20 | ✓ |
| FSDKaggle2019 | Sound Event | 100 | ✓ |
| ARCA23k | Sound Event | 120 | ✗ |
| AutoACD | Audio(Sound) Caption | 5,200 | ✓ |
| AudioSetCaps | Audio(Sound) Caption | 6,000 | ✓ |
| SoundVECaps | Audio(Sound) Caption | 5,000 | ✓ |
| WavCaps | Audio(Sound) Caption | 7,567 | ✓ |
| Audiocaps | Audio(Sound) Caption | 100 | ✓ |
| Clothov2 | Audio(Sound) Caption | 17 | ✓ |
| TACOS | Audio(Sound) Caption | 98 | ✓ |
| CochlScene | SoundScape | 500 | ✓ |
| BirdSet | SoundScape | 7,000 | ✗ |
| ACAVCaps | General Caption | 38,662 | ✓ |
| **Total** | | **76,421** | **69,081** |
|
|
|
### Speech and Paralinguistic Datasets |
|
|
|
| Dataset | Task | Length(h) | SFT? |
|:------------------:|:-----------------------------:|:-------------:|:----:|
| IEMOCAP | Emotion | 8 | ✓ |
| Meld | Emotion | 12 | ✓ |
| SUBESCO | Emotion | 9 | ✗ |
| RAVDESS-Speech | Emotion | 2 | ✗ |
| RAVDESS-Song | Emotion | 1 | ✗ |
| CREMA-D | Emotion | 4 | ✗ |
| ESD | Emotion | 29 | ✗ |
| VocalSound | Vocal sound classification | 20 | ✓ |
| NonSpeech7k | Vocal sound classification | 3 | ✓ |
| VoxLingua107 | Language identification | 7,200 | ✓ |
| CommonLanguage | Language identification | 45 | ✓ |
| YLACombe | Language identification | 5 | ✗ |
| VoxCeleb1 | Speaker verification | 76 | ✓ |
| CNCeleb | Speaker verification & age | 2,100 | ✓ |
| VoxCeleb2 | Speaker verification | 1,000 | ✓ |
| VoxBlink1 | Speaker verification | 1,300 | ✓ |
| VoxBlink2 | Speaker verification | 2,600 | ✓ |
| VoxTube | Language identification | 5,200 | ✓ |
| LibriCount | Speaker counting | 8 | ✓ |
| FluentSpeechCommands | Intent classification & gender | 17 | ✗ |
| SpeechOcean762 | Speaker age | 5 | ✗ |
| ASVSpoof5 | Spoof detection | 603 | ✗ |
| **Total** | | **20,247** | **19,572** |
|
|
|
### Music-Related Datasets |
|
|
|
Covers music captioning, genre recognition, instrument classification, and singing style identification. |
|
|
|
| Dataset | Task | Length(h) | SFT? |
|:---------------:|:---------------------------------:|:---------:|:----:|
| MusicCaps | Music Caption | 15 | ✓ |
| Songdescriber | Music Caption | 23 | ✓ |
| LPMusicCaps-MTT | Music Caption | 18 | ✓ |
| LPMusicCaps-MSD | Music Caption | 1,000 | ✓ |
| VocalSet | Singing style identification | 10 | ✗ |
| FreeMusicArchive | Genre recognition | 610 | ✓ |
| MTG-Jamendo | Instrument classification, Genre recognition | 3,768 | ✓ |
| NSynth | Instrument classification | 360 | ✓ |
| GoodSounds | Instrument classification | 28 | ✓ |
| chMusic | Instrument classification | 1 | ✓ |
| CTIS | Instrument classification | 1 | ✓ |
| **Total** | | **5,824** | **5,814** |
|
|
|
### Question Answering Datasets |
|
|
|
Used for training on audio-visual QA, environment QA, and music QA tasks. Most support SFT. |
|
|
|
| Dataset | Task | # QA | SFT? |
|:---------:|:---------------:|:--------:|:----:|
| AVQA | Environment QA | 36,114 | ✓ |
| ClothoAQA | Environment QA | 6,175 | ✓ |
| TACOS+ | Environment QA | 40,019 | ✓ |
| MusicQA | Music QA | 112,878 | ✓ |
| SIFT-50M | Speech QA | 21,430,000 | ✓ |
| ACAV-QA | General QA | 24,371 | ✓ |
|
|
|
## Citation |
|
|
|
MiDashengLM is released under the Apache License 2.0, and we encourage its use in **both research and commercial applications**.
|
|
|
If you find MiDashengLM useful in your research, please consider citing our work: |
|
|
|
```bibtex |
|
@techreport{midashenglm7b, |
|
title = {MiDashengLM: Efficient Audio Understanding with General Audio Captions}, |
|
author = {{Horizon Team, MiLM Plus}}, |
|
  institution = {Xiaomi Inc.},
|
year = {2025}, |
|
note = {Contributors: Heinrich Dinkel et al. (listed alphabetically in Appendix B)}, |
|
url = {https://arxiv.org/abs/2508.03983}, |
|
eprint = {2508.03983}, |
|
} |
|
``` |
|
|