DASS: Distilled Audio State-space Models
DASS (Distilled Audio State-space Models) is an audio classification model finetuned on AudioSet-2M. DASS is the first state-space model to outperform transformer-based audio classifiers such as AST (Audio Spectrogram Transformer), HTS-AT, and Audio-MAE. It achieves state-of-the-art performance on AudioSet classification while significantly reducing model size: compared to AST, which has approximately 87M parameters, DASS-small has about one-third as many (30M) yet scores higher on AudioSet-2M (AST: 45.9 mAP vs. DASS-small: 47.2 mAP). DASS is available in two variants: DASS-small (30M parameters, 47.2 mAP) and DASS-medium (49M parameters, 47.6 mAP).
DASS is also significantly more duration-robust than AST, i.e., it can be trained on short audio and evaluated on much longer audio without fine-tuning. For example, with both models trained on 10-second clips, AST's performance drops below 5 mAP on 50-second inputs (less than 12% of its 10-second performance), while DASS still reaches 45.5 mAP (96%) in the same setting. On a single A6000 GPU, DASS can take audio inputs up to 2.5 hours long and still retain 62% of its 10-second-input performance.
It was introduced in the paper DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners and first released in this repository.
Model Details
The DASS model is based on VMamba (Visual State Space Model) applied to audio. It is trained with a binary cross-entropy loss w.r.t. the ground-truth labels and a KL-divergence loss w.r.t. the AST teacher model.
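As a rough illustration of this objective, the sketch below combines the two loss terms; the tensor names, temperature, and weighting are hypothetical and not taken from the paper or the official training code.

```python
import torch
import torch.nn.functional as F

def dass_loss(student_logits, teacher_logits, targets, distill_weight=1.0, temperature=1.0):
    """Illustrative combination of classification and distillation losses (assumed hyperparameters)."""
    # Multi-label classification loss w.r.t. the ground-truth AudioSet labels.
    bce = F.binary_cross_entropy_with_logits(student_logits, targets)

    # Distillation loss: pull the student's output distribution toward the frozen AST teacher's.
    kl = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    )
    return bce + distill_weight * kl
```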
How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
import librosa
from transformers import AutoConfig, AutoModelForAudioClassification, AutoFeatureExtractor

# Load the DASS-small checkpoint finetuned on AudioSet-2M (custom code, hence trust_remote_code=True).
config = AutoConfig.from_pretrained('saurabhati/DASS_small_AudioSet_47.2', trust_remote_code=True)
audio_model = AutoModelForAudioClassification.from_pretrained('saurabhati/DASS_small_AudioSet_47.2', trust_remote_code=True)
feature_extractor = AutoFeatureExtractor.from_pretrained('saurabhati/DASS_small_AudioSet_47.2', trust_remote_code=True)

# Load a 16 kHz waveform and convert it to model input features.
waveform, sr = librosa.load("audio/eval/_/_/--4gqARaEJE_0.000.flac", sr=16000)
inputs = feature_extractor(waveform, sr, return_tensors='pt')

with torch.no_grad():
    # AudioSet is multi-label, so apply a sigmoid rather than a softmax.
    logits = torch.sigmoid(audio_model(**inputs).logits)

# Report every class whose probability exceeds 0.5.
predicted_class_ids = torch.where(logits[0] > 0.5)[0]
predicted_label = [audio_model.config.id2label[i.item()] for i in predicted_class_ids]
print(predicted_label)
# ['Animal', 'Domestic animals, pets', 'Dog']
```
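Because DASS is duration-scalable, the same inference code can be reused on clips much longer than the 10-second training inputs. The snippet below is a hypothetical variation of the example above (the file path is a placeholder); whether variable-length input works out of the box depends on the bundled feature extractor's padding/truncation settings.

```python
# Placeholder path to a longer recording, e.g. a 50-second clip.
long_waveform, sr = librosa.load("audio/long_clip.flac", sr=16000)
inputs = feature_extractor(long_waveform, sr, return_tensors='pt')

with torch.no_grad():
    logits = torch.sigmoid(audio_model(**inputs).logits)

print([audio_model.config.id2label[i.item()] for i in torch.where(logits[0] > 0.5)[0]])
```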
Results
Below are the results for DASS models finetuned and evaluated on AudioSet-2M. In the Pretrain column, "IN SL" denotes ImageNet supervised pretraining and "SSL" denotes self-supervised pretraining.
| Model | Params | Pretrain | mAP |
|---|---|---|---|
| **Transformer-based models** | | | |
| AST | 87M | IN SL | 45.9 |
| HTS-AT | 31M | IN SL | 47.1 |
| PaSST | | IN SL | 47.1 |
| Audio-MAE | 86M | SSL | 47.3 |
| **Concurrent SSM models** | | | |
| AuM | 26M | IN SL | 39.7 |
| Audio Mamba | 40M | IN SL | 44.0 |
| DASS-Small | 30M | IN SL | 47.2 |
| DASS-Medium | 49M | IN SL | 47.6 |
Citation
@article{bhati2024dass,
title={DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners},
author={Bhati, Saurabhchand and Gong, Yuan and Karlinsky, Leonid and Kuehne, Hilde and Feris, Rogerio and Glass, James},
journal={arXiv preprint arXiv:2407.04082},
year={2024}
}
Acknowledgements
This project is based on AST (paper, code) and VMamba (paper, code); thanks for their excellent work. Please make sure to check them out.