DASS: Distilled Audio State-space Models

DASS: Distilled Audio State-space Models is an audio classification model fine-tuned on AudioSet-2M. DASS is the first state-space model to outperform transformer-based audio classifiers such as AST (Audio Spectrogram Transformer), HTS-AT, and Audio-MAE. DASS achieves state-of-the-art performance on AudioSet audio classification while significantly reducing model size: DASS-Small has roughly one-third the parameters of AST (30M vs. approximately 87M) yet outperforms it on AudioSet-2M (47.2 mAP vs. 45.9 mAP). It is available in two variants: DASS-Small (30M parameters, 47.2 mAP) and DASS-Medium (49M parameters, 47.6 mAP).

It is also significantly more duration-robust than AST: it can be trained on short audio and evaluated on much longer audio without fine-tuning. For example, with both models trained on 10-second clips, AST's performance drops below 5 mAP on 50-second inputs, less than 12% of its 10-second performance, while DASS retains 45.5 mAP (96%) in the same setting. On a single A6000 GPU, DASS can take up to 2.5 hours of audio as input and still maintain 62% of its 10-second performance. (A code sketch for long inputs follows the getting-started example below.)

It was introduced in the paper DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners and first released in this repository.

Model Details

The DASS model is based on VMamba: Visual State Space Model, applied to audio. It is trained with a binary cross-entropy loss w.r.t. the ground-truth labels and a KL-divergence loss w.r.t. the predictions of a teacher AST model.
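As a rough sketch of this objective (not the exact recipe from the paper; the loss weight lam, temperature temp, and function name below are illustrative assumptions):

import torch
import torch.nn.functional as F

def dass_distillation_loss(student_logits, teacher_logits, labels, lam=0.5, temp=1.0):
    # Binary cross-entropy against the multi-label AudioSet ground truth
    bce = F.binary_cross_entropy_with_logits(student_logits, labels)
    # KL-divergence between the student and the frozen AST teacher
    kl = F.kl_div(
        F.log_softmax(student_logits / temp, dim=-1),
        F.softmax(teacher_logits / temp, dim=-1),
        reduction='batchmean',
    )
    # lam and temp are illustrative hyperparameters, not the paper's values
    return (1 - lam) * bce + lam * kl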

How to Get Started with the Model

Use the code below to get started with the model.


import torch
import librosa
from transformers import AutoConfig, AutoModelForAudioClassification, AutoFeatureExtractor

config = AutoConfig.from_pretrained('saurabhati/DASS_small_AudioSet_47.2', trust_remote_code=True)
audio_model = AutoModelForAudioClassification.from_pretrained('saurabhati/DASS_small_AudioSet_47.2', trust_remote_code=True)
feature_extractor = AutoFeatureExtractor.from_pretrained('saurabhati/DASS_small_AudioSet_47.2', trust_remote_code=True)

# Load a 16 kHz waveform and convert it to model input features
waveform, sr = librosa.load("audio/eval/_/_/--4gqARaEJE_0.000.flac", sr=16000)
inputs = feature_extractor(waveform, sr, return_tensors='pt')

# AudioSet is multi-label, so apply a sigmoid to the logits
with torch.no_grad():
    logits = torch.sigmoid(audio_model(**inputs).logits)

# Keep every class whose probability exceeds 0.5
predicted_class_ids = torch.where(logits[0] > 0.5)[0]
predicted_label = [audio_model.config.id2label[i.item()] for i in predicted_class_ids]
predicted_label
['Animal', 'Domestic animals, pets', 'Dog']
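Because DASS handles variable-length input, the same pipeline can probe the duration robustness described above. A minimal sketch, reusing the model and feature extractor loaded above; long_audio.flac is a hypothetical local file, and we assume the custom feature extractor does not truncate long waveforms:

# Hypothetical long clip, e.g. 50 seconds; the path is illustrative
waveform_long, sr = librosa.load("long_audio.flac", sr=16000)
inputs_long = feature_extractor(waveform_long, sr, return_tensors='pt')

with torch.no_grad():
    logits_long = torch.sigmoid(audio_model(**inputs_long).logits)
predicted = [audio_model.config.id2label[i.item()] for i in torch.where(logits_long[0] > 0.5)[0]]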

Results

Below are the results for DASS models fine-tuned and evaluated on AudioSet-2M. In the Pretrain column, IN SL denotes supervised pretraining on ImageNet and SSL denotes self-supervised pretraining.

Model                      Params   Pretrain   mAP
Transformer-based models
AST                        87M      IN SL      45.9
HTS-AT                     31M      IN SL      47.1
PaSST                      –        IN SL      47.1
Audio-MAE                  86M      SSL        47.3
Concurrent SSM models
AuM                        26M      IN SL      39.7
Audio Mamba                40M      IN SL      44.0
DASS-Small                 30M      IN SL      47.2
DASS-Medium                49M      IN SL      47.6

Citation

@article{bhati2024dass,
  title={DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners},
  author={Bhati, Saurabhchand and Gong, Yuan and Karlinsky, Leonid and Kuehne, Hilde and Feris, Rogerio and Glass, James},
  journal={arXiv preprint arXiv:2407.04082},
  year={2024}
}

Acknowledgements

This project is based on AST (paper, code) and VMamba (paper, code); thanks for their excellent work. Please make sure to check them out.
