DASS: Distilled Audio State-space Models

DASS: Distilled Audio State-space Models is an audio classification model fine-tuned on AudioSet-2M. DASS is the first state-space model to outperform transformer-based audio classifiers such as AST (Audio Spectrogram Transformer), HTS-AT, and Audio-MAE. DASS achieves state-of-the-art performance on AudioSet audio classification while significantly reducing model size: DASS-Small has roughly one-third the parameters of AST (30M vs. approximately 87M) yet outperforms it on AudioSet-2M (47.2 mAP vs. 45.9 mAP). It is available in two variants: DASS-Small (30M parameters, 47.2 mAP) and DASS-Medium (49M parameters, 47.6 mAP).

It is also significantly more duration-robust than AST: it can be trained on short audio and evaluated on much longer audio without fine-tuning. For example, with both models trained on 10-second clips, AST's performance drops below 5 mAP on 50-second inputs, less than 12% of its 10-second performance, while DASS retains 45.5 mAP (96%) in the same setting. On a single A6000 GPU, DASS can take up to 2.5 hours of audio as input and still maintain 62% of its 10-second performance. (A code sketch for long inputs follows the getting-started example below.)

It was introduced in the paper DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners and first released in this repository.

Model Details

The DASS model is based on VMamba: Visual State Space Model, applied to audio. It is trained with a binary cross-entropy loss w.r.t. the ground-truth labels and a KL-divergence loss w.r.t. the predictions of a teacher AST model.
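As a rough sketch of this objective (not the exact recipe from the paper; the loss weight lam, temperature temp, and function name below are illustrative assumptions):

import torch
import torch.nn.functional as F

def dass_distillation_loss(student_logits, teacher_logits, labels, lam=0.5, temp=1.0):
    # Binary cross-entropy against the multi-label AudioSet ground truth
    bce = F.binary_cross_entropy_with_logits(student_logits, labels)
    # KL-divergence between the student and the frozen AST teacher
    kl = F.kl_div(
        F.log_softmax(student_logits / temp, dim=-1),
        F.softmax(teacher_logits / temp, dim=-1),
        reduction='batchmean',
    )
    # lam and temp are illustrative hyperparameters, not the paper's values
    return (1 - lam) * bce + lam * kl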

How to Get Started with the Model

Use the code below to get started with the model.


import torch
import librosa
from transformers import AutoConfig, AutoModelForAudioClassification, AutoFeatureExtractor

config = AutoConfig.from_pretrained('saurabhati/DASS_small_AudioSet_47.2', trust_remote_code=True)
audio_model = AutoModelForAudioClassification.from_pretrained('saurabhati/DASS_small_AudioSet_47.2', trust_remote_code=True)
feature_extractor = AutoFeatureExtractor.from_pretrained('saurabhati/DASS_small_AudioSet_47.2', trust_remote_code=True)

# Load a 16 kHz waveform and convert it to model input features
waveform, sr = librosa.load("audio/eval/_/_/--4gqARaEJE_0.000.flac", sr=16000)
inputs = feature_extractor(waveform, sr, return_tensors='pt')

# AudioSet is multi-label, so apply a sigmoid to the logits
with torch.no_grad():
    logits = torch.sigmoid(audio_model(**inputs).logits)

# Keep every class whose probability exceeds 0.5
predicted_class_ids = torch.where(logits[0] > 0.5)[0]
predicted_label = [audio_model.config.id2label[i.item()] for i in predicted_class_ids]
predicted_label
['Animal', 'Domestic animals, pets', 'Dog']
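Because DASS handles variable-length input, the same pipeline can probe the duration robustness described above. A minimal sketch, reusing the model and feature extractor loaded above; long_audio.flac is a hypothetical local file, and we assume the custom feature extractor does not truncate long waveforms:

# Hypothetical long clip, e.g. 50 seconds; the path is illustrative
waveform_long, sr = librosa.load("long_audio.flac", sr=16000)
inputs_long = feature_extractor(waveform_long, sr, return_tensors='pt')

with torch.no_grad():
    logits_long = torch.sigmoid(audio_model(**inputs_long).logits)
predicted = [audio_model.config.id2label[i.item()] for i in torch.where(logits_long[0] > 0.5)[0]]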

Results

Below are the results for DASS models fine-tuned and evaluated on AudioSet-2M. In the Pretrain column, IN SL denotes supervised pretraining on ImageNet and SSL denotes self-supervised pretraining.

Model                      Params   Pretrain   mAP
Transformer-based models
AST                        87M      IN SL      45.9
HTS-AT                     31M      IN SL      47.1
PaSST                      –        IN SL      47.1
Audio-MAE                  86M      SSL        47.3
Concurrent SSM models
AuM                        26M      IN SL      39.7
Audio Mamba                40M      IN SL      44.0
DASS-Small                 30M      IN SL      47.2
DASS-Medium                49M      IN SL      47.6

Citation

@article{bhati2024dass,
  title={DASS: Distilled Audio State Space Models Are Stronger and More Duration-Scalable Learners},
  author={Bhati, Saurabhchand and Gong, Yuan and Karlinsky, Leonid and Kuehne, Hilde and Feris, Rogerio and Glass, James},
  journal={arXiv preprint arXiv:2407.04082},
  year={2024}
}

Acknowledgements

This project is based on AST (paper, code) and VMamba (paper, code); thanks for their excellent work. Please make sure to check them out.
