Audio Spectrogram Transformer (AST) Fine-Tuned on MLCommons Multilingual Spoken Words + Google Speech Commands
Model Details
- Model name: ast-mlcommons-speech-commands
- Architecture: Audio Spectrogram Transformer (AST)
- Base pre-trained checkpoint: MIT AST fine-tuned on Google Speech Commands v0.02
- Fine-tuning dataset: Custom dataset drawn from the MLCommons Multilingual Spoken Words corpus, augmented with _silence_ and _unknown_ categories sampled from Google Speech Commands v0.02
- License: bsd-3-clause
Model Inputs and Outputs
- Input: 16 kHz mono audio in 1-second clips (shorter or longer audio is padded or truncated to 1 second), converted to log-mel spectrograms with 128 mel bins and a 10 ms hop length
- Output: Softmax over 80 classes (indices 0–79). Class mapping:
{
  "0": "_silence_",
  "1": "_unknown_",
  "2": "air",
  // ... 3–8 omitted for brevity ...
  "9": "cake",
  "10": "car",
  // ... up to 79: "zoo"
}
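As a rough illustration of how the 80-way output can be turned into a label, the sketch below applies a softmax to a vector of raw scores and looks up the winning index. It uses only the Python standard library. The `decode` helper, the `class_<n>` fallback name, and the example logits are hypothetical; the label map contains only the entries listed above, since the remaining indices are elided in this card.

```python
import math

# Partial label map copied from this card; the other indices are elided there.
ID2LABEL = {0: "_silence_", 1: "_unknown_", 2: "air", 9: "cake", 10: "car", 79: "zoo"}
NUM_CLASSES = 80


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits, id2label=ID2LABEL):
    """Return (label, probability) for the highest-scoring class."""
    if len(logits) != NUM_CLASSES:
        raise ValueError(f"expected {NUM_CLASSES} logits, got {len(logits)}")
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    # Indices not listed in the card fall back to a placeholder name (assumption).
    return id2label.get(idx, f"class_{idx}"), probs[idx]


# Hypothetical logits: flat scores with one strong response at index 10.
example_logits = [0.0] * NUM_CLASSES
example_logits[10] = 8.0
label, prob = decode(example_logits)
print(label, round(prob, 3))
```

In practice the logits would come from the model's forward pass; the full label map should be read from the model configuration rather than from the abbreviated listing here.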
Training Data
Total samples: ~145,005 utterances
Sources:
- MLCommons Multilingual Spoken Words corpus (covering 40+ languages)
- Google Speech Commands v0.02 for silence and unknown categories
Preprocessing:
- Resampling to 16 kHz
- Fixed-length one-second windows with zero-padding or cropping
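The fixed-length step can be sketched in a few lines of plain Python. The card does not say where padding is added or which part of a long clip is kept, so this sketch assumes zeros are appended at the end and that long clips keep their first second. `fix_length` is a hypothetical helper name, and resampling is left out because it needs an audio library.

```python
SAMPLE_RATE = 16_000       # Hz, as stated in this card
TARGET_LEN = SAMPLE_RATE   # one second of audio


def fix_length(samples, target_len=TARGET_LEN):
    """Return exactly target_len samples.

    Assumption: short clips are zero-padded at the end and long clips are
    cropped to their first target_len samples.
    """
    samples = list(samples)
    if len(samples) >= target_len:
        return samples[:target_len]
    return samples + [0.0] * (target_len - len(samples))


short_clip = [0.25] * 4_000    # 0.25 s of audio
long_clip = [0.5] * 20_000     # 1.25 s of audio
print(len(fix_length(short_clip)), len(fix_length(long_clip)))
```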
Evaluation Results
Metric | Value |
---|---|
Loss | 0.0685 |
Precision | 0.9862 |
Recall | 0.9862 |
F1-score | 0.9861 |
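The card does not state how precision, recall, and F1 were averaged across the 80 classes. The near-identical precision and recall would be consistent with support-weighted averaging, but that is an assumption. Under that assumption, the metrics could be computed as in the following standard-library sketch; `weighted_prf` and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall, and F1 over all observed labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    support = Counter(y_true)      # true examples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    total = len(y_true)

    precision = recall = f1 = 0.0
    for label in set(y_true) | set(y_pred):
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / support[label] if support[label] else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support[label] / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1


# Toy example with made-up predictions.
y_true = ["car", "car", "zoo", "air"]
y_pred = ["car", "zoo", "zoo", "air"]
print(weighted_prf(y_true, y_pred))
```

With support weighting, recall equals plain accuracy, which is why it is a common default for multi-class reports.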
Intended Uses and Limitations
Suitable for:
- Real-time keyword spotting on-device
- Low-latency voice command detection in noisy environments
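For streaming use, one common approach is to slide a 1-second window over the incoming audio and classify each window. The sketch below shows only the windowing, under assumptions not stated in this card: a 0.5-second hop between windows and zero-padding of the final partial window. `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(samples, window=16_000, hop=8_000):
    """Yield (start_index, window_samples) pairs over an audio buffer.

    Assumptions: 1 s windows at 16 kHz, a 0.5 s hop, and zero-padding of
    the last window when the buffer does not fill it.
    """
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    total = len(samples)
    start = 0
    while start < total:
        chunk = list(samples[start:start + window])
        chunk.extend([0.0] * (window - len(chunk)))
        yield start, chunk
        if start + window >= total:
            break  # this window already reached the end of the buffer
        start += hop


# 2.5 s of placeholder audio yields windows starting every 0.5 s.
buffer = [0.1] * 40_000
starts = [start for start, _ in sliding_windows(buffer)]
print(starts)
```

A shorter hop lowers detection latency at the cost of more model invocations per second.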
Limitations:
- May misclassify under unseen noise conditions or heavy accents
- The _unknown_ class may not cover all out-of-vocabulary words; false positives are possible
- Performance may degrade on dialects or languages underrepresented in training
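One simple way to trade missed detections for fewer false positives is to reject low-confidence predictions. The sketch below maps any prediction below a confidence threshold to _unknown_. The `apply_rejection` helper and the 0.8 threshold are hypothetical and would need tuning on held-out data; this is not something the model does on its own.

```python
UNKNOWN_LABEL = "_unknown_"


def apply_rejection(label, prob, threshold=0.8):
    """Return label if prob meets the threshold, otherwise _unknown_.

    The default threshold is an illustrative value, not a tuned one.
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    return label if prob >= threshold else UNKNOWN_LABEL


print(apply_rejection("car", 0.95), apply_rejection("car", 0.40))
```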
Citation
@inproceedings{gong2021ast,
title={AST: Audio Spectrogram Transformer},
author={Gong, Yuan and Chung, Yu-An and Glass, James},
booktitle={Interspeech},
year={2021}
}
Model tree for mahmoudmamdouh13/ast-mlcommons-speech-commands
- Base model: MIT/ast-finetuned-speech-commands-v2
Evaluation results (self-reported, audiofolder validation set)
- Precision: 0.986
- Recall: 0.986
- F1: 0.986