Audio Spectrogram Transformer (AST) Fine-Tuned on MLCommons Multilingual Spoken Words + Google Speech Commands

Model Details

  • Model name: ast-mlcommons-speech-commands
  • Architecture: Audio Spectrogram Transformer (AST)
  • Base pre-trained checkpoint: MIT AST fine-tuned on Google Speech Commands v0.02
  • Fine-tuning dataset: Custom dataset drawn from MLCommons Multilingual Spoken Words corpus, augmented with _silence_ and _unknown_ categories sampled from Google Speech Commands v0.02
  • License: bsd-3-clause

Model Inputs and Outputs

  • Input: 16 kHz mono audio in 1-second clips (shorter clips are zero-padded and longer ones cropped to 1 s), converted to log-mel spectrograms with 128 mel bins and a 10 ms hop length
  • Output: Softmax over 80 classes (indices 0–79). Class mapping:
    {
      "0": "_silence_",
      "1": "_unknown_",
      "2": "air",
      // ... 3–8 omitted for brevity ...
      "9": "cake",
      "10": "car",
      // ... up to 79: "zoo"
    }
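
As a rough illustration of how the class mapping above is used, the sketch below turns an 80-element score vector into a label and a probability. It assumes the model returns raw logits (as Hugging Face classification heads typically do) and applies the softmax itself. `ID2LABEL`, `softmax`, and `decode` are hypothetical helper names, the keys are integers rather than the strings shown above, and only the entries listed in this card are included; the elided indices are not guessed.

```python
import math

# Partial label map copied from the card. Indices the card elides are
# deliberately left out here rather than filled in.
ID2LABEL = {0: "_silence_", 1: "_unknown_", 2: "air", 9: "cake", 10: "car", 79: "zoo"}

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def decode(logits, id2label=ID2LABEL):
    """Return (label, probability) for the highest-scoring class."""
    if len(logits) != 80:
        raise ValueError("expected 80 logits, one per class")
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    # Fall back to a placeholder name for indices not listed in the card.
    return id2label.get(best, f"class_{best}"), probs[best]

# Made-up logits for illustration: class 10 ("car") dominates.
example_logits = [0.0] * 80
example_logits[10] = 5.0
label, prob = decode(example_logits)
```

In practice the full mapping would normally be read from the checkpoint's configuration rather than hard-coded.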
    

Training Data

  • Total samples: ~145,005 utterances

  • Sources:

    • MLCommons Multilingual Spoken Words corpus (covering 40+ languages)
    • Google Speech Commands v0.02 for silence and unknown categories
  • Preprocessing:

    • Resampling to 16 kHz
    • Fixed-length one-second windows with zero-padding or cropping
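
The fixed-length step can be sketched as follows. This is a minimal illustration, not the author's preprocessing code: the card does not say whether cropping keeps the start of the clip or whether padding goes at the end, so both choices here are assumptions, and `fix_length` is a hypothetical helper name. The frame count is only approximate, since the exact number depends on the window length and padding used by the spectrogram extractor.

```python
SAMPLE_RATE = 16_000                      # Hz, as stated in the card
CLIP_SAMPLES = SAMPLE_RATE                # one second of audio
HOP_SAMPLES = SAMPLE_RATE * 10 // 1000    # 10 ms hop -> 160 samples

def fix_length(samples, target=CLIP_SAMPLES):
    """Crop to the first `target` samples, or zero-pad at the end (assumed policy)."""
    samples = list(samples)
    if len(samples) >= target:
        return samples[:target]
    return samples + [0.0] * (target - len(samples))

short = fix_length([0.1] * 12_000)   # 0.75 s clip, padded with zeros
long = fix_length([0.1] * 20_000)    # 1.25 s clip, cropped

# Roughly how many spectrogram frames a one-second clip yields at a 10 ms hop.
frames_per_clip = CLIP_SAMPLES // HOP_SAMPLES
```

Resampling to 16 kHz is assumed to have happened before this step and is not shown.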

Evaluation Results

Metric      Value
Loss        0.0685
Precision   0.9862
Recall      0.9862
F1-score    0.9861
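
The card does not state how precision, recall, and F1 were averaged across the 80 classes. The sketch below shows one common choice, support-weighted averaging, purely as an assumption; `weighted_prf` is a hypothetical helper and the labels are toy data, not evaluation results from this model.

```python
from collections import Counter

def weighted_prf(y_true, y_pred):
    """Support-weighted precision, recall and F1 over the classes in y_true."""
    support = Counter(y_true)
    predicted = Counter(y_pred)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    precision = recall = f1 = 0.0
    for cls, count in support.items():
        p = correct[cls] / predicted[cls] if predicted[cls] else 0.0
        r = correct[cls] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / n          # each class counts in proportion to its support
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1

# Toy labels for illustration only.
y_true = ["car", "car", "zoo", "air"]
y_pred = ["car", "zoo", "zoo", "air"]
precision, recall, f1 = weighted_prf(y_true, y_pred)
```

Under this averaging scheme, weighted recall is the same quantity as overall accuracy.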

Intended Uses and Limitations

  • Suitable for:

    • Real-time keyword spotting on-device
    • Low-latency voice command detection in noisy environments
  • Limitations:

    • May misclassify under unseen noise conditions or heavy accents
    • _unknown_ class may not cover all out-of-vocabulary words; false positives possible
    • Performance may degrade on dialects or languages underrepresented in training
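
One common way to reduce false positives of the kind listed above is to reject low-confidence predictions. The sketch below is an assumption about how a deployment might do this, not part of the released model; `apply_threshold` is a hypothetical helper and the 0.8 threshold is illustrative, not tuned.

```python
def apply_threshold(label, prob, threshold=0.8, fallback="_unknown_"):
    """Keep a prediction only if its probability clears the threshold."""
    return label if prob >= threshold else fallback

accepted = apply_threshold("car", 0.93)   # confident prediction is kept
rejected = apply_threshold("car", 0.41)   # weak prediction falls back to _unknown_
```

A suitable threshold would need to be chosen on held-out audio that matches the target noise conditions.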

Citation

@inproceedings{gong2021ast,
  title={AST: Audio Spectrogram Transformer},
  author={Gong, Yuan and Chung, Yu-An and Glass, James},
  booktitle={Proc. Interspeech},
  year={2021}
}