---
license: cc-by-nc-sa-4.0
pipeline_tag: feature-extraction
tags:
- automatic-speech-recognition
- audio-classification
- audio
- speech
- music
library_name: transformers
datasets:
- openslr/librispeech_asr
- facebook/multilingual_librispeech
- mozilla-foundation/common_voice_17_0
- speechcolab/gigaspeech
- facebook/voxpopuli
- agkphysics/AudioSet
language:
- en
---

# USAD: Universal Speech and Audio Representation via Distillation

**Universal Speech and Audio Distillation (USAD)** is a unified **speech**, **sound**, and **music** encoder distilled from domain-specific teachers. Trained on 126k hours of mixed speech, sound, and music data, USAD delivers competitive performance across diverse benchmarks (SUPERB, HEAR, and AudioSet) with a single model.

[👀 **Read Full Paper**](https://arxiv.org/abs/2506.18843)

---

## 🗂️ Models

All USAD models are Transformer encoders operating at a **50 Hz frame rate**. The teacher models are **WavLM Base+** and **ATST Frame**.

| Model      | Parameters | Dim  | Layers | Checkpoint                                        |
| ---------- | ---------- | ---- | ------ | ------------------------------------------------- |
| USAD Small | 24M        | 384  | 12     | [link](https://huggingface.co/MIT-SLS/USAD-Small) |
| USAD Base  | 94M        | 768  | 12     | [link](https://huggingface.co/MIT-SLS/USAD-Base)  |
| USAD Large | 330M       | 1024 | 24     | [link](https://huggingface.co/MIT-SLS/USAD-Large) |

---

## 🚀 How To Use

**Installation**

```bash
pip install -U transformers
```

**Load Model and Extract Features**

```python
import torch
from transformers import AutoModel

# Load the pre-trained model
model = AutoModel.from_pretrained("MIT-SLS/USAD-Base", trust_remote_code=True).cuda().eval()

# Load audio and resample it to 16 kHz
wav = model.load_audio("path/to/audio").unsqueeze(0)  # (batch_size, wav_len)
# wav is a float tensor on the same device as the model.
# You can also load waveforms directly with torchaudio.load
# (see the sketch at the end of this card).

# Extract features
with torch.no_grad():
    results = model(wav)

# results["x"]: final encoder output (batch_size, seq_len, encoder_dim)
# results["mel"]: mel filterbank features (batch_size, seq_len * 2, mel_dim)
# results["hidden_states"]: per-layer outputs, a list of (batch_size, seq_len, encoder_dim)
# results["ffn"]: per-layer FFN outputs, a list of (batch_size, seq_len, encoder_dim)
```

See [usad_model.py](https://huggingface.co/MIT-SLS/USAD-Base/blob/main/usad_model.py) for more details about the model.

---

## 📖 Citation

```bibtex
@article{chang2025usad,
  title={{USAD}: Universal Speech and Audio Representation via Distillation},
  author={Chang, Heng-Jui and Bhati, Saurabhchand and Glass, James and Liu, Alexander H.},
  journal={arXiv preprint arXiv:2506.18843},
  year={2025}
}
```

---

## 🙏 Acknowledgement

Our implementation is based on the awesome [facebookresearch/fairseq](https://github.com/facebookresearch/fairseq), [cwx-worst-one/EAT](https://github.com/cwx-worst-one/EAT), and [sooftware/conformer](https://github.com/sooftware/conformer) repositories.
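
---

## 🧪 Example: Layer-Weighted Features for Downstream Tasks

SUPERB- and HEAR-style probes typically pool a learned weighted sum of the encoder's layer outputs rather than using only the final layer. The sketch below shows that pattern on top of `results["hidden_states"]`, together with loading a waveform via `torchaudio` as mentioned above. The weighted-sum and mean-pooling code is illustrative downstream glue under common probing assumptions, not part of the USAD API, and `path/to/audio` is a placeholder.

```python
import torch
import torchaudio
from transformers import AutoModel

model = AutoModel.from_pretrained("MIT-SLS/USAD-Base", trust_remote_code=True).cuda().eval()

# Load a waveform with torchaudio, downmix to mono, and resample to 16 kHz
wav, sr = torchaudio.load("path/to/audio")               # (channels, wav_len)
wav = wav.mean(dim=0, keepdim=True)                      # (1, wav_len)
if sr != 16000:
    wav = torchaudio.functional.resample(wav, sr, 16000)
wav = wav.cuda()

with torch.no_grad():
    results = model(wav)

# Weighted sum over layers (the weights would normally be learned jointly
# with a probe; uniform initialization shown here)
hidden = torch.stack(results["hidden_states"])           # (num_layers, batch, seq_len, dim)
weights = torch.softmax(torch.zeros(hidden.size(0), device=hidden.device), dim=0)
features = (weights[:, None, None, None] * hidden).sum(dim=0)  # (batch, seq_len, dim)

# Clip-level embedding via mean pooling over time, e.g., for audio classification
clip_embedding = features.mean(dim=1)                    # (batch, dim)
```

In practice, the layer weights would be a small `torch.nn.Parameter` trained together with the downstream head while the USAD encoder stays frozen.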