Disclaimer: The models may still contain errors; verification is ongoing.

Bird-MAE-Base: Can Masked Autoencoders Also Listen to Birds?

Abstract

Masked Autoencoders (MAEs) have shown competitive results in audio classification by learning rich semantic representations through an efficient self-supervised reconstruction task. However, general-purpose models fail to generalize well when applied directly to fine-grained audio domains. Specifically, bird-sound classification requires distinguishing subtle inter-species differences and managing high intra-species acoustic variability, thereby revealing the performance limitations of general-domain Audio-MAE models. This work demonstrates that bridging this domain gap requires more than domain-specific pretraining data; adapting the entire training pipeline is crucial. We systematically revisit and adapt the pretraining recipe, fine-tuning methods, and frozen feature utilization to bird sounds using BirdSet, a large-scale bioacoustic dataset comparable to AudioSet. Our resulting Bird-MAE achieves new state-of-the-art results in BirdSet's multi-label classification benchmark. Additionally, we introduce the parameter-efficient prototypical probing, enhancing the utility of frozen MAE representations and closely approaching fine-tuning performance in low-resource settings. Bird-MAE's prototypical probes outperform linear probing by up to 37 pp in MAP and narrow the gap to fine-tuning to approximately 3.3 pp on average across BirdSet downstream tasks. Bird-MAE also demonstrates robust few-shot capabilities with prototypical probing in our newly established few-shot benchmark on BirdSet, highlighting the potential of tailored self-supervised learning pipelines for fine-grained audio domains.

Evaluation Results

Table 1

Probing results on the multi-label classification benchmark BirdSet with full data (MAP%). Comparison of linear probing vs. prototypical probing using frozen encoder representations. Models follow the evaluation protocol of BirdSet. Best results are highlighted.

| Model | Arch. | Probing | HSN (val) | POW | PER | NES | UHH | NBP | SSW | SNE |
|---|---|---|---|---|---|---|---|---|---|---|
| BirdAVES | HUBERT | linear | 14.91 | 12.60 | 5.41 | 6.36 | 11.76 | 33.68 | 4.55 | 7.86 |
| BirdAVES | HUBERT | proto | 32.52 | 19.98 | 5.14 | 11.87 | 15.41 | 39.85 | 7.71 | 9.59 |
| SimCLR | CvT-13 | linear | 17.29 | 17.89 | 6.66 | 10.64 | 7.43 | 26.35 | 6.99 | 8.92 |
| SimCLR | CvT-13 | proto | 18.00 | 17.02 | 3.37 | 7.91 | 7.08 | 26.60 | 5.36 | 8.83 |
| Audio-MAE | ViT-B/16 | linear | 8.77 | 10.36 | 3.72 | 4.48 | 10.78 | 24.70 | 2.50 | 5.60 |
| Audio-MAE | ViT-B/16 | proto | 19.42 | 19.58 | 9.34 | 15.53 | 16.84 | 35.32 | 8.81 | 12.34 |
| Bird-MAE | ViT-B/16 | linear | 13.06 | 14.28 | 5.63 | 8.16 | 14.75 | 34.57 | 5.59 | 8.16 |
| Bird-MAE | ViT-B/16 | proto | 43.84 | 37.67 | 20.72 | 28.11 | 26.46 | 62.68 | 22.69 | 22.16 |
| Bird-MAE | ViT-B/16 | linear | 12.44 | 16.20 | 6.63 | 8.31 | 15.41 | 41.91 | 5.75 | 7.94 |
| Bird-MAE | ViT-B/16 | proto | 49.97 | 51.73 | 31.38 | 37.80 | 29.97 | 69.50 | 37.74 | 29.96 |
| Bird-MAE | ViT-L/16 | linear | 13.25 | 14.82 | 7.29 | 7.93 | 12.99 | 38.71 | 5.60 | 7.84 |
| Bird-MAE | ViT-L/16 | proto | 47.52 | 49.65 | 30.43 | 35.85 | 28.91 | 69.13 | 35.83 | 28.31 |

For more details, refer to the paper linked below.

Example

This model can be loaded and used for inference with the transformers library.

Note that this is the base model without a classification head: to perform classification, you need to fine-tune a head on top of the frozen encoder. We provide both a linear and a prototypical probing head.

from transformers import AutoFeatureExtractor, AutoModel
import librosa

# Load the model and feature extractor
model = AutoModel.from_pretrained("DBD-research-group/Bird-MAE-Base", trust_remote_code=True)
feature_extractor = AutoFeatureExtractor.from_pretrained("DBD-research-group/Bird-MAE-Base", trust_remote_code=True)
model.eval()

# Load an example audio file
audio_path = librosa.ex('robin')

# The model is trained on audio sampled at 32,000 Hz
audio, sample_rate = librosa.load(audio_path, sr=32_000)

# Convert the waveform into a mel spectrogram
mel_spectrogram = feature_extractor(audio)

# Forward pass yields the embedding (dimension depends on model size)
embedding = model(mel_spectrogram)
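To illustrate how a probing head can sit on top of the frozen embedding, below is a minimal PyTorch sketch of a linear probe and a simplified prototype-based probe (one learnable prototype per class, scored by cosine similarity). This is not the exact head implementation shipped with the model; class count and the 768-dim ViT-B/16 embedding size are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class LinearProbe(nn.Module):
    """Linear probing: a single linear layer on frozen embeddings."""

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(x)  # (batch, num_classes) logits


class PrototypicalProbe(nn.Module):
    """Simplified prototypical probing: one learnable prototype per
    class; logits are cosine similarities between the (normalized)
    embedding and each prototype."""

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.prototypes = nn.Parameter(torch.randn(num_classes, embed_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = F.normalize(x, dim=-1)
        protos = F.normalize(self.prototypes, dim=-1)
        return x @ protos.t()  # (batch, num_classes) similarity logits


# Dummy batch of ViT-B/16-sized embeddings (768-dim is an assumption;
# check the actual shape returned by model(...)).
embedding = torch.randn(4, 768)
logits = PrototypicalProbe(768, num_classes=21)(embedding)
print(logits.shape)  # torch.Size([4, 21])
```

During probing, only the head's parameters are trained (e.g. with `BCEWithLogitsLoss` for BirdSet's multi-label setup) while the encoder stays frozen.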

Citation

@misc{rauch2025audiomae,
      title={Can Masked Autoencoders Also Listen to Birds?}, 
      author={Lukas Rauch and René Heinrich and Ilyass Moummad and Alexis Joly and Bernhard Sick and Christoph Scholz},
      year={2025},
      eprint={2504.12880},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2504.12880}, 
}