Perch Bird Vocalizations
Perch is a bioacoustics model trained to classify nearly 15,000 species and generate audio embeddings that are useful for a variety of downstream applications (such as individual identification or estimating coral reef health). It has been used to detect critically endangered birds and power audio search engines.
The current model (Perch 2.0) is an update to our original Perch model with improved embedding and prediction quality, as well as support for many new (non-avian) taxa. The model was trained on a combination of publicly available audio from Xeno-Canto, iNaturalist, the Animal Sound Archive, and FSD50K. If you like this model, consider recording some interesting audio and contributing it to a public source!
Perch makes predictions for most bird species as well as a variety of frogs, crickets, grasshoppers, and mammals. Note, however, that the output logits for species are uncalibrated and may be unreliable for rare species; we recommend using your own data to tune detection thresholds.
The embeddings were trained with the goal of being linearly separable. In most cases, training a simple linear classifier on top of the model’s outputs works well (see the sketch below). For most bioacoustics applications we recommend an agile modelling (human-annotator-in-the-loop) workflow.
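As a minimal sketch of the linear-classifier approach, assuming you have already embedded a set of labeled 5-second clips (see Example Use below). The data here is a random placeholder, and scikit-learn is an illustrative choice, not a Perch dependency:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder data: one 1536-d Perch embedding per labeled 5-second clip.
embeddings = np.random.randn(200, 1536).astype(np.float32)
labels = np.random.randint(0, 2, size=200)

# A linear probe is usually enough on top of Perch embeddings.
clf = LogisticRegression(max_iter=1000)
clf.fit(embeddings, labels)

# Detection scores for new clips; tune the threshold on your own data.
scores = clf.predict_proba(embeddings)[:, 1]
detections = scores > 0.5
```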
Model Quality
The Perch 2.0 model was evaluated on a variety of tasks and domains, including species classification in avian soundscapes, call type and dialect recognition, individual identification of dogs and bats, and event detection in coral reefs. It achieves state-of-the-art scores on bioacoustics benchmarks such as BirdSet and BEANS. See our paper for more details.
Model Description
Perch 2.0’s embedding model is based on an EfficientNet-B3 architecture with approximately 12 million parameters. The species classification head adds an additional 91 million parameters (due to the large number of classes).
The model outputs 1536-dimensional embeddings. It is also possible to retrieve the embeddings before spatial pooling; these have shape (5, 3, 1536).
Note: This version of the model requires TensorFlow 2.20.0rc0 and a GPU. A CPU variant will be added soon.
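To make the two embedding views concrete, here is a shape-only sketch. Whether (and under what attribute name) the loaded model exposes the un-pooled embeddings depends on the perch-hoplite API, so treat this as illustrative:

```python
import numpy as np

# Pooled embedding: one 1536-d vector per 5-second window.
pooled = np.zeros((1536,), dtype=np.float32)

# Un-pooled embedding, shape (5, 3, 1536); the leading axes are the
# spatial layout of the spectrogram before pooling.
spatial = np.zeros((5, 3, 1536), dtype=np.float32)

# Mean-pooling the spatial axes yields a vector with the pooled shape.
assert spatial.mean(axis=(0, 1)).shape == pooled.shape
```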
Input
- The model consumes 5-second segments of audio sampled at 32 kHz.
- For audio with other sample rates, you can:
  - Resample the audio to 32 kHz (see the sketch after this list).
  - Apply pitch shifting (works well for bats in some cases).
  - Feed the audio at its native sample rate as an array of 160,000 values.
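A minimal resampling sketch, assuming a recording loaded with soundfile (the file name is hypothetical). scipy's resample_poly handles rational rate ratios such as 44100 → 32000 well:

```python
import soundfile as sf
from scipy.signal import resample_poly

# Hypothetical input file; sf.read returns float samples and the native rate.
audio, sr = sf.read('recording.wav', dtype='float32')
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # mix down to mono
if sr != 32000:
    audio = resample_poly(audio, up=32000, down=sr)  # resample to 32 kHz
```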
Outputs
The model produces the following outputs:
- Spectrogram computed from the input audio.
- Embedding: A 1536-dimensional vector.
- Spatial Embedding: Un-pooled embeddings with shape (5, 3, 1536).
- Logit Predictions for ~15,000 classes (of which ~10,000 are birds); a sketch of mapping logits to labels follows this list.
  - The predicted classes are detailed in assets/labels.csv, following the iNaturalist taxonomy.
  - An additional set of conversions to eBird six-letter codes is provided in assets/perch_v2_ebird_classes.csv.
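As a hedged sketch of turning logits into ranked labels, assuming the class list is loaded from assets/labels.csv in model output order (the column name below is a guess; check the file's actual header):

```python
import numpy as np
import pandas as pd

# Hypothetical column name; inspect assets/labels.csv for the real header.
class_names = pd.read_csv('assets/labels.csv')['label'].tolist()

# Placeholder logits standing in for outputs.logits['label'] from the model.
logits = np.random.randn(len(class_names)).astype(np.float32)

# Rank classes by logit; remember these scores are uncalibrated.
for i in np.argsort(logits)[::-1][:5]:
    print(class_names[i], float(logits[i]))
```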
Example Use
```
!pip install git+https://github.com/google-research/perch-hoplite.git
!pip install tensorflow[and-cuda]~=2.20.0rc0
```

```python
import numpy as np

from perch_hoplite.zoo import model_configs

# Input: 5 seconds of silence as mono 32 kHz waveform samples.
waveform = np.zeros(5 * 32000, dtype=np.float32)

# Automatically downloads the model from Kaggle.
model = model_configs.load_model_by_name('perch_v2')

outputs = model.embed(waveform)
# do something with outputs.embeddings and outputs.logits['label']
```
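Building on the example above, a hedged sketch of embedding a longer recording in consecutive 5-second windows. The file name is hypothetical, and perch-hoplite also includes higher-level tooling for storing embeddings at scale:

```python
import numpy as np
import soundfile as sf

audio, sr = sf.read('soundscape.wav', dtype='float32')  # hypothetical file
assert sr == 32000, 'resample first (see Input above)'

# Step through the recording one non-overlapping 5-second window at a time.
window = 5 * 32000
for start in range(0, len(audio) - window + 1, window):
    outputs = model.embed(audio[start:start + window])
    # e.g. store outputs.embeddings for search or classifier training
```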