## Model Description

Hey there! This is `voice-embedder-base`, a model that generates speaker embeddings: compact vectors that capture unique vocal characteristics for tasks like speaker verification, clustering, or voice similarity retrieval. It was built by fine-tuning the `openai/whisper-base` encoder with a contrastive learning approach, using a mix of triplet loss and NT-Xent loss to make the embeddings robust and speaker-discriminative. The model was trained on English speech from Common Voice 17 and evaluated on Common Voice 17 and VoxCeleb2 dev sets; it shines in clean studio settings, though performance drops in noisier environments.
- Developed by: John Backsund
- Model Type: Speaker Embedding
- Base Model: `openai/whisper-base` encoder
- Embedding Size: 256
- Training Data: dataset derived from the Common Voice 17 (en) train split
- License: MIT
## Intended Use

This model is great for:
- Speaker Clustering: Grouping audio samples by speaker (e.g., for diarization).
- Speaker Verification: Checking if two audio clips are from the same speaker.
- Voice Retrieval: Finding similar voices in a dataset.
It’s best for clean audio (like studio recordings) but can handle some noise, though performance drops in very noisy settings (e.g., crowded interviews).
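As a hedged illustration of the clustering use case, the sketch below groups pre-computed embeddings with scikit-learn's `AgglomerativeClustering`. The distance threshold is a made-up value, and the `embeddings` array is a placeholder; in practice it would come from the model as shown in the How to Use section below.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# embeddings: [num_clips, 256] array produced by the model (see "How to Use" below).
embeddings = np.random.randn(10, 256).astype(np.float32)  # placeholder data

# Agglomerative clustering on L2 distances; the threshold (1.0) is illustrative only
# and should be tuned on your own data (same-speaker pairs tend to sit closer together
# than different-speaker pairs, per the Performance section).
clustering = AgglomerativeClustering(
    n_clusters=None,
    distance_threshold=1.0,
    metric="euclidean",
    linkage="average",
)
speaker_labels = clustering.fit_predict(embeddings)
print(speaker_labels)  # one cluster id per clip
```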
## How to Use

To use this model, you'll need the `voice-finder` library, which includes the custom `VoiceEmbedder` and `VoiceEmbedderFeatureExtractor` classes. Install it from GitHub, then load the model and processor with `transformers`.

```bash
pip install git+https://github.com/johBac97/voice-finder.git
```
```python
from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("johbac/voice-embedder-base")
model = AutoModel.from_pretrained("johbac/voice-embedder-base")

# Example: process audio and get embeddings
audio = [...]  # List of audio arrays sampled at 16 kHz
features = processor(audio, sampling_rate=16000, return_tensors="pt")
embeddings = model(**features)  # Shape: [batch_size, 256]
```
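As a hedged follow-up, one way to use these embeddings for speaker verification is to compare the L2 distance between two clips against a threshold. The sketch below assumes `embeddings` holds the output of the snippet above for two clips; the 0.9 cutoff is purely illustrative (it falls between the average same-speaker and different-speaker distances reported in the Performance section) and should be calibrated on your own data.

```python
import torch

# embeddings: [2, 256] tensor for two audio clips, produced by the snippet above.
emb_a, emb_b = embeddings[0], embeddings[1]

# L2 distance between the two speaker embeddings.
distance = torch.linalg.vector_norm(emb_a - emb_b).item()

# Illustrative threshold only: same-speaker pairs averaged ~0.55 and different-speaker
# pairs ~1.36 on the Common Voice dev set, so ~0.9 is a rough starting point.
THRESHOLD = 0.9
same_speaker = distance < THRESHOLD
print(f"L2 distance: {distance:.3f} -> same speaker: {same_speaker}")
```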
## Training Details

- Architecture: Whisper encoder + MLP projector (512 → 256 dims).
- Loss: combined hard-mining triplet loss (supervised) + NT-Xent loss (see the sketch after this list).
- Augmentations: Gaussian noise (for NT-Xent).
- Validation Datasets:
  - Common Voice 17 (en) derived: 1,257 speakers (train); 6,270 samples from 2,090 speakers (dev).
  - VoxCeleb2 (en) derived: 12,756 noisy samples from 4,252 speakers (dev, filtered for English).
- Preprocessing: Audio resampled to 16 kHz, processed with the Whisper feature extractor, and stored in Zarr archives.
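As a hedged illustration of the combined objective (not the exact training code), the sketch below pairs a batch-hard triplet loss with an NT-Xent term over Gaussian-noise-augmented views; the margin, temperature, and 1:1 weighting are assumed values.

```python
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Hard-mining triplet loss: for each anchor, take the farthest positive
    and the closest negative within the batch (margin is an assumed value)."""
    dists = torch.cdist(embeddings, embeddings)           # [B, B] pairwise L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)     # [B, B] same-speaker mask
    eye = torch.eye(len(labels), dtype=torch.bool, device=embeddings.device)
    hardest_pos = (dists * (same & ~eye)).max(dim=1).values
    hardest_neg = dists.masked_fill(same, float("inf")).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent over two augmented views of the same clips (temperature is assumed)."""
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)    # [2B, D]
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float("-inf"))                     # ignore self-similarity
    batch = z1.shape[0]
    targets = torch.cat([torch.arange(batch, 2 * batch), torch.arange(batch)]).to(z.device)
    return F.cross_entropy(sim, targets)

# Combined objective (equal weighting is an assumption):
# clean_emb, noisy_emb = model(clean_batch), model(add_gaussian_noise(clean_batch))
# loss = batch_hard_triplet_loss(clean_emb, speaker_ids) + nt_xent_loss(clean_emb, noisy_emb)
```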
See the full report for details.
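For the preprocessing step listed above (resampling to 16 kHz, Whisper feature extraction, Zarr storage), here is a rough, hedged sketch of what such a pipeline could look like; the file names, array layout, and use of `zarr.save` are assumptions, not the exact pipeline used for training.

```python
import numpy as np
import zarr
import librosa
from transformers import WhisperFeatureExtractor

# Whisper feature extractor from the base checkpoint (produces log-mel spectrograms).
feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-base")

def preprocess_clip(path: str) -> np.ndarray:
    # Load and resample to 16 kHz, as expected by the Whisper feature extractor.
    audio, _ = librosa.load(path, sr=16000)
    features = feature_extractor(audio, sampling_rate=16000, return_tensors="np")
    return features["input_features"][0]  # [n_mels, n_frames]

# Store the preprocessed features in a Zarr archive (hypothetical file names and layout).
paths = ["clip_0.wav", "clip_1.wav"]
mels = np.stack([preprocess_clip(p) for p in paths])
zarr.save("features.zarr", input_features=mels)
```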
## Performance

Here's how the model performs on the dev sets:

| Dataset | Top-1 Accuracy | Top-5 Accuracy | Equal Error Rate | Avg Same-Speaker L2 Dist | Avg Diff-Speaker L2 Dist |
|---|---|---|---|---|---|
| Common Voice 17 (en) | 94.13% | 98.17% | 1.05% | 0.5456 | 1.3617 |
| VoxCeleb2 (en) | 14.21% | 22.87% | 18.20% | 0.8152 | 1.1514 |
- Strengths: Nails speaker identification in clean audio (Common Voice), with high accuracy and low EER.
- Weaknesses: Struggles with noisy audio (VoxCeleb2) due to limited training on real-world noise. Top-5 accuracy (~23%) is still far better than random (0.1%).
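For reference, here is a hedged sketch of how an equal error rate like the one reported above can be computed from verification scores using scikit-learn's ROC curve; the labels and scores below are dummy values.

```python
import numpy as np
from sklearn.metrics import roc_curve

# labels: 1 = same-speaker pair, 0 = different-speaker pair (dummy values).
labels = np.array([1, 1, 0, 0, 1, 0])
# scores: higher means "more likely same speaker", e.g. negative L2 distance.
scores = np.array([-0.5, -0.6, -1.3, -1.1, -0.7, -1.4])

fpr, tpr, _ = roc_curve(labels, scores)
fnr = 1 - tpr
# The EER is the operating point where false positive rate equals false negative rate.
eer_index = np.nanargmin(np.abs(fpr - fnr))
eer = (fpr[eer_index] + fnr[eer_index]) / 2
print(f"EER: {eer:.2%}")
```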
## Limitations
- Performs best on clean, studio-quality audio. Noisy environments (e.g., street interviews) reduce accuracy.
- Trained only on English speech, so performance on other languages is untested.
- Self-supervised loss (NT-Xent) didn’t boost performance as expected, possibly due to augmentations not matching real-world noise.
## Future Improvements
- Add augmentations for real-world noise (e.g., city sounds, background voices).
- Train on more diverse, noisy datasets to improve robustness.
## Citation

Inspired by *Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings*.