Model Description

Hey there! This is voice-embedder-base, a model that generates speaker embeddings: compact vectors that capture unique vocal characteristics for tasks like speaker verification, clustering, or voice similarity retrieval. It's built by fine-tuning the openai/whisper-base encoder with a contrastive learning approach, using a mix of triplet loss and NT-Xent loss to make the embeddings robust and speaker-discriminative. It was trained on English speech derived from Common Voice 17 and evaluated on dev sets derived from Common Voice 17 and VoxCeleb2; it shines in clean studio settings, while accuracy drops in noisier environments.

  • Developed by: John Backsund
  • Model Type: Speaker Embedding
  • Base Model: openai/whisper-base encoder
  • Embedding Size: 256
  • Parameters: 20.5M (F32)
  • Training Data: dataset derived from the Common Voice 17 (en) train split
  • License: MIT

Intended Use

This model is great for:

  • Speaker Clustering: Grouping audio samples by speaker (e.g., for diarization).
  • Speaker Verification: Checking if two audio clips are from the same speaker.
  • Voice Retrieval: Finding similar voices in a dataset.

It’s best for clean audio (like studio recordings) but can handle some noise, though performance drops in very noisy settings (e.g., crowded interviews).
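As a concrete example of the clustering use case, here is a minimal sketch that groups clips by speaker with agglomerative clustering. It assumes you already have an [N, 256] array of embeddings (see "How to Use" below); scikit-learn and the distance_threshold value are illustrative choices, not part of this model.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

embeddings = np.random.randn(10, 256).astype(np.float32)  # stand-in for real embeddings

clusterer = AgglomerativeClustering(
    n_clusters=None,          # let the distance threshold decide the number of speakers
    metric="euclidean",       # requires scikit-learn >= 1.2 for the `metric` argument
    linkage="average",
    distance_threshold=0.9,   # illustrative value; tune on your own data
)
speaker_ids = clusterer.fit_predict(embeddings)  # one integer cluster label per clip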

How to Use

To use this model, you’ll need the voice-finder library, which includes the custom VoiceEmbedder and VoiceEmbedderFeatureExtractor classes. Install it from GitHub, then load the model and processor with transformers.

pip install git+https://github.com/johBac97/voice-finder.git

from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("johbac/voice-embedder-base")
model = AutoModel.from_pretrained("johbac/voice-embedder-base")

# Example: process audio and get embeddings
audio = [...]  # list of 1-D audio arrays sampled at 16 kHz
features = processor(audio, sampling_rate=16000, return_tensors="pt")
embeddings = model(**features)  # shape: [batch_size, 256]
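For speaker verification, you can compare two embeddings by their L2 distance. Here is a minimal sketch, assuming the model returns a [batch_size, 256] tensor as above; the 0.9 threshold is only illustrative (it sits between the average same-speaker and different-speaker distances reported in the Performance section) and should be tuned on your own data.

import torch

def same_speaker(emb_a, emb_b, threshold=0.9):
    # Returns True if the two 256-dim embeddings are closer than the threshold.
    distance = torch.linalg.vector_norm(emb_a - emb_b).item()
    return distance < threshold

# Hypothetical usage, given the `embeddings` tensor from the snippet above:
# same_speaker(embeddings[0], embeddings[1])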

Training Details

  • Architecture: Whisper encoder + MLP projector (512 → 256 dims).
  • Loss: Combined hard-mining triplet loss (supervised) + NT-Xent loss (self-supervised); a rough sketch of this combination appears after this list.
  • Augmentations: Gaussian noise (for NT-Xent).
  • Datasets:
      • Common Voice 17 (en) derived: 1,257 speakers (train); 6,270 samples, 2,090 speakers (dev).
      • VoxCeleb2 (en) derived: 12,756 noisy samples, 4,252 speakers (dev, filtered for English).
  • Preprocessing: Audio resampled to 16kHz, processed with Whisper Feature Extractor, stored in Zarr archives.
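For intuition, here is a rough, simplified sketch of how such a combined objective can be put together, assuming per-clip speaker labels for the triplet term and a clean/noisy pair of embeddings per clip for the NT-Xent term. The function names, margin, temperature, and equal weighting are illustrative assumptions, not the exact recipe used to train this model.

import torch
import torch.nn.functional as F

def hard_mining_triplet_loss(emb, labels, margin=0.3):
    # Batch-hard triplet loss: hardest positive and hardest negative per anchor.
    dist = torch.cdist(emb, emb)                        # [B, B] pairwise L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # [B, B] same-speaker mask
    not_self = ~torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    hardest_pos = (dist * (same & not_self).float()).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()

def nt_xent_loss(view_a, view_b, temperature=0.1):
    # NT-Xent over two views of each clip (e.g., clean audio vs. Gaussian-noise audio).
    z = F.normalize(torch.cat([view_a, view_b]), dim=1)   # [2B, D], unit-length rows
    sim = z @ z.T / temperature                            # scaled cosine similarities
    mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))             # ignore self-similarity
    b = view_a.size(0)
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)]).to(z.device)
    return F.cross_entropy(sim, targets)                   # positive = the other view

# total_loss = hard_mining_triplet_loss(emb, speaker_ids) + nt_xent_loss(emb_clean, emb_noisy)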

See the full report for details.

Performance

Here’s how the model performs on the dev sets:

Dataset                Top-1 Accuracy   Top-5 Accuracy   Equal Error Rate   Avg Same L2 Dist   Avg Diff L2 Dist
Common Voice 17 (en)   94.13%           98.17%           1.05%              0.5456             1.3617
VoxCeleb2 (en)         14.21%           22.87%           18.20%             0.8152             1.1514
  • Strengths: Nails speaker identification in clean audio (Common Voice), with high accuracy and low EER.
  • Weaknesses: Struggles with noisy audio (VoxCeleb2) due to limited training on real-world noise. Top-5 accuracy (23%) is still way better than random (0.1%).
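For reference, the equal error rate (EER) above is the operating point where the false accept rate equals the false reject rate. Below is a minimal sketch of how it can be computed from verification trials, assuming an array of similarity scores (higher = more likely the same speaker, e.g., negative L2 distance) and binary labels; this is a simple approximation, not the exact evaluation script behind the numbers above.

import numpy as np

def equal_error_rate(scores, labels):
    # scores: higher means "more likely same speaker"; labels: 1 = same, 0 = different.
    scores, labels = np.asarray(scores), np.asarray(labels)
    fars, frrs = [], []
    for t in np.sort(np.unique(scores)):
        accept = scores >= t
        fars.append(np.mean(accept[labels == 0]))    # false accept rate at threshold t
        frrs.append(np.mean(~accept[labels == 1]))   # false reject rate at threshold t
    idx = int(np.argmin(np.abs(np.array(fars) - np.array(frrs))))
    return (fars[idx] + frrs[idx]) / 2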

Limitations

  • Performs best on clean, studio-quality audio. Noisy environments (e.g., street interviews) reduce accuracy.
  • Trained only on English speech, so performance on other languages is untested.
  • Self-supervised loss (NT-Xent) didn’t boost performance as expected, possibly due to augmentations not matching real-world noise.

Future Improvements

  • Add augmentations for real-world noise (e.g., city sounds, background voices).
  • Train on more diverse, noisy datasets to improve robustness.

Citation

Inspired by the paper "Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings".
