Model Description

Hey there! This is voice-embedder-base, a model that generates speaker embeddings: compact vectors that capture unique vocal characteristics for tasks like speaker verification, clustering, or voice similarity retrieval. It's built by fine-tuning the openai/whisper-base encoder with a contrastive learning approach, using a mix of triplet loss and NT-Xent loss to make the embeddings robust and speaker-discriminative. It was trained on English speech derived from Common Voice 17 and evaluated on dev sets derived from Common Voice 17 and VoxCeleb2; it shines in clean studio settings, while accuracy drops in noisier environments.

  • Developed by: John Backsund
  • Model Type: Speaker Embedding
  • Base Model: openai/whisper-base encoder
  • Embedding Size: 256
  • Parameters: 20.5M (F32)
  • Training Data: dataset derived from the Common Voice 17 (en) train split
  • License: MIT

Intended Use

This model is great for:

  • Speaker Clustering: Grouping audio samples by speaker (e.g., for diarization).
  • Speaker Verification: Checking if two audio clips are from the same speaker.
  • Voice Retrieval: Finding similar voices in a dataset.

It’s best for clean audio (like studio recordings) but can handle some noise, though performance drops in very noisy settings (e.g., crowded interviews).
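As a concrete example of the clustering use case, here is a minimal sketch that groups clips by speaker with agglomerative clustering. It assumes you already have an [N, 256] array of embeddings (see "How to Use" below); scikit-learn and the distance_threshold value are illustrative choices, not part of this model.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

embeddings = np.random.randn(10, 256).astype(np.float32)  # stand-in for real embeddings

clusterer = AgglomerativeClustering(
    n_clusters=None,          # let the distance threshold decide the number of speakers
    metric="euclidean",       # requires scikit-learn >= 1.2 for the `metric` argument
    linkage="average",
    distance_threshold=0.9,   # illustrative value; tune on your own data
)
speaker_ids = clusterer.fit_predict(embeddings)  # one integer cluster label per clip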

How to Use

To use this model, you’ll need the voice-finder library, which includes the custom VoiceEmbedder and VoiceEmbedderFeatureExtractor classes. Install it from GitHub, then load the model and processor with transformers.

pip install git+https://github.com/johBac97/voice-finder.git

from transformers import AutoProcessor, AutoModel

processor = AutoProcessor.from_pretrained("johbac/voice-embedder-base")
model = AutoModel.from_pretrained("johbac/voice-embedder-base")

# Example: process audio and get embeddings
audio = [...]  # list of 1-D audio arrays sampled at 16 kHz
features = processor(audio, sampling_rate=16000, return_tensors="pt")
embeddings = model(**features)  # shape: [batch_size, 256]
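For speaker verification, you can compare two embeddings by their L2 distance. Here is a minimal sketch, assuming the model returns a [batch_size, 256] tensor as above; the 0.9 threshold is only illustrative (it sits between the average same-speaker and different-speaker distances reported in the Performance section) and should be tuned on your own data.

import torch

def same_speaker(emb_a, emb_b, threshold=0.9):
    # Returns True if the two 256-dim embeddings are closer than the threshold.
    distance = torch.linalg.vector_norm(emb_a - emb_b).item()
    return distance < threshold

# Hypothetical usage, given the `embeddings` tensor from the snippet above:
# same_speaker(embeddings[0], embeddings[1])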

Training Details

  • Architecture: Whisper encoder + MLP projector (512 → 256 dims).
  • Loss: Combined hard-mining triplet loss (supervised) + NT-Xent loss (self-supervised); a rough sketch of this combination appears after this list.
  • Augmentations: Gaussian noise (for NT-Xent).
  • Datasets:
      • Common Voice 17 (en) derived: 1,257 speakers (train); 6,270 samples, 2,090 speakers (dev).
      • VoxCeleb2 (en) derived: 12,756 noisy samples, 4,252 speakers (dev, filtered for English).
  • Preprocessing: Audio resampled to 16kHz, processed with Whisper Feature Extractor, stored in Zarr archives.
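For intuition, here is a rough, simplified sketch of how such a combined objective can be put together, assuming per-clip speaker labels for the triplet term and a clean/noisy pair of embeddings per clip for the NT-Xent term. The function names, margin, temperature, and equal weighting are illustrative assumptions, not the exact recipe used to train this model.

import torch
import torch.nn.functional as F

def hard_mining_triplet_loss(emb, labels, margin=0.3):
    # Batch-hard triplet loss: hardest positive and hardest negative per anchor.
    dist = torch.cdist(emb, emb)                        # [B, B] pairwise L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # [B, B] same-speaker mask
    not_self = ~torch.eye(len(labels), dtype=torch.bool, device=emb.device)
    hardest_pos = (dist * (same & not_self).float()).max(dim=1).values
    hardest_neg = dist.masked_fill(same, float("inf")).min(dim=1).values
    return F.relu(hardest_pos - hardest_neg + margin).mean()

def nt_xent_loss(view_a, view_b, temperature=0.1):
    # NT-Xent over two views of each clip (e.g., clean audio vs. Gaussian-noise audio).
    z = F.normalize(torch.cat([view_a, view_b]), dim=1)   # [2B, D], unit-length rows
    sim = z @ z.T / temperature                            # scaled cosine similarities
    mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))             # ignore self-similarity
    b = view_a.size(0)
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)]).to(z.device)
    return F.cross_entropy(sim, targets)                   # positive = the other view

# total_loss = hard_mining_triplet_loss(emb, speaker_ids) + nt_xent_loss(emb_clean, emb_noisy)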

See the full report for details.

Performance

Here’s how the model performs on the dev sets:

Dataset                Top-1 Accuracy   Top-5 Accuracy   Equal Error Rate   Avg Same L2 Dist   Avg Diff L2 Dist
Common Voice 17 (en)   94.13%           98.17%           1.05%              0.5456             1.3617
VoxCeleb2 (en)         14.21%           22.87%           18.20%             0.8152             1.1514
  • Strengths: Nails speaker identification in clean audio (Common Voice), with high accuracy and low EER.
  • Weaknesses: Struggles with noisy audio (VoxCeleb2) due to limited training on real-world noise. Top-5 accuracy (23%) is still way better than random (0.1%).
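For reference, the equal error rate (EER) above is the operating point where the false accept rate equals the false reject rate. Below is a minimal sketch of how it can be computed from verification trials, assuming an array of similarity scores (higher = more likely the same speaker, e.g., negative L2 distance) and binary labels; this is a simple approximation, not the exact evaluation script behind the numbers above.

import numpy as np

def equal_error_rate(scores, labels):
    # scores: higher means "more likely same speaker"; labels: 1 = same, 0 = different.
    scores, labels = np.asarray(scores), np.asarray(labels)
    fars, frrs = [], []
    for t in np.sort(np.unique(scores)):
        accept = scores >= t
        fars.append(np.mean(accept[labels == 0]))    # false accept rate at threshold t
        frrs.append(np.mean(~accept[labels == 1]))   # false reject rate at threshold t
    idx = int(np.argmin(np.abs(np.array(fars) - np.array(frrs))))
    return (fars[idx] + frrs[idx]) / 2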

Limitations

  • Performs best on clean, studio-quality audio. Noisy environments (e.g., street interviews) reduce accuracy.
  • Trained only on English speech, so performance on other languages is untested.
  • Self-supervised loss (NT-Xent) didn’t boost performance as expected, possibly due to augmentations not matching real-world noise.

Future Improvements

  • Add augmentations for real-world noise (e.g., city sounds, background voices).
  • Train on more diverse, noisy datasets to improve robustness.

Citation

Inspired by the paper "Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings".
