Configuration

This model card outlines the setup of a speaker diarization model fine-tuned on synthetic medical audio data.

Before starting, please ensure the requirements are met:

Install pyannote.audio 3.1 with pip install pyannote.audio
Accept pyannote/segmentation-3.0 user conditions
Accept pyannote/speaker-diarization-3.1 user conditions
Create an access token at hf.co/settings/tokens
Download the pytorch_model.bin and config.yaml files into a local directory
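
Because the later steps read both files from one directory, a quick check up front can save a confusing error further on. The following is a minimal sketch using only the standard library; the helper name `missing_model_files` is hypothetical and not part of pyannote.audio.

```python
from pathlib import Path

# File names taken from the requirements list above.
REQUIRED_FILES = ("pytorch_model.bin", "config.yaml")


def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the required file names that are absent from model_dir."""
    directory = Path(model_dir)
    return [name for name in required if not (directory / name).is_file()]


if __name__ == "__main__":
    # "models/pyannote_sd_normal" is the directory used in the Usage section.
    missing = missing_model_files("models/pyannote_sd_normal")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("All model files found.")
```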

Usage

Load trained segmentation model

import torch
from pyannote.audio import Model

# Load the original architecture. use_auth_token=True reuses the token saved by
# `huggingface-cli login`; alternatively, pass your own access token as a string.
model = Model.from_pretrained("pyannote/segmentation-3.0", use_auth_token=True)

# Path to the downloaded pytorch model
model_path = "models/pyannote_sd_normal"

# Load fine-tuned weights from the pytorch_model.bin file
# (map_location="cpu" lets the checkpoint load on machines without a GPU)
model.load_state_dict(torch.load(model_path + "/pytorch_model.bin", map_location="cpu"))
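
The `use_auth_token` argument also accepts the token itself as a string, which is handy on machines where no login has been saved. As a hedged sketch, the token could be read from an environment variable rather than pasted into the script; the helper name `read_hf_token` and the choice of the `HF_TOKEN` variable are assumptions made for illustration.

```python
import os


def read_hf_token(env_var="HF_TOKEN", environ=None):
    """Return the access token stored in env_var, or raise if it is unset."""
    # `environ` can be swapped for a plain dict, which keeps the helper testable.
    environ = os.environ if environ is None else environ
    token = environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set the {env_var} environment variable to your access token "
            "from hf.co/settings/tokens"
        )
    return token
```

The result can then be passed as `use_auth_token=read_hf_token()` in place of `use_auth_token=True`.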

Load fine-tuned speaker diarization pipeline

from pyannote.audio import Pipeline
from pyannote.metrics.diarization import DiarizationErrorRate
from pyannote.audio.pipelines import SpeakerDiarization

# Initialize the pretrained pyannote pipeline. use_auth_token=True reuses the token saved by
# `huggingface-cli login`; alternatively, pass your own access token as a string.
pretrained_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token=True)

finetuned_pipeline = SpeakerDiarization(
    segmentation=model,
    embedding=pretrained_pipeline.embedding,
    embedding_exclude_overlap=pretrained_pipeline.embedding_exclude_overlap,
    # `klustering` is not a typo: the pipeline stores its clustering method name under this attribute
    clustering=pretrained_pipeline.klustering,
)

# Load fine-tuned params into the pipeline
finetuned_pipeline.load_params(model_path + "/config.yaml")

GPU usage

if torch.cuda.is_available():
    gpu = torch.device("cuda")
    finetuned_pipeline.to(gpu)
    print("gpu: ", torch.cuda.get_device_name(gpu))

Visualise diarization output

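# Run the pipeline on an audio file
diarization = finetuned_pipeline("path/to/audio.wav")

# In a Jupyter notebook, leaving the result as the last expression of a cell renders it as a timeline
diarization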

View speaker turns, speaker ID, and time

for speech_turn, track, speaker in diarization.itertracks(yield_label=True):
    print(f"{speech_turn.start:4.1f} {speech_turn.end:4.1f} {speaker}")
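
The printed turns can also be aggregated. As an illustration, the sketch below sums speaking time per speaker from plain `(start, end, speaker)` tuples; the helper name `speaking_time` and the sample turns are made up for the example and are not output from the model.

```python
from collections import defaultdict


def speaking_time(turns):
    """Sum the duration in seconds of each speaker's turns.

    turns: iterable of (start, end, speaker) tuples, with times in seconds.
    Each turn's full duration is credited to its speaker, so overlapped
    speech counts toward every speaker involved.
    """
    totals = defaultdict(float)
    for start, end, speaker in turns:
        totals[speaker] += end - start
    return dict(totals)


if __name__ == "__main__":
    # Made-up turns, shaped like the values printed by the loop above.
    example_turns = [
        (0.0, 2.5, "SPEAKER_00"),
        (2.5, 4.0, "SPEAKER_01"),
        (4.0, 7.0, "SPEAKER_00"),
    ]
    for speaker, seconds in sorted(speaking_time(example_turns).items()):
        print(f"{speaker}: {seconds:.1f} s")
```

With a real result, the tuples could be built as `[(turn.start, turn.end, speaker) for turn, _, speaker in diarization.itertracks(yield_label=True)]`.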

Citations

@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}

@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}