---
license: mit
library_name: transformers
pipeline_tag: voice-activity-detection
tags:
  - speaker
  - speaker-diarization
  - meeting
  - wavlm
  - wespeaker
  - diarizen
  - pyannote
  - pyannote-audio-pipeline
---

## Overview

This repository provides the pre-trained model from DiariZen, as described in *BUT System for the MLC-SLM Challenge*. The EEND component is built on WavLM-Large and Conformer layers. The model was pre-trained on far-field, single-channel audio from a diverse set of public datasets: AMI, AISHELL-4, AliMeeting, NOTSOFAR-1, MSDWild, DIHARD3, RAMC, and VoxConverse. Structured pruning at 80% sparsity was then applied, and the pruned model was fine-tuned on MLC-SLM data.

## Usage

The pipeline requires the `diarizen` package from the DiariZen repository.

```python
from diarizen.pipelines.inference import DiariZenPipeline

# load the pre-trained model from the Hugging Face Hub
diar_pipeline = DiariZenPipeline.from_pretrained("BUT-FIT/diarizen-wavlm-large-s80-mlc")
# apply the diarization pipeline to an audio file
diar_results = diar_pipeline('audio.wav')

# print the speaker turns
for turn, _, speaker in diar_results.itertracks(yield_label=True):
    print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")

# load the pre-trained model and save the result as an RTTM file
diar_pipeline = DiariZenPipeline.from_pretrained(
    "BUT-FIT/diarizen-wavlm-large-s80-mlc",
    rttm_out_dir='.'
)
# apply the diarization pipeline and write the RTTM to rttm_out_dir
diar_results = diar_pipeline('audio.wav', sess_name='session_name')
```
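
Since the result exposes speaker turns through `itertracks` as shown above, simple statistics follow directly. Below is a minimal sketch (assuming only the access pattern already used in this card) that totals speaking time per speaker:

```python
from collections import defaultdict

# accumulate total speaking time per speaker label
speaking_time = defaultdict(float)
for turn, _, speaker in diar_results.itertracks(yield_label=True):
    speaking_time[speaker] += turn.end - turn.start

for speaker, seconds in sorted(speaking_time.items()):
    print(f"speaker_{speaker}: {seconds:.1f}s total")
```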

## Results

DER (%) of the Pyannote baseline and DiariZen; no collar is applied.
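
For reference, DER accumulates false alarm, missed speech, and speaker confusion time over the total amount of reference speech:

$$\text{DER} = \frac{T_{\text{false alarm}} + T_{\text{missed}} + T_{\text{confusion}}}{T_{\text{total speech}}}$$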

| Dataset | Pyannote | DiariZen |
|---|---|---|
| English-American | 20.18 | 15.88 |
| English-Australian | 13.76 | 10.82 |
| English-British | 18.85 | 12.07 |
| English-Filipino | 13.19 | 10.28 |
| English-Indian | 8.19 | 6.04 |
| French | 22.62 | 17.33 |
| German | 22.33 | 16.35 |
| Italian | 10.64 | 8.85 |
| Japanese | 26.46 | 17.81 |
| Korean | 23.25 | 16.36 |
| Portuguese | 17.60 | 14.77 |
| Russian | 11.37 | 9.99 |
| Spanish | 12.92 | 10.82 |
| Thai | 10.90 | 10.62 |
| Vietnamese | 14.64 | 12.69 |
| **Average** | **16.44** | **12.71** |