---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- fine-tuned
- mlc-slm
- speaker-diarization
- meeting-transcription
- DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
---
# DiCoW_v3_MLC — BUT-FIT Model for MLC-SLM Challenge
This repository contains the DiCoW_v3_MLC model developed by BUT Speech@FIT for the MLC-SLM Challenge. Diarization-Conditioned Whisper (DiCoW) is a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information.
The model is described in detail in the following papers:
- 📰 Journal paper (main DiCoW paper): DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition
- 📰 ICASSP paper (initial DiCoW experiments): Target Speaker ASR with Whisper
- 📰 MLC-SLM Challenge submission paper: BUT System for the MLC-SLM Challenge
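To give an intuition for the conditioning, here is a minimal, illustrative sketch of turning diarization output into per-frame silence/target/non-target/overlap (STNO) masks, the kind of signal DiCoW conditions the encoder on. The function name, the 20 ms frame hop, and the hard 0/1 masks are assumptions for illustration only; see the papers and the DiCoW GitHub for the actual formulation.

```python
import numpy as np

def stno_mask(segments, target, num_frames, frame_hop=0.02):
    """Turn diarization segments [(speaker, start_s, end_s), ...] into a
    per-frame silence/target/non-target/overlap (STNO) mask.

    Illustrative sketch only -- the released DiCoW code derives these
    masks differently; names and the frame rate here are assumptions."""
    speakers = sorted({spk for spk, _, _ in segments})
    # Per-speaker binary activity at each frame
    activity = np.zeros((len(speakers), num_frames), dtype=bool)
    for spk, start, end in segments:
        i = speakers.index(spk)
        activity[i, int(start / frame_hop):int(np.ceil(end / frame_hop))] = True

    tgt = activity[speakers.index(target)]
    others = np.delete(activity, speakers.index(target), axis=0).any(axis=0)

    mask = np.zeros((num_frames, 4))   # columns: silence, target, non-target, overlap
    mask[~tgt & ~others, 0] = 1.0      # nobody speaks
    mask[tgt & ~others, 1] = 1.0      # target speaker alone
    mask[~tgt & others, 2] = 1.0      # other speaker(s) only
    mask[tgt & others, 3] = 1.0      # target overlapped by others
    return mask

# Target speaker A talks 0-1 s; speaker B talks 0.5-1.5 s; 2 s of audio.
m = stno_mask([("A", 0.0, 1.0), ("B", 0.5, 1.5)], target="A", num_frames=100)
```

Each frame gets exactly one of the four labels, so the mask rows form a one-hot distribution that can be injected into the acoustic model.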
## Model Summary
The model is based on Whisper large-v3-turbo, initially trained on:
- NOTSOFAR-1
- AMI Meeting Corpus
- Libri2Mix dataset
It is then fine-tuned on the MLC-SLM dataset as part of the MLC-SLM Challenge.
## Model Details
- Developed by: BUT Speech@FIT, Brno University of Technology
- Model type: Whisper large-v3-turbo + DiCoW composition
- Language(s): Multilingual (primarily English, but supports multiple languages)
- License: apache-2.0
- Fine-tuned from: openai/whisper-large-v3-turbo
- Challenge: MLC-SLM (Multilingual Conversational Speech Language Model)
## Model Sources
- Training Code: TS-ASR-Whisper GitHub
- Inference Code & DiCoW framework: DiCoW GitHub
## Getting Started
```python
from transformers import AutoModelForSpeechSeq2Seq

# The checkpoint ships custom DiCoW code, hence trust_remote_code=True
MODEL_NAME = "BUT-FIT/DiCoW_v3_MLC"
dicow = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```
For detailed inference and full pipelines, refer to: 👉 DiCoW GitHub inference repo
## tcpWER/CER (%) on the MLC-SLM development set
| Language | Baseline (GT) | DiCoW (GT) | FT (GT) | Baseline (Real diar) | DiCoW (Real diar) | FT (Real diar) |
|---|---|---|---|---|---|---|
| American En. | 14.1 | 20.6 | 11.1 | 53.7 | 36.5 | 22.5 |
| Australian En. | 11.7 | 19.4 | 7.4 | 52.6 | 23.6 | 13.0 |
| British En. | 10.1 | 16.7 | 7.7 | 71.9 | 26.1 | 17.6 |
| Filipino En. | 9.2 | 17.7 | 7.5 | 50.4 | 25.5 | 15.2 |
| Indian En. | 14.0 | 14.3 | 13.3 | 70.7 | 14.9 | 14.0 |
| French | 28.1 | 27.7 | 16.1 | 96.0 | 37.8 | 27.5 |
| German | 20.7 | 21.2 | 23.9 | 86.7 | 30.1 | 27.3 |
| Italian | 17.9 | 16.2 | 12.3 | 83.3 | 19.8 | 16.4 |
| Japanese (\*) | 21.6 | 19.2 | 13.7 | 71.3 | 25.8 | 23.3 |
| Korean (\*) | 13.8 | 12.8 | 8.5 | 59.6 | 24.5 | 22.8 |
| Portuguese | 21.2 | 24.5 | 19.5 | 118.8 | 33.1 | 29.7 |
| Russian | 17.7 | 17.6 | 11.6 | 69.2 | 22.5 | 16.7 |
| Spanish | 12.3 | 11.6 | 8.7 | 75.6 | 18.2 | 16.3 |
| Thai (\*) | 14.5 | 31.9 | 14.2 | 83.6 | 34.4 | 20.1 |
| Vietnamese | 27.2 | 30.0 | 15.3 | 82.8 | 33.8 | 24.7 |
| **Overall** | 16.8 | 22.0 | 12.9 | 76.1 | 28.4 | 20.8 |
*Results marked with an asterisk (\*) are reported using tcpCER, following the official evaluation protocol.*
Notes:
- GT = Ground-Truth Segmentation
- Real diar = Real Diarization
- Baseline uses Whisper large-v3 with chunked inference + fine-tuned Pyannote diarization.
- DiCoW uses fine-tuned DiariZen diarization.
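For intuition about the metric above: tcpWER scores multi-speaker transcripts by matching hypothesis speaker streams to reference streams under time constraints. The simpler, unconstrained cpWER variant can be sketched as below; the time-constrained version additionally requires matched words to be close in time. This toy implementation is illustrative only (it assumes equally many reference and hypothesis streams) and is not the official scoring code.

```python
from itertools import permutations

def edit_distance(ref, hyp):
    # Standard Levenshtein distance over word lists (rolling 1-D table).
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + (r != h)) # substitution / match
    return d[-1]

def cp_wer(refs, hyps):
    """Concatenated-minimum-permutation WER: try every assignment of
    hypothesis streams to reference streams and keep the cheapest one,
    then divide total errors by the total number of reference words."""
    n_ref_words = sum(len(r) for r in refs)
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / n_ref_words

# Streams are swapped relative to the reference; the permutation search
# still matches them correctly, leaving one substitution ("word").
refs = [["hello", "world"], ["good", "bye"]]
hyps = [["good", "bye"], ["hello", "word"]]
print(cp_wer(refs, hyps))  # 1 error / 4 reference words = 0.25
```

The permutation search is what makes the metric independent of the order in which diarization labels the speakers.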
## Citation
If you use this model, please cite:
```bibtex
@article{POLOK2026101841,
  title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
  journal = {Computer Speech & Language},
  volume = {95},
  pages = {101841},
  year = {2026},
  issn = {0885-2308},
  doi = {10.1016/j.csl.2025.101841},
  url = {https://www.sciencedirect.com/science/article/pii/S088523082500066X},
  author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
  keywords = {Diarization-conditioned Whisper, Target-speaker ASR, Speaker diarization, Long-form ASR, Whisper adaptation},
}

@inproceedings{10887683,
  author = {Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
  booktitle = {ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title = {Target Speaker ASR with Whisper},
  year = {2025},
  pages = {1-5},
  keywords = {Transforms;Signal processing;Transformers;Acoustics;Speech processing;target-speaker ASR;diarization conditioning;multi-speaker ASR;Whisper},
  doi = {10.1109/ICASSP49660.2025.10887683},
}

@misc{polok2025mlcslmchallenge,
  title = {BUT System for the MLC-SLM Challenge},
  author = {Alexander Polok and Jiangyu Han and Dominik Klement and Samuele Cornell and Jan Černocký and Lukáš Burget},
  year = {2025},
  eprint = {2506.13414},
  archivePrefix = {arXiv},
  primaryClass = {eess.AS},
  url = {https://arxiv.org/abs/2506.13414},
}
```
## Contact
For questions or collaborations, feel free to email: [email protected]
BUT Speech@FIT, Brno University of Technology
GitHub: BUTSpeechFIT