---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- fine-tuned
- mlc-slm
- speaker-diarization
- meeting-transcription
- DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
---
# DiCoW_v3_MLC — BUT-FIT Model for MLC-SLM Challenge
This repository contains the DiCoW_v3_MLC model developed by BUT Speech@FIT for the MLC-SLM Challenge. Diarization-Conditioned Whisper (DiCoW) is a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information.
The model is described in detail in the following papers:
- 📰 Journal paper (main DiCoW paper): DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition
- 📰 ICASSP paper (initial DiCoW experiments): Target Speaker ASR with Whisper
- 📰 MLC-SLM Challenge submission paper: BUT System for the MLC-SLM Challenge
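To give an intuition for the conditioning, here is a minimal, illustrative sketch of turning diarization output into per-frame silence/target/non-target/overlap (STNO) masks, the kind of signal DiCoW conditions the encoder on. The function name, the 20 ms frame hop, and the hard 0/1 masks are assumptions for illustration only; see the papers and the DiCoW GitHub for the actual formulation.

```python
import numpy as np

def stno_mask(segments, target, num_frames, frame_hop=0.02):
    """Turn diarization segments [(speaker, start_s, end_s), ...] into a
    per-frame silence/target/non-target/overlap (STNO) mask.

    Illustrative sketch only -- the released DiCoW code derives these
    masks differently; names and the frame rate here are assumptions."""
    speakers = sorted({spk for spk, _, _ in segments})
    # Per-speaker binary activity at each frame
    activity = np.zeros((len(speakers), num_frames), dtype=bool)
    for spk, start, end in segments:
        i = speakers.index(spk)
        activity[i, int(start / frame_hop):int(np.ceil(end / frame_hop))] = True

    tgt = activity[speakers.index(target)]
    others = np.delete(activity, speakers.index(target), axis=0).any(axis=0)

    mask = np.zeros((num_frames, 4))   # columns: silence, target, non-target, overlap
    mask[~tgt & ~others, 0] = 1.0      # nobody speaks
    mask[tgt & ~others, 1] = 1.0      # target speaker alone
    mask[~tgt & others, 2] = 1.0      # other speaker(s) only
    mask[tgt & others, 3] = 1.0      # target overlapped by others
    return mask

# Target speaker A talks 0-1 s; speaker B talks 0.5-1.5 s; 2 s of audio.
m = stno_mask([("A", 0.0, 1.0), ("B", 0.5, 1.5)], target="A", num_frames=100)
```

Each frame gets exactly one of the four labels, so the mask rows form a one-hot distribution that can be injected into the acoustic model.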
## Model Summary
The model is based on Whisper large-v3-turbo, initially trained on:
- NOTSOFAR-1
- AMI Meeting Corpus
- Libri2Mix dataset
It is then fine-tuned on the MLC-SLM dataset as part of the MLC-SLM Challenge.
## Model Details
- Developed by: BUT Speech@FIT, Brno University of Technology
- Model type: Whisper large-v3-turbo + DiCoW composition
- Language(s): Multilingual (primarily English, but supports multiple languages)
- License: apache-2.0
- Fine-tuned from: openai/whisper-large-v3-turbo
- Challenge: MLC-SLM (Multilingual Conversational Speech Language Model)
## Model Sources
- Training Code: TS-ASR-Whisper GitHub
- Inference Code & DiCoW framework: DiCoW GitHub
## Getting Started
```python
from transformers import AutoModelForSpeechSeq2Seq

# The checkpoint ships custom DiCoW code, hence trust_remote_code=True
MODEL_NAME = "BUT-FIT/DiCoW_v3_MLC"
dicow = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```
For detailed inference and full pipelines, refer to: 👉 DiCoW GitHub inference repo
## tcpWER/CER (%) on the MLC-SLM development set
| Language | Baseline (GT) | DiCoW (GT) | FT (GT) | Baseline (Real diar) | DiCoW (Real diar) | FT (Real diar) |
|---|---|---|---|---|---|---|
| American En. | 14.1 | 20.6 | 11.1 | 53.7 | 36.5 | 22.5 |
| Australian En. | 11.7 | 19.4 | 7.4 | 52.6 | 23.6 | 13.0 |
| British En. | 10.1 | 16.7 | 7.7 | 71.9 | 26.1 | 17.6 |
| Filipino En. | 9.2 | 17.7 | 7.5 | 50.4 | 25.5 | 15.2 |
| Indian En. | 14.0 | 14.3 | 13.3 | 70.7 | 14.9 | 14.0 |
| French | 28.1 | 27.7 | 16.1 | 96.0 | 37.8 | 27.5 |
| German | 20.7 | 21.2 | 23.9 | 86.7 | 30.1 | 27.3 |
| Italian | 17.9 | 16.2 | 12.3 | 83.3 | 19.8 | 16.4 |
| Japanese (\*) | 21.6 | 19.2 | 13.7 | 71.3 | 25.8 | 23.3 |
| Korean (\*) | 13.8 | 12.8 | 8.5 | 59.6 | 24.5 | 22.8 |
| Portuguese | 21.2 | 24.5 | 19.5 | 118.8 | 33.1 | 29.7 |
| Russian | 17.7 | 17.6 | 11.6 | 69.2 | 22.5 | 16.7 |
| Spanish | 12.3 | 11.6 | 8.7 | 75.6 | 18.2 | 16.3 |
| Thai (\*) | 14.5 | 31.9 | 14.2 | 83.6 | 34.4 | 20.1 |
| Vietnamese | 27.2 | 30.0 | 15.3 | 82.8 | 33.8 | 24.7 |
| **Overall** | 16.8 | 22.0 | 12.9 | 76.1 | 28.4 | 20.8 |
*Results marked with an asterisk (\*) are reported using tcpCER, following the official evaluation protocol.*
Notes:
- GT = Ground-Truth Segmentation
- Real diar = Real Diarization
- Baseline uses Whisper large-v3 with chunked inference + fine-tuned Pyannote diarization.
- DiCoW uses fine-tuned DiariZen diarization.
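For intuition about the metric above: tcpWER scores multi-speaker transcripts by matching hypothesis speaker streams to reference streams under time constraints. The simpler, unconstrained cpWER variant can be sketched as below; the time-constrained version additionally requires matched words to be close in time. This toy implementation is illustrative only (it assumes equally many reference and hypothesis streams) and is not the official scoring code.

```python
from itertools import permutations

def edit_distance(ref, hyp):
    # Standard Levenshtein distance over word lists (rolling 1-D table).
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + (r != h)) # substitution / match
    return d[-1]

def cp_wer(refs, hyps):
    """Concatenated-minimum-permutation WER: try every assignment of
    hypothesis streams to reference streams and keep the cheapest one,
    then divide total errors by the total number of reference words."""
    n_ref_words = sum(len(r) for r in refs)
    best = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best / n_ref_words

# Streams are swapped relative to the reference; the permutation search
# still matches them correctly, leaving one substitution ("word").
refs = [["hello", "world"], ["good", "bye"]]
hyps = [["good", "bye"], ["hello", "word"]]
print(cp_wer(refs, hyps))  # 1 error / 4 reference words = 0.25
```

The permutation search is what makes the metric independent of the order in which diarization labels the speakers.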
## Citation
If you use this model, please cite:
```bibtex
@article{POLOK2026101841,
  title = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
  journal = {Computer Speech & Language},
  volume = {95},
  pages = {101841},
  year = {2026},
  issn = {0885-2308},
  doi = {10.1016/j.csl.2025.101841},
  url = {https://www.sciencedirect.com/science/article/pii/S088523082500066X},
  author = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
  keywords = {Diarization-conditioned Whisper, Target-speaker ASR, Speaker diarization, Long-form ASR, Whisper adaptation},
}

@inproceedings{10887683,
  author = {Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
  booktitle = {ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title = {Target Speaker ASR with Whisper},
  year = {2025},
  pages = {1-5},
  keywords = {Transforms;Signal processing;Transformers;Acoustics;Speech processing;target-speaker ASR;diarization conditioning;multi-speaker ASR;Whisper},
  doi = {10.1109/ICASSP49660.2025.10887683},
}

@misc{polok2025mlcslmchallenge,
  title = {BUT System for the MLC-SLM Challenge},
  author = {Alexander Polok and Jiangyu Han and Dominik Klement and Samuele Cornell and Jan Černocký and Lukáš Burget},
  year = {2025},
  eprint = {2506.13414},
  archivePrefix = {arXiv},
  primaryClass = {eess.AS},
  url = {https://arxiv.org/abs/2506.13414},
}
```
## Contact
For questions or collaborations, feel free to email: [email protected]
BUT Speech@FIT, Brno University of Technology
GitHub: BUTSpeechFIT