---
library_name: transformers
tags:
- speech
- automatic-speech-recognition
- whisper
- multilingual
- fine-tuned
- mlc-slm
- speaker-diarization
- meeting-transcription
- DiCoW
- BUT-FIT
pipeline_tag: automatic-speech-recognition
license: apache-2.0
datasets:
- microsoft/NOTSOFAR
- edinburghcstr/ami
---

# DiCoW_v3_MLC — BUT-FIT Model for the MLC-SLM Challenge

This repository contains the **DiCoW_v3_MLC** model developed by [BUT Speech@FIT](https://github.com/BUTSpeechFIT) for the [MLC-SLM Challenge](https://www.nexdata.ai/competition/mlc-slm). Diarization-Conditioned Whisper (DiCoW) is a novel approach to target-speaker ASR that leverages speaker diarization outputs as conditioning information.

The model is described in detail in the following papers:

* 📰 **Journal paper (main DiCoW paper):** [DiCoW: Diarization-Conditioned Whisper for Target Speaker Automatic Speech Recognition](https://authors.elsevier.com/a/1lI9m_K8BYumVY)
* 📰 **ICASSP paper (initial DiCoW experiments):** [Target Speaker ASR with Whisper](https://ieeexplore.ieee.org/document/10887683)
* 📰 **MLC-SLM Challenge submission paper:** [BUT System for the MLC-SLM Challenge](https://www.arxiv.org/abs/2506.13414)

## Model Summary

The model is based on **Whisper large-v3-turbo** and was initially trained on:

* **NOTSOFAR-1**
* the **AMI** Meeting Corpus
* the **Libri2Mix** dataset

It was then fine-tuned on the **MLC-SLM dataset** as part of the MLC-SLM Challenge.
## Model Details

* **Developed by:** BUT Speech@FIT, Brno University of Technology
* **Model type:** Whisper large-v3-turbo + DiCoW composition
* **Language(s):** Multilingual (primarily English, but supports multiple languages)
* **License:** apache-2.0
* **Fine-tuned from:** openai/whisper-large-v3-turbo
* **Challenge:** MLC-SLM (Multilingual Conversational Speech Language Model)

## Model Sources

* **Training Code:** [TS-ASR-Whisper GitHub](https://github.com/BUTSpeechFIT/TS-ASR-Whisper)
* **Inference Code & DiCoW framework:** [DiCoW GitHub](https://github.com/BUTSpeechFIT/DiCoW)

## Getting Started

```python
from transformers import AutoModelForSpeechSeq2Seq

MODEL_NAME = "BUT-FIT/DiCoW_v3_MLC"

dicow = AutoModelForSpeechSeq2Seq.from_pretrained(MODEL_NAME, trust_remote_code=True)
```

For detailed inference and full pipelines, refer to:
👉 [DiCoW GitHub inference repo](https://github.com/BUTSpeechFIT/DiCoW)

### tcpWER/CER (%) on the MLC-SLM development set

| Language       | Baseline (GT) | DiCoW (GT) | FT (GT) | Baseline (Real diar) | DiCoW (Real diar) | FT (Real diar) |
|----------------|---------------|------------|---------|----------------------|-------------------|----------------|
| American En.   | 14.1          | 20.6       | 11.1    | 53.7                 | 36.5              | 22.5           |
| Australian En. | 11.7          | 19.4       | 7.4     | 52.6                 | 23.6              | 13.0           |
| British En.    | 10.1          | 16.7       | 7.7     | 71.9                 | 26.1              | 17.6           |
| Filipino En.   | 9.2           | 17.7       | 7.5     | 50.4                 | 25.5              | 15.2           |
| Indian En.     | 14.0          | 14.3       | 13.3    | 70.7                 | 14.9              | 14.0           |
| French         | 28.1          | 27.7       | 16.1    | 96.0                 | 37.8              | 27.5           |
| German         | 20.7          | 21.2       | 23.9    | 86.7                 | 30.1              | 27.3           |
| Italian        | 17.9          | 16.2       | 12.3    | 83.3                 | 19.8              | 16.4           |
| Japanese (\*)  | 21.6          | 19.2       | 13.7    | 71.3                 | 25.8              | 23.3           |
| Korean (\*)    | 13.8          | 12.8       | 8.5     | 59.6                 | 24.5              | 22.8           |
| Portuguese     | 21.2          | 24.5       | 19.5    | 118.8                | 33.1              | 29.7           |
| Russian        | 17.7          | 17.6       | 11.6    | 69.2                 | 22.5              | 16.7           |
| Spanish        | 12.3          | 11.6       | 8.7     | 75.6                 | 18.2              | 16.3           |
| Thai (\*)      | 14.5          | 31.9       | 14.2    | 83.6                 | 34.4              | 20.1           |
| Vietnamese     | 27.2          | 30.0       | 15.3    | 82.8                 | 33.8              | 24.7           |
| **Overall**    | **16.8**      | **22.0**   | **12.9**| **76.1**             | **28.4**          | **20.8**       |

> *Results marked with an asterisk (\*) are reported using tcpCER, following the official evaluation protocol.*

**Notes:**
- GT = ground-truth segmentation
- Real diar = real diarization
- The baseline uses Whisper large-v3 with chunked inference and fine-tuned Pyannote diarization.
- DiCoW uses fine-tuned DiariZen diarization.
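For intuition about the numbers above: tcpWER is a time-constrained, speaker-permutation-aware extension of the standard word error rate, so its core is still the Levenshtein edit distance between reference and hypothesis word sequences. The sketch below (an illustration, not the official challenge scorer — the actual tcpWER additionally matches words under time constraints and over speaker permutations) shows that underlying computation:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Plain word error rate: edit distance over reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dp[i - 1][j] + 1
            insertion = dp[i][j - 1] + 1
            dp[i][j] = min(substitution, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)

# One deletion over a six-word reference -> 16.7 %
print(round(100 * wer("the cat sat on the mat", "the cat sat on mat"), 1))
```

Note that under real diarization, errors in segmentation and speaker attribution inflate insertions and deletions, which is why the "Real diar" columns can exceed 100 % for a weak baseline.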
## Citation

If you use this model, please cite:

```bibtex
@article{POLOK2026101841,
  title   = {DiCoW: Diarization-conditioned Whisper for target speaker automatic speech recognition},
  journal = {Computer Speech & Language},
  volume  = {95},
  pages   = {101841},
  year    = {2026},
  issn    = {0885-2308},
  doi     = {10.1016/j.csl.2025.101841},
  url     = {https://www.sciencedirect.com/science/article/pii/S088523082500066X},
  author  = {Alexander Polok and Dominik Klement and Martin Kocour and Jiangyu Han and Federico Landini and Bolaji Yusuf and Matthew Wiesner and Sanjeev Khudanpur and Jan Černocký and Lukáš Burget},
  keywords = {Diarization-conditioned Whisper, Target-speaker ASR, Speaker diarization, Long-form ASR, Whisper adaptation},
}

@inproceedings{10887683,
  author    = {Polok, Alexander and Klement, Dominik and Wiesner, Matthew and Khudanpur, Sanjeev and Černocký, Jan and Burget, Lukáš},
  booktitle = {ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title     = {Target Speaker ASR with Whisper},
  year      = {2025},
  pages     = {1-5},
  keywords  = {Transforms;Signal processing;Transformers;Acoustics;Speech processing;target-speaker ASR;diarization conditioning;multi-speaker ASR;Whisper},
  doi       = {10.1109/ICASSP49660.2025.10887683}
}

@misc{polok2025mlcslmchallenge,
  title         = {BUT System for the MLC-SLM Challenge},
  author        = {Alexander Polok and Jiangyu Han and Dominik Klement and Samuele Cornell and Jan Černocký and Lukáš Burget},
  year          = {2025},
  eprint        = {2506.13414},
  archivePrefix = {arXiv},
  primaryClass  = {eess.AS},
  url           = {https://arxiv.org/abs/2506.13414},
}
```

## Contact

For questions or collaborations, feel free to email: [ipoloka@fit.vut.cz](mailto:ipoloka@fit.vut.cz)

**BUT Speech@FIT, Brno University of Technology**
GitHub: [BUTSpeechFIT](https://github.com/BUTSpeechFIT)