---
license: apache-2.0
base_model:
- AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19
pipeline_tag: automatic-speech-recognition
---

# General-purpose Latgalian ASR model

This is a fine-tuned [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) model for [Latgalian](https://en.wikipedia.org/wiki/Latgalian_language), trained by [AiLab.lv](https://ailab.lv) using two general-purpose speech datasets:

- the Latgalian part of [Common Voice 20.0](https://commonvoice.mozilla.org/ltg/datasets),
- the Corpus of Contemporary Latgalian Speech [MuLaR](https://korpuss.lv/id/MuLaR).

## Training

As a base model, we used a previously fine-tuned ASR model for [Latvian](https://huggingface.co/AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19), and continued to fine-tune it for Latgalian.

The fine-tuning was done using the Hugging Face Transformers library.

| Training data | Hours |
|:---|---:|
| Latgalian Common Voice 20.0 train set (a [VW split](https://analyzer.cv-toolbox.web.tr)) | 22.9 |
| Corpus of Contemporary Latgalian Speech (MuLaR) train set | 17.3 |
| Total | 40.2 |

## Evaluation

| Testing data | WER (%) |
|:---|---:|
| Latgalian CV 20.0 test set (1.5 hours) | 9.1 |
| MuLaR test set (1.6 hours) | 25.7 |

NB! The MuLaR corpus contains transcriptions that generally do not follow the rules of the standard Latgalian orthography, in contrast to the Latgalian CV corpus.

## Acknowledgements

This work was supported by the EU Recovery and Resilience Facility project [Language Technology Initiative](https://www.vti.lu.lv) (2.3.1.1.i.0/1/22/I/CFLA/002) in synergy with the State Research Programme project "Diversity of Latvian in Time and Space" (VPP-LETONIKA-2021/4-0003).
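
## Appendix: how a WER figure is computed

The Evaluation section reports word error rate (WER). For readers who want to score their own transcripts in a comparable way, the following is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, expressed as a percentage. The function names `edit_distance` and `wer` are illustrative, and two details are assumptions rather than facts from this card: that errors are pooled over the whole test set before dividing, and that no text normalization (casing, punctuation) is applied beyond whitespace tokenization. The scoring setup actually used for the reported numbers may differ.

```python
def edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists (unit cost per edit)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def wer(references, hypotheses):
    """Corpus-level WER in percent (assumption: errors pooled over all utterances)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_errors = 0
    total_ref_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words = ref.split()  # assumption: plain whitespace tokenization
        hyp_words = hyp.split()
        total_errors += edit_distance(ref_words, hyp_words)
        total_ref_words += len(ref_words)
    if total_ref_words == 0:
        raise ValueError("references contain no words")
    return 100.0 * total_errors / total_ref_words


if __name__ == "__main__":
    # Placeholder tokens, not real Latgalian data.
    refs = ["a b c d"]
    hyps = ["a x c"]  # one substitution (b -> x) and one deletion (d)
    print(wer(refs, hyps))
```

Because WER compares surface word forms, differences in spelling conventions between a reference transcript and the model output count as errors, which is relevant to the note above about the MuLaR transcriptions and the standard Latgalian orthography.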