---
license: apache-2.0
base_model:
- AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19
pipeline_tag: automatic-speech-recognition
---
# General-purpose Latgalian ASR model
This is a fine-tuned [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) model for [Latgalian](https://en.wikipedia.org/wiki/Latgalian_language), trained by [AiLab.lv](https://ailab.lv) using two general-purpose speech datasets:
- the Latgalian part of [Common Voice 20.0](https://commonvoice.mozilla.org/ltg/datasets),
- the Corpus of Contemporary Latgalian Speech [MuLaR](https://korpuss.lv/id/MuLaR).
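A minimal usage sketch with the Transformers `pipeline` API is given below. It is an illustration rather than an official snippet: `MODEL_ID` is a placeholder, since the repository id of this model is not stated in the card, and `sample.wav` is a hypothetical audio file. The 30-second chunk length is an assumption that matches Whisper's input window.

```python
# Placeholder: replace with the actual repository id of this model.
MODEL_ID = "<this-model-repo-id>"


def build_asr_kwargs(model_id: str, chunk_length_s: int = 30) -> dict:
    """Collect the arguments passed to transformers.pipeline()."""
    return {
        "task": "automatic-speech-recognition",
        "model": model_id,
        # Long recordings are split into chunks of this many seconds (assumed value).
        "chunk_length_s": chunk_length_s,
    }


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one audio file and return the recognized text."""
    # Imported lazily so the module can be loaded without transformers installed.
    from transformers import pipeline

    asr = pipeline(**build_asr_kwargs(model_id))
    return asr(audio_path)["text"]


if __name__ == "__main__":
    # "sample.wav" is a hypothetical file name used only for illustration.
    print(transcribe("sample.wav"))
```

Decoding audio files from a path requires `ffmpeg` to be available on the system.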
## Training
We started from a previously fine-tuned ASR model for [Latvian](https://huggingface.co/AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19) and continued fine-tuning it for Latgalian using the Hugging Face Transformers library.
| Training data | Hours |
|:---|---:|
| Latgalian Common Voice 20.0 train set (a [VW split](https://analyzer.cv-toolbox.web.tr)) | 22.9 |
| Corpus of Contemporary Latgalian Speech (MuLaR) train set | 17.3 |
| Total | 40.2 |
## Evaluation
| Testing data | WER (%) |
|:---|---:|
| Latgalian CV 20.0 test set (1.5 hours) | 9.1 |
| MuLaR test set (1.6 hours) | 25.7 |
NB! The MuLaR corpus contains transcriptions that generally do not follow the rules of the standard Latgalian orthography, in contrast to the Latgalian CV corpus.
## Acknowledgements
This work was supported by the EU Recovery and Resilience Facility project [Language Technology Initiative](https://www.vti.lu.lv) (2.3.1.1.i.0/1/22/I/CFLA/002) in synergy with the State Research Programme project "Diversity of Latvian in Time and Space" (VPP-LETONIKA-2021/4-0003).