metadata

license: apache-2.0
base_model:
  - AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19
pipeline_tag: automatic-speech-recognition

General-purpose Latgalian ASR model

This is a fine-tuned whisper-large-v3 model for Latgalian, trained by AiLab.lv using two general-purpose speech datasets:

the Latgalian part of Common Voice 20.0,
the Corpus of Contemporary Latgalian Speech (MuLaR).
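As a rough illustration of how a fine-tuned Whisper model like this one can be used for transcription, the sketch below goes through the Hugging Face Transformers `pipeline` API. The repository id of this model is not stated in the card, so `MODEL_ID` is a placeholder. The helper names `build_asr` and `transcribe` and the file name `example.wav` are hypothetical.

```python
from typing import Any, Callable, Dict

# Placeholder: replace with this model's actual Hugging Face repository id,
# which is not stated in this card.
MODEL_ID = "<this-model-repo-id>"


def build_asr(model_id: str = MODEL_ID):
    """Create a Transformers speech-recognition pipeline for the given model."""
    # Imported lazily so that transcribe() can be used without Transformers installed.
    from transformers import pipeline

    # chunk_length_s=30 lets the pipeline process recordings longer than
    # Whisper's 30-second input window by splitting them into chunks.
    return pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,
    )


def transcribe(asr: Callable[[str], Dict[str, Any]], audio_path: str) -> str:
    """Run the pipeline on one audio file and return the stripped transcript."""
    result = asr(audio_path)
    return result["text"].strip()


# Usage (requires `transformers`, a backend such as PyTorch, and ffmpeg
# for decoding audio files given as paths):
#
#   asr = build_asr()
#   print(transcribe(asr, "example.wav"))  # hypothetical audio file
```

The pipeline is created once and reused for every file, since loading a large-v3 checkpoint is the expensive step.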

Training

As the base model, we used a Whisper large-v3 model previously fine-tuned for Latvian (AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19) and continued fine-tuning it for Latgalian. The fine-tuning was done using the Hugging Face Transformers library.

Training data	Hours
Latgalian Common Voice 20.0 train set (a VW split)	22.9
Corpus of Contemporary Latgalian Speech (MuLaR) train set	17.3
Total	40.2

Evaluation

Testing data	WER (%)
Latgalian CV 20.0 test set (1.5 hours)	9.1
MuLaR test set (1.6 hours)	25.7

NB! Unlike the Latgalian CV corpus, the MuLaR corpus contains transcriptions that generally do not follow the rules of standard Latgalian orthography, which may partly explain the higher WER on the MuLaR test set.
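For readers unfamiliar with the metric, the following minimal sketch shows how a corpus-level word error rate (WER) can be computed from reference and hypothesis transcripts. It is not the evaluation script behind the figures above. The card does not describe the text normalization that was applied, so the lowercasing and whitespace tokenization used here are assumptions.

```python
from typing import List


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word sequences."""
    # prev[j] is the distance between the ref prefix processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(references: List[str], hypotheses: List[str]) -> float:
    """Corpus-level WER in percent: total word edits / total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    edits = 0
    ref_word_count = 0
    for ref, hyp in zip(references, hypotheses):
        # Assumed normalization: lowercase and split on whitespace only.
        ref_words = ref.lower().split()
        hyp_words = hyp.lower().split()
        edits += word_edit_distance(ref_words, hyp_words)
        ref_word_count += len(ref_words)
    if ref_word_count == 0:
        raise ValueError("references contain no words")
    return 100.0 * edits / ref_word_count
```

Because WER compares exact word forms, a spelling that differs from the reference counts as an error even when the spoken word was recognized correctly, which is why orthographic conventions in the reference transcripts affect the score.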

Acknowledgements

This work was supported by the EU Recovery and Resilience Facility project Language Technology Initiative (2.3.1.1.i.0/1/22/I/CFLA/002) in synergy with the State Research Programme project "Diversity of Latvian in Time and Space" (VPP-LETONIKA-2021/4-0003).