---
license: apache-2.0
base_model:
- AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19
pipeline_tag: automatic-speech-recognition
---

# General-purpose Latgalian ASR model

This is a fine-tuned [whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) model for [Latgalian](https://en.wikipedia.org/wiki/Latgalian_language), trained by [AiLab.lv](https://ailab.lv) using two general-purpose speech datasets: 
- the Latgalian part of [Common Voice 20.0](https://commonvoice.mozilla.org/ltg/datasets), 
- the Corpus of Contemporary Latgalian Speech [MuLaR](https://korpuss.lv/id/MuLaR).
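
A minimal sketch of how a fine-tuned Whisper checkpoint like this one could be used for transcription through the Transformers `pipeline` API. The repository id below is a hypothetical placeholder, since this card does not state the model's own id, and the 30-second chunking is an assumption for handling long recordings rather than a documented setting of this model.

```python
# Hypothetical placeholder: replace with this model's actual Hub repository id.
MODEL_ID = "your-namespace/your-latgalian-asr-model"


def transcribe(audio_path: str, model_id: str = MODEL_ID) -> str:
    """Transcribe one audio file with a Whisper checkpoint from the Hub."""
    # Imported lazily so this module loads even where Transformers is absent.
    from transformers import pipeline

    asr = pipeline(
        "automatic-speech-recognition",
        model=model_id,
        chunk_length_s=30,  # assumption: split long audio into 30-second windows
    )
    # The ASR pipeline returns a dict whose "text" key holds the transcript.
    return asr(audio_path)["text"]
```

Calling `transcribe("recording.wav")` (the file name is illustrative) would download the checkpoint on first use and return the transcript as a string.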

## Training

As the base model, we used a previously fine-tuned ASR model for [Latvian](https://huggingface.co/AiLab-IMCS-UL/whisper-large-v3-lv-late-cv19) and continued fine-tuning it on the Latgalian data listed below. The fine-tuning was done using the Hugging Face Transformers library.

| Training data | Hours |
|:---|---:|
| Latgalian Common Voice 20.0 train set (a [VW split](https://analyzer.cv-toolbox.web.tr)) | 22.9 |
| Corpus of Contemporary Latgalian Speech (MuLaR) train set | 17.3 |
| Total | 40.2 |

## Evaluation

| Testing data | WER (%) |
|:---|---:|
| Latgalian CV 20.0 test set (1.5 hours) | 9.1 |
| MuLaR test set (1.6 hours) | 25.7 |

NB! Unlike the Latgalian CV corpus, the MuLaR corpus contains transcriptions that generally do not follow the rules of standard Latgalian orthography.
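
For readers unfamiliar with the metric, the sketch below shows the standard definition of word error rate: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, expressed as a percentage. This card does not say which tool or text normalization produced the scores above, so the function name and the whitespace tokenization are assumptions made for illustration only.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent for one reference/hypothesis pair."""
    ref = reference.split()  # assumption: plain whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words (Levenshtein, row by row).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Placeholder tokens: one substitution and one deletion over four words.
print(word_error_rate("a b c d", "a x c"))  # 50.0
```

A corpus-level score is usually obtained by summing the edit counts over all utterances and dividing by the total number of reference words, not by averaging per-utterance values.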

## Acknowledgements

This work was supported by the EU Recovery and Resilience Facility project [Language Technology Initiative](https://www.vti.lu.lv) (2.3.1.1.i.0/1/22/I/CFLA/002) in synergy with the State Research Programme project "Diversity of Latvian in Time and Space" (VPP-LETONIKA-2021/4-0003).