zuazo committed · Commit 1e06373 · verified · 1 Parent(s): 4b8d958

Complete the README.md.

Files changed (1)
  1. README.md +121 -3
README.md CHANGED

---
license: cc-by-4.0
language:
- eu
- gl
- ca
- es
metrics:
- perplexity
tags:
- kenlm
- n-gram
- language-model
- lm
- whisper
- automatic-speech-recognition
---

# Model Card for Whisper N-Gram Language Models

## Model Description

These models are [KenLM](https://kheafield.com/code/kenlm/) n-gram language
models trained to support automatic speech recognition (ASR). They are designed
to work well with Whisper ASR models, but they are generally applicable to any
ASR system that can use an n-gram language model. These models can improve
recognition accuracy by providing context-specific probabilities of word
sequences.

## Intended Use

These models are intended for language modeling within ASR systems to improve
recognition accuracy, especially in low-resource language scenarios. They can
be integrated into any system that supports KenLM models.

## Model Details

Each model is built using the KenLM toolkit and is based on n-gram statistics
extracted from large, domain-specific corpora. The models available are:

- **Basque (eu)**: `5gram-eu.bin` (11G)
- **Galician (gl)**: `5gram-gl.bin` (8.4G)
- **Catalan (ca)**: `5gram-ca.bin` (20G)
- **Spanish (es)**: `5gram-es.bin` (13G)

## How to Use

Here is an example of how to load and use the Basque model with KenLM in
Python:

```python
import kenlm
from huggingface_hub import hf_hub_download

# Download the Basque 5-gram binary from the Hugging Face Hub and load it.
filepath = hf_hub_download(repo_id="HiTZ/whisper-lm-ngrams", filename="5gram-eu.bin")
model = kenlm.Model(filepath)

# Log10 probability of the sentence, including begin- and end-of-sentence tokens.
print(model.score("talka diskoetxearekin grabatzen ditut beti abestien maketak", bos=True, eos=True))
```
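
A common way to use such a model with an ASR system is to rescore the
recognizer's candidate transcriptions. The sketch below is only an illustration
of that idea, not the exact fusion procedure used for Whisper-LM: the n-best
hypotheses, their ASR log-probabilities, the `combined_score` helper, and the
interpolation weights are all hypothetical placeholders.

```python
import kenlm
from huggingface_hub import hf_hub_download

# Load the Basque model as in the example above.
filepath = hf_hub_download(repo_id="HiTZ/whisper-lm-ngrams", filename="5gram-eu.bin")
lm = kenlm.Model(filepath)

# Hypothetical n-best list: (candidate transcription, ASR log-probability).
# In practice these would come from an ASR beam search; the values are made up.
nbest = [
    ("talka diskoetxearekin grabatzen ditut beti abestien maketak", -4.1),
    ("talka diskoetxe arekin grabatzen ditut beti abestien maketak", -3.9),
]

# Illustrative weights; real values should be tuned on a development set.
lm_weight = 0.5
length_weight = 0.1

def combined_score(text: str, asr_logprob: float) -> float:
    """Combine the ASR score with the LM score and a word-count bonus."""
    lm_logprob = lm.score(text, bos=True, eos=True)  # log10 probability
    return asr_logprob + lm_weight * lm_logprob + length_weight * len(text.split())

# Pick the hypothesis with the best combined score.
best_text, _ = max(nbest, key=lambda h: combined_score(*h))
print(best_text)
```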

## Training Data

The models were trained on corpora capped at 27 million sentences each to
maintain comparability and manageability. Here is a breakdown of the sources
for each language:

* **Basque**: [EusCrawl 1.0](https://www.ixa.eus/euscrawl/)
* **Galician**: [SLI GalWeb Corpus](https://github.com/xavier-gz/SLI_Galician_Corpora)
* **Catalan**: [Catalan Textual Corpus](https://zenodo.org/records/4519349)
* **Spanish**: [Spanish LibriSpeech MLS](https://openslr.org/94/)

Additional data from recent [Wikipedia dumps](https://dumps.wikimedia.org/) and
the [Opus corpus](https://opus.nlpl.eu/) were used as needed to reach the
sentence cap.
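
For reference, the following is a minimal sketch of how a 5-gram KenLM binary
like the ones above could be built from a plain-text corpus using the standard
KenLM command-line tools (`lmplz` and `build_binary`). The file names, the
capping step, and the options shown are illustrative assumptions, not the exact
preprocessing or settings used for the published models.

```python
import subprocess
from itertools import islice

MAX_SENTENCES = 27_000_000  # per-language sentence cap mentioned above

# Keep only the first 27 million sentences of an already cleaned corpus
# (one sentence per line). File names are hypothetical.
with open("corpus-eu.txt", encoding="utf-8") as src, \
        open("corpus-eu.capped.txt", "w", encoding="utf-8") as dst:
    dst.writelines(islice(src, MAX_SENTENCES))

# Estimate a 5-gram model in ARPA format with lmplz, then compile it into
# KenLM's binary format with build_binary (both tools ship with KenLM).
with open("corpus-eu.capped.txt", encoding="utf-8") as text, \
        open("5gram-eu.arpa", "w", encoding="utf-8") as arpa:
    subprocess.run(["lmplz", "-o", "5"], stdin=text, stdout=arpa, check=True)
subprocess.run(["build_binary", "5gram-eu.arpa", "5gram-eu.bin"], check=True)
```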

## Model Performance

Performance varies by language and by the quality of the training data. It is
typically evaluated in terms of perplexity on held-out text and the improvement
in ASR accuracy when the language model is integrated into the recognizer.
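
As a rough illustration, perplexity can be computed directly with the KenLM
Python bindings. The sentence below is just the usage example from above, not
an evaluation set, and no official scores are implied.

```python
import kenlm
from huggingface_hub import hf_hub_download

# Load the Basque model as in the usage example.
filepath = hf_hub_download(repo_id="HiTZ/whisper-lm-ngrams", filename="5gram-eu.bin")
model = kenlm.Model(filepath)

# Perplexity of a single sentence: lower means the model finds the text more
# predictable. For a real evaluation, average over a held-out test set.
sentence = "talka diskoetxearekin grabatzen ditut beti abestien maketak"
print(model.perplexity(sentence))
```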

## Considerations

These models are intended for research and production use in language-specific
ASR tasks. They should be tested for bias and fairness before deployment in
diverse settings.

## Citation

If you use these models in your research, please cite:

```bibtex
@misc{dezuazo2025whisperlmimprovingasrmodels,
      title={Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages},
      author={Xabier de Zuazo and Eva Navas and Ibon Saratxaga and Inma Hernáez Rioja},
      year={2025},
      eprint={2503.23542},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.23542},
}
```
105
+
106
+ And you can check the related paper preprint in
107
+ [arXiv:2503.23542](https://arxiv.org/abs/2503.23542)
108
+ for more details.
109
+
110
+ ## Licensing
111
+
112
+ This model is available under the
113
+ [Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).
114
+ You are free to use, modify, and distribute this model as long as you credit
115
+ the original creators.
116
+
117
+ ## Acknowledgements
118
+
119
+ We would like to express our gratitude to Niels Rogge for his guidance and
120
+ support in the creation of this dataset repository. You can find more about his
121
+ work at [his Hugging Face profile](https://huggingface.co/nielsr).