HiTZ/whisper-lm-ngrams

+---
+license: cc-by-4.0
+language:
+- eu
+- gl
+- ca
+- es
+metrics:
+- perplexity
+tags:
+- kenlm
+- n-gram
+- language-model
+- lm
+- whisper
+- automatic-speech-recognition
+---
+# Model Card for Whisper N-Gram Language Models
+## Model Description
+These are [KenLM](https://kheafield.com/code/kenlm/) n-gram language models
+trained to support automatic speech recognition (ASR). They are designed to
+work well with Whisper ASR models, but they apply to any ASR system that can
+use an n-gram language model. They can improve recognition accuracy by
+supplying probabilities for word sequences in the target language.
+## Intended Use
+These models are intended for use in language modeling tasks within ASR systems
+to improve prediction accuracy, especially in low-resource language scenarios.
+They can be integrated into any system that supports KenLM models.
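One common way to integrate an n-gram model into an ASR system is to rescore the recognizer's n-best hypotheses with a weighted sum of the ASR score and the language model score. The sketch below illustrates that general idea only; it is not necessarily the integration used by the authors. The function name `rescore_nbest`, the weights `alpha` and `beta`, and the toy scorer are all assumptions made for this example.

```python
from typing import Callable, List, Tuple


def rescore_nbest(
    hypotheses: List[Tuple[str, float]],
    lm_log10_score: Callable[[str], float],
    alpha: float = 0.5,
    beta: float = 0.0,
) -> List[Tuple[str, float]]:
    """Rerank (text, asr_log_prob) pairs using an external LM score.

    alpha weights the LM score (and absorbs any log-base mismatch with
    the ASR score); beta is a per-word bonus. Both are hypothetical
    hyperparameters that would need tuning on held-out data.
    """
    rescored = []
    for text, asr_log_prob in hypotheses:
        n_words = len(text.split())
        combined = asr_log_prob + alpha * lm_log10_score(text) + beta * n_words
        rescored.append((text, combined))
    # Highest combined score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Toy stand-in for a KenLM model so the sketch runs without a download.
# With a real model, one would pass something like
# `lambda s: model.score(s, bos=True, eos=True)` instead.
def toy_lm(sentence: str) -> float:
    known = {"kaixo", "mundua"}
    return sum(-1.0 if word in known else -5.0 for word in sentence.split())


nbest = [("kaixo mundoa", -1.0), ("kaixo mundua", -1.2)]
best_text, best_score = rescore_nbest(nbest, toy_lm, alpha=0.5)[0]
print(best_text, best_score)
```

In this toy run the second hypothesis wins even though its ASR score is lower, because the language model penalizes the unknown word in the first one.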
+## Model Details
+Each model is built using the KenLM toolkit and is based on n-gram statistics
+extracted from large, domain-specific corpora. The models available are:
+- **Basque (eu)**: `5gram-eu.bin` (11G)
+- **Galician (gl)**: `5gram-gl.bin` (8.4G)
+- **Catalan (ca)**: `5gram-ca.bin` (20G)
+- **Spanish (es)**: `5gram-es.bin` (13G)
+## How to Use
+Here is an example of how to load and use the Basque model with KenLM in
+Python:
+```python
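+import kenlm
+from huggingface_hub import hf_hub_download
+
+# Download the Basque model (about 11G) into the local Hugging Face cache.
+filepath = hf_hub_download(repo_id="HiTZ/whisper-lm-ngrams", filename="5gram-eu.bin")
+model = kenlm.Model(filepath)
+# score() returns the log10 probability of the sentence, with <s> and </s> added.
+print(model.score("talka diskoetxearekin grabatzen ditut beti abestien maketak", bos=True, eos=True))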
+```
+## Training Data
+The models were trained on corpora capped at 27 million sentences each to
+maintain comparability and manageability. Here's a breakdown of the sources for
+each language:
+* **Basque**: [EusCrawl 1.0](https://www.ixa.eus/euscrawl/)
+* **Galician**: [SLI GalWeb Corpus](https://github.com/xavier-gz/SLI_Galician_Corpora)
+* **Catalan**: [Catalan Textual Corpus](https://zenodo.org/records/4519349)
+* **Spanish**: [Spanish LibriSpeech MLS](https://openslr.org/94/)
+Additional data from recent [Wikipedia dumps](https://dumps.wikimedia.org/) and
+the [Opus corpus](https://opus.nlpl.eu/) were used as needed to reach the
+sentence cap.
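The capping step can be pictured as reading the primary corpus first and drawing from the supplementary sources only until the limit is reached. The following is a minimal sketch under that assumption; the card does not describe the actual selection order, deduplication, or sampling, and the helper name `cap_sentences` is hypothetical.

```python
from itertools import chain, islice
from typing import Iterable, Iterator


def cap_sentences(
    primary: Iterable[str],
    extras: Iterable[Iterable[str]],
    cap: int,
) -> Iterator[str]:
    """Yield at most `cap` sentences, exhausting `primary` before `extras`."""
    return islice(chain(primary, *extras), cap)


# Tiny in-memory stand-ins for the real corpora (assumed, for illustration).
primary = ["s1", "s2", "s3"]
wikipedia = ["w1", "w2"]
opus = ["o1", "o2"]

capped = list(cap_sentences(primary, [wikipedia, opus], cap=6))
print(capped)
```

Because `islice` is lazy, the same pattern works with file iterators, so a corpus of tens of millions of sentences never has to be held in memory at once.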
+## Model Performance
+Performance varies by language and with the quality of the training data. It
+is typically evaluated by perplexity on held-out text and by the improvement
+in ASR accuracy (for example, a lower word error rate) once the model is
+integrated into decoding.
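As an illustration of the perplexity metric, the sketch below computes corpus-level perplexity from per-sentence log10 scores. It assumes the scorer returns the log10 probability of the whole sentence including the end-of-sentence token, which is how `kenlm.Model.score` behaves with `bos=True, eos=True`. The function name `corpus_perplexity` and the toy scorer are assumptions for this example, not part of the released models.

```python
from typing import Callable, Iterable


def corpus_perplexity(
    sentences: Iterable[str],
    log10_score: Callable[[str], float],
) -> float:
    """Perplexity over a corpus: 10 ** (-total_log10_prob / total_tokens)."""
    total_log10 = 0.0
    total_tokens = 0
    for sentence in sentences:
        total_log10 += log10_score(sentence)
        # Count one extra token per sentence for the end-of-sentence marker.
        total_tokens += len(sentence.split()) + 1
    if total_tokens == 0:
        raise ValueError("no sentences to evaluate")
    return 10.0 ** (-total_log10 / total_tokens)


# Toy scorer: every token, including </s>, gets probability 0.1.
def uniform_scorer(sentence: str) -> float:
    return -1.0 * (len(sentence.split()) + 1)


ppl = corpus_perplexity(["a b c", "d e"], uniform_scorer)
print(ppl)
```

A model that assigns probability 0.1 to every token is as uncertain as a uniform choice among ten words, which is why the toy scorer yields a perplexity of 10.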
+## Considerations
+These models are designed for use in research and production for
+language-specific ASR tasks. They should be tested for bias and fairness to
+ensure appropriate use in diverse settings.
+## Citation
+If you use these models in your research, please cite:
+```bibtex
+@misc{dezuazo2025whisperlmimprovingasrmodels,
+      title={Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages},
+      author={Xabier de Zuazo and Eva Navas and Ibon Saratxaga and Inma Hernáez Rioja},
+      year={2025},
+      eprint={2503.23542},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2503.23542},
+}
+```
+For more details, see the preprint at
+[arXiv:2503.23542](https://arxiv.org/abs/2503.23542).
+## Licensing
+These models are available under the
+[Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).
+You are free to use, modify, and distribute them as long as you credit
+the original creators.
+## Acknowledgements
+We would like to thank Niels Rogge for his guidance and support in the
+creation of this repository. You can find more about his work at
+[his Hugging Face profile](https://huggingface.co/nielsr).