---
license: cc-by-4.0
language:
  - eu
  - gl
  - ca
  - es
metrics:
  - perplexity
tags:
  - kenlm
  - n-gram
  - language-model
  - lm
  - whisper
  - automatic-speech-recognition
---

# Model Card for Whisper N-Gram Language Models

## Model Description

These models are [KenLM](https://kheafield.com/code/kenlm/) n-gram language models trained to support automatic speech recognition (ASR). They are designed to work well with Whisper ASR models, but they are applicable to any ASR system that can consume KenLM models. By providing context-specific probabilities of word sequences, they can improve recognition accuracy.

## Intended Use

These models are intended for language modeling within ASR systems to improve prediction accuracy, especially in low-resource language scenarios. They can be integrated into any system that supports KenLM models.

## Model Details

Each model is built with the KenLM toolkit from n-gram statistics extracted from large, domain-specific corpora. The available models are:

- **Basque (eu)**: `5gram-eu.bin` (11 GB)
- **Galician (gl)**: `5gram-gl.bin` (8.4 GB)
- **Catalan (ca)**: `5gram-ca.bin` (20 GB)
- **Spanish (es)**: `5gram-es.bin` (13 GB)

## How to Use

Here is an example of how to download and use the Basque model with KenLM in Python:

```python
import kenlm
from huggingface_hub import hf_hub_download

# Download the binary model from the Hugging Face Hub.
filepath = hf_hub_download(repo_id="HiTZ/whisper-lm-ngrams", filename="5gram-eu.bin")
model = kenlm.Model(filepath)

# Total log10 probability of the sentence, with <s> and </s> markers included.
print(model.score("talka diskoetxearekin grabatzen ditut beti abestien maketak", bos=True, eos=True))
```

## Training Data

The models were trained on corpora capped at 27 million sentences each to keep them comparable and manageable.
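A model of this kind can be reproduced with the standard KenLM command-line tools; the following is a minimal sketch, not the exact command line used for these models. `corpus.txt` is a hypothetical file with one preprocessed sentence per line, and `lmplz`/`build_binary` come from a compiled KenLM checkout:

```shell
# Estimate a 5-gram model with modified Kneser-Ney smoothing (the only
# smoothing lmplz implements) and write it in the textual ARPA format.
lmplz -o 5 < corpus.txt > 5gram.arpa

# Compile the ARPA file into KenLM's compact binary format for fast loading.
build_binary 5gram.arpa 5gram.bin
```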
Here's a breakdown of the sources for each language:

- **Basque**: [EusCrawl 1.0](https://www.ixa.eus/euscrawl/)
- **Galician**: [SLI GalWeb Corpus](https://github.com/xavier-gz/SLI_Galician_Corpora)
- **Catalan**: [Catalan Textual Corpus](https://zenodo.org/records/4519349)
- **Spanish**: [Spanish LibriSpeech MLS](https://openslr.org/94/)

Additional data from recent [Wikipedia dumps](https://dumps.wikimedia.org/) and the [Opus corpus](https://opus.nlpl.eu/) was added as needed to reach the sentence cap.

## Model Performance

Performance varies with the language and the quality of the training data. It is typically evaluated by perplexity and by the improvement in ASR accuracy when the model is integrated.

## Considerations

These models are intended for research and production use in language-specific ASR tasks. They should be tested for bias and fairness before deployment in diverse settings.

## Citation

If you use these models in your research, please cite:

```bibtex
@misc{dezuazo2025whisperlmimprovingasrmodels,
      title={Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages},
      author={Xabier de Zuazo and Eva Navas and Ibon Saratxaga and Inma Hernáez Rioja},
      year={2025},
      eprint={2503.23542},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.23542},
}
```

See the paper preprint at [arXiv:2503.23542](https://arxiv.org/abs/2503.23542) for more details.

## Licensing

These models are available under the [Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/). You are free to use, modify, and distribute them as long as you credit the original creators.

## Acknowledgements

We would like to thank Niels Rogge for his guidance and support in creating this repository. You can find more about his work at [his Hugging Face profile](https://huggingface.co/nielsr).