---
license: cc-by-4.0
language:
- eu
- gl
- ca
- es
metrics:
- perplexity
tags:
- kenlm
- n-gram
- language-model
- lm
- whisper
- automatic-speech-recognition
---
# Model Card for Whisper N-Gram Language Models
## Model Description
These are [KenLM](https://kheafield.com/code/kenlm/) n-gram language models
trained to support automatic speech recognition (ASR). They are designed to
work well with Whisper ASR models, but they apply to any ASR system that can
use an n-gram language model. They can improve recognition accuracy by
supplying context-specific probabilities of word sequences.
## Intended Use
These models are intended for use in language modeling tasks within ASR systems
to improve prediction accuracy, especially in low-resource language scenarios.
They can be integrated into any system that supports KenLM models.
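One common way to integrate an n-gram model is to re-rank the ASR system's n-best hypotheses. The sketch below is only an illustration under stated assumptions, not necessarily the integration used by the authors: `rescore_nbest`, `alpha`, `beta`, and `toy_lm_score` are hypothetical names, and the toy scorer stands in for a real KenLM model so the example runs without downloading anything.

```python
from typing import Callable, List, Tuple


def rescore_nbest(
    hypotheses: List[Tuple[str, float]],
    lm_score: Callable[[str], float],
    alpha: float = 0.5,
    beta: float = 0.0,
) -> List[Tuple[str, float]]:
    """Re-rank (text, asr_log_score) pairs with a language model.

    combined = asr_log_score + alpha * lm_score(text) + beta * word_count
    alpha and beta are assumed tuning weights, normally chosen on held-out data.
    """
    rescored = []
    for text, asr_score in hypotheses:
        combined = asr_score + alpha * lm_score(text) + beta * len(text.split())
        rescored.append((text, combined))
    # Highest combined score first.
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


def toy_lm_score(text: str) -> float:
    """Hypothetical stand-in for a KenLM score: penalize unknown words."""
    known = {"grabatzen", "ditut", "beti"}
    return sum(0.0 if word in known else -2.0 for word in text.split())


# Two made-up hypotheses with made-up ASR log scores.
nbest = [("grabatzen ditud beti", -1.0), ("grabatzen ditut beti", -1.2)]
best_text, best_score = rescore_nbest(nbest, toy_lm_score, alpha=0.5)[0]
print(best_text, best_score)
```

With a real model, `lm_score` could be something like `lambda s: model.score(s, bos=True, eos=True)`. KenLM reports log10 probabilities, so any scale difference with the ASR scores would have to be absorbed by `alpha`.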
## Model Details
Each model is built using the KenLM toolkit and is based on n-gram statistics
extracted from large, domain-specific corpora. The models available are:
- **Basque (eu)**: `5gram-eu.bin` (11G)
- **Galician (gl)**: `5gram-gl.bin` (8.4G)
- **Catalan (ca)**: `5gram-ca.bin` (20G)
- **Spanish (es)**: `5gram-es.bin` (13G)
## How to Use
Here is an example of how to load and use the Basque model with KenLM in
Python:
```python
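import kenlm
from huggingface_hub import hf_hub_download

# Download the Basque model (about 11G) from the Hub; the file is cached locally.
filepath = hf_hub_download(repo_id="HiTZ/whisper-lm-ngrams", filename="5gram-eu.bin")
model = kenlm.Model(filepath)

# score() returns the log10 probability of the sentence, here including the
# begin- and end-of-sentence markers.
print(model.score("talka diskoetxearekin grabatzen ditut beti abestien maketak", bos=True, eos=True))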
```
## Training Data
The models were trained on corpora capped at 27 million sentences each to
maintain comparability and manageability. Here's a breakdown of the sources for
each language:
* **Basque**: [EusCrawl 1.0](https://www.ixa.eus/euscrawl/)
* **Galician**: [SLI GalWeb Corpus](https://github.com/xavier-gz/SLI_Galician_Corpora)
* **Catalan**: [Catalan Textual Corpus](https://zenodo.org/records/4519349)
* **Spanish**: [Spanish LibriSpeech MLS](https://openslr.org/94/)
Additional data from recent [Wikipedia dumps](https://dumps.wikimedia.org/) and
the [Opus corpus](https://opus.nlpl.eu/) were used as needed to reach the
sentence cap.
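The capping step described above can be sketched as follows. This is a minimal illustration, not the authors' actual pipeline: the function name `cap_sentences`, the whitespace stripping, and the rule of reading the primary corpus before any fallback source are all assumptions.

```python
from itertools import chain, islice
from typing import Iterable, Iterator


def cap_sentences(
    primary: Iterable[str],
    *fallbacks: Iterable[str],
    cap: int = 27_000_000,
) -> Iterator[str]:
    """Yield at most `cap` non-empty sentences, reading `primary` first
    and drawing on the fallback sources only if more are needed."""
    stripped = (sentence.strip() for sentence in chain(primary, *fallbacks))
    non_empty = (sentence for sentence in stripped if sentence)
    return islice(non_empty, cap)


# Tiny made-up example with a cap of 3 instead of 27 million.
primary = ["lehen esaldia", "", "bigarren esaldia"]
extra = ["hirugarren esaldia", "laugarren esaldia"]
capped = list(cap_sentences(primary, extra, cap=3))
print(capped)
```

Because the function works on iterators, it can stream from open file handles without loading a whole corpus into memory.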
## Model Performance
The performance of these models varies by the specific language and the quality
of the training data. Typically, performance is evaluated based on perplexity
and the improvement in ASR accuracy when integrated.
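As a rough illustration of the perplexity metric, the sketch below converts a sentence-level log10 probability (the kind of score KenLM reports) into a per-token perplexity. The function name and the example score are hypothetical, and counting one extra token for the end-of-sentence marker is an assumption about the convention in use.

```python
def perplexity(log10_prob: float, num_words: int) -> float:
    """Per-token perplexity from a sentence-level log10 probability.

    One extra token is counted for the end-of-sentence marker (assumed convention).
    """
    tokens = num_words + 1
    return 10.0 ** (-log10_prob / tokens)


sentence = "grabatzen ditut beti"
log10_prob = -8.0  # hypothetical score, not produced by any of the models
ppl = perplexity(log10_prob, len(sentence.split()))
print(ppl)
```

Lower perplexity means the model finds the text less surprising. Values are only comparable between models evaluated on the same text with the same tokenization.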
## Considerations
These models are designed for use in research and production for
language-specific ASR tasks. They should be tested for bias and fairness to
ensure appropriate use in diverse settings.
## Citation
If you use these models in your research, please cite:
```bibtex
@misc{dezuazo2025whisperlmimprovingasrmodels,
title={Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages},
author={Xabier de Zuazo and Eva Navas and Ibon Saratxaga and Inma Hernáez Rioja},
year={2025},
eprint={2503.23542},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.23542},
}
```
For more details, see the related preprint at
[arXiv:2503.23542](https://arxiv.org/abs/2503.23542).
## Licensing
These models are available under the
[Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).
You are free to use, modify, and distribute them as long as you credit
the original creators.
## Acknowledgements
We would like to express our gratitude to Niels Rogge for his guidance and
support in the creation of this dataset repository. You can find more about his
work at [his Hugging Face profile](https://huggingface.co/nielsr).