---
license: cc-by-4.0
language:
- eu
- gl
- ca
- es
metrics:
- perplexity
tags:
- kenlm
- n-gram
- language-model
- lm
- whisper
- automatic-speech-recognition
---

# Model Card for Whisper N-Gram Language Models

## Model Description

These are [KenLM](https://kheafield.com/code/kenlm/) n-gram language models
trained to support automatic speech recognition (ASR). They were designed to
work well with Whisper models, but they apply to any ASR system that can use a
robust n-gram language model. By providing context-specific probabilities for
word sequences, they can improve recognition accuracy.

## Intended Use

These models are intended for use in language modeling tasks within ASR systems
to improve prediction accuracy, especially in low-resource language scenarios.
They can be integrated into any system that supports KenLM models.

## Model Details

Each model is built using the KenLM toolkit and is based on n-gram statistics
extracted from large, domain-specific corpora. The models available are:

- **Basque (eu)**: `5gram-eu.bin` (11G)
- **Galician (gl)**: `5gram-gl.bin` (8.4G)
- **Catalan (ca)**: `5gram-ca.bin` (20G)
- **Spanish (es)**: `5gram-es.bin` (13G)

## How to Use

Here is an example of how to load and use the Basque model with KenLM in
Python:

```python
import kenlm
from huggingface_hub import hf_hub_download

# Download the Basque 5-gram model from the Hugging Face Hub (~11 GB).
filepath = hf_hub_download(repo_id="HiTZ/whisper-lm-ngrams", filename="5gram-eu.bin")

# Load the binary model and score a sentence. With `bos`/`eos`, the
# sentence-boundary tokens are added, and the result is the total
# log10 probability of the sentence.
model = kenlm.Model(filepath)
print(model.score("talka diskoetxearekin grabatzen ditut beti abestien maketak", bos=True, eos=True))
```
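A common way to use such scores in ASR is to rerank competing transcript
hypotheses: the candidate with the higher log10 probability is preferred. Here
is a minimal sketch; the `more_likely` helper is our own illustration, not part
of the KenLM API, and it works with any object exposing a KenLM-style `score`
method:

```python
def more_likely(model, sentence_a: str, sentence_b: str) -> str:
    """Return whichever sentence the language model scores higher.

    Works with any object exposing a KenLM-style
    ``score(sentence, bos=..., eos=...)`` method that returns a
    log10 probability (higher, i.e. closer to zero, is more likely).
    """
    score_a = model.score(sentence_a, bos=True, eos=True)
    score_b = model.score(sentence_b, bos=True, eos=True)
    return sentence_a if score_a >= score_b else sentence_b
```

In practice, the language-model score is usually combined with the ASR model's
own score (shallow fusion) rather than used on its own.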

## Training Data

The models were trained on corpora capped at 27 million sentences each to
maintain comparability and manageability. Here's a breakdown of the sources for
each language:

* **Basque**: [EusCrawl 1.0](https://www.ixa.eus/euscrawl/)

* **Galician**: [SLI GalWeb Corpus](https://github.com/xavier-gz/SLI_Galician_Corpora)

* **Catalan**: [Catalan Textual Corpus](https://zenodo.org/records/4519349)

* **Spanish**: [Spanish LibriSpeech MLS](https://openslr.org/94/)

Additional data from recent [Wikipedia dumps](https://dumps.wikimedia.org/) and
the [Opus corpus](https://opus.nlpl.eu/) were used as needed to reach the
sentence cap.

## Model Performance

Performance varies with the language and the quality of the training data. It
is typically evaluated via perplexity on held-out text and via the improvement
in ASR accuracy (e.g., word error rate) when the model is integrated into
decoding.
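Concretely, KenLM's `score` returns a total log10 probability, and perplexity
is ten raised to the negative average log10 probability per scored token (the
words plus the end-of-sentence token when `eos=True`). A small sketch of that
relationship, where the helper and the example score are illustrative rather
than measured values:

```python
def perplexity_from_score(log10_score: float, num_tokens: int) -> float:
    """Convert a total log10 probability into perplexity.

    ``num_tokens`` counts the scored tokens, i.e. the words plus the
    </s> token when the sentence was scored with ``eos=True``.
    """
    return 10.0 ** (-log10_score / num_tokens)

# A hypothetical 5-word sentence scored at -12.3 total log10 probability
# (6 scored tokens including </s>):
print(perplexity_from_score(-12.3, 6))  # ≈ 112.2
```

KenLM's Python bindings also expose `model.perplexity(sentence)`, which
computes this directly.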

## Considerations

These models are designed for use in research and production for
language-specific ASR tasks. They should be tested for bias and fairness to
ensure appropriate use in diverse settings.

## Citation

If you use these models in your research, please cite:

```bibtex
@misc{dezuazo2025whisperlmimprovingasrmodels,
  title={Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages},
  author={Xabier de Zuazo and Eva Navas and Ibon Saratxaga and Inma Hernáez Rioja},
  year={2025},
  eprint={2503.23542},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2503.23542},
}
```

For more details, see the related preprint,
[arXiv:2503.23542](https://arxiv.org/abs/2503.23542).

## Licensing

This model is available under the
[Creative Commons Attribution 4.0 International License (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/).
You are free to use, modify, and distribute this model as long as you credit
the original creators.

## Acknowledgements

We would like to thank Niels Rogge for his guidance and support in the creation
of this model repository. You can find more about his work at
[his Hugging Face profile](https://huggingface.co/nielsr).