parakeet-rnnt-1.1b_cv17_es_ep18_1270h

Paper
Model Summary
Intended Uses and Limitations
How to Get Started with the Model
Training Details
Citation
Additional Information

Paper

PDF: Automatic Validation of the Non-Validated Spanish Speech Data of Common Voice 17.0

Model Summary

"parakeet-rnnt-1.1b_cv17_es_ep18_1270h" is an acoustic model based on "nvidia/parakeet-rnnt-1.1b", suitable for Automatic Speech Recognition in Spanish.

Intended Uses and Limitations

This model can be used for Automatic Speech Recognition (ASR) in Spanish. The model is intended to transcribe audio files in Spanish to plain text without punctuation.
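Because the model outputs plain text without punctuation, reference transcripts usually need the same treatment before they are compared with its output. The following is a minimal sketch of such a normalization step. It is not taken from the model's training pipeline: the helper name `strip_punctuation` and the exact set of characters removed are assumptions, and letter case is left untouched because this card does not state how case is handled.

```python
import string

# Assumption: ASCII punctuation plus a few marks common in Spanish text.
# This set is illustrative, not the one used to prepare the training data.
SPANISH_PUNCTUATION = string.punctuation + "¿¡«»…“”"


def strip_punctuation(text: str) -> str:
    """Replace punctuation with spaces and collapse repeated whitespace.

    Accented letters and letter case are kept as they are.
    """
    table = str.maketrans({ch: " " for ch in SPANISH_PUNCTUATION})
    return " ".join(text.translate(table).split())


print(strip_punctuation("¿Dónde está la biblioteca? ¡Allí, a la derecha!"))
# Dónde está la biblioteca Allí a la derecha
```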

How to Get Started with the Model

For an up-to-date, working version of this code, please see NVIDIA's official repository.

Installation

To use this model, install the NVIDIA NeMo Framework:

Create a virtual environment:

python -m venv /path/to/venv

Activate the environment:

source /path/to/venv/bin/activate

Install the modules:

BRANCH=main
python -m pip install "git+https://github.com/NVIDIA/NeMo.git@${BRANCH}#egg=nemo_toolkit[all]"
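After installation, it can be useful to confirm that the package is visible to the interpreter of the active virtual environment. The sketch below uses only the Python standard library. The helper name `is_installed` is hypothetical, and the check only confirms that the `nemo` package can be located, not that every optional dependency works.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("nemo"):
    print("NeMo was found in this environment.")
else:
    print("NeMo was not found; check that the virtual environment is active.")
```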

For Inference

In order to transcribe audio in Spanish using this model, you can follow this example:

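import nemo.collections.asr as nemo_asr

# Download the checkpoint from the Hugging Face Hub and load it
asr_model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(model_name="projecte-aina/parakeet-rnnt-1.1b_cv17_es_ep18_1270h")

# Transcribe a list of audio files; one result is returned per file
output = asr_model.transcribe(['YOUR_WAV_FILE.wav'])
print(output[0].text)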
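To transcribe several recordings at once, the same `transcribe` call can be given a list of paths. The following is a minimal sketch under some assumptions: the helper `transcribe_directory` is hypothetical and not part of NeMo, `asr_model` is the model loaded as in the example above, and each result is assumed to expose a `.text` attribute as in that example. The `getattr` fallback is only a guard for NeMo versions that might return plain strings.

```python
from pathlib import Path


def transcribe_directory(asr_model, directory, batch_size=4):
    """Transcribe every .wav file in `directory`.

    Returns a dict mapping file name to transcription, in sorted file order.
    """
    wav_paths = sorted(Path(directory).glob("*.wav"))
    if not wav_paths:
        return {}
    outputs = asr_model.transcribe(
        [str(path) for path in wav_paths], batch_size=batch_size
    )
    # Use .text when present, otherwise fall back to the raw result.
    return {
        path.name: getattr(out, "text", out)
        for path, out in zip(wav_paths, outputs)
    }


# Usage, assuming asr_model was loaded with from_pretrained(...):
#   results = transcribe_directory(asr_model, "/path/to/wavs")
#   for name, text in results.items():
#       print(name, text)
```

Passing all paths in one call lets the framework batch the audio internally, which is usually faster than calling `transcribe` once per file.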

Training Details

Training data

The model was trained on the "cv17_es_other_automatically_verified" dataset (784 hours and 50 minutes) combined with around 485 hours of Spanish data taken from the "validated" split of Mozilla Common Voice 17.0.
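As a quick check, the two figures above add up to the "1270h" in the model name and to the hours of training audio listed under the hyperparameters. Since the 485-hour figure is approximate, the total is approximate as well.

```python
# Durations as stated in this card.
auto_verified_hours = 784 + 50 / 60  # 784 hours and 50 minutes
validated_hours = 485                # "around 485 hours" (approximate)

total_hours = auto_verified_hours + validated_hours
print(f"{total_hours:.1f} hours")  # 1269.8 hours
print(round(total_hours))          # 1270
```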

Training procedure

This model is the result of fine-tuning the model "parakeet-rnnt-1.1b" by following this tutorial.

Training Hyperparameters

language: spanish
hours of training audio: 1270
learning rate: 2e-4
devices=4
num_nodes=8
batch_size=8
accelerator=accelerator
strategy="ddp"
max_epochs=50
enable_checkpointing=True
logger=False
log_every_n_steps=100
check_val_every_n_epoch=1
precision='bf16-mixed'
callbacks=[checkpoint_callback]
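The hyperparameters above imply a distributed run across several nodes. As a rough sketch, the global batch size per optimizer step can be estimated from them. This rests on two assumptions that the card does not confirm: that `batch_size` is the per-device batch size, and that no gradient accumulation was used (none is listed).

```python
# Values listed in the hyperparameters above.
devices = 4      # GPUs per node
num_nodes = 8
batch_size = 8   # assumed to be per device

# Assumption: no gradient accumulation, since none is listed in the card.
accumulate_grad_batches = 1

total_gpus = devices * num_nodes
effective_batch_size = total_gpus * batch_size * accumulate_grad_batches

print(total_gpus)            # 32
print(effective_batch_size)  # 256
```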

Citation

If this model contributes to your research, please cite the work:

@inproceedings{mena2025automatic,
  title={Automatic Validation of the Non-Validated Spanish Speech Data of Common Voice 17.0},
  author={Mena, Carlos Daniel Hern{\'a}ndez and Scalvini, Barbara and {\'\i} L{\'a}g, D{\'a}vid},
  booktitle={Proceedings of the Third Workshop on Resources and Representations for Under-Resourced Languages and Domains (RESOURCEFUL-2025)},
  pages={58--63},
  year={2025}
}

Additional Information

Author

The fine-tuning process was performed in November 2024 at the Language Technologies Unit of the Barcelona Supercomputing Center by Carlos Daniel Hernández Mena.

Contact

For further information, please send an email to [email protected].

Copyright

License

Apache-2.0

Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU (NextGenerationEU), within the framework of the ILENIA project, reference 2022/TL22/00215337.
