
license: cc-by-4.0
language:
  - ca
  - es
base_model:
  - nvidia/stt_es_conformer_transducer_large
tags:
  - automatic-speech-recognition
  - NeMo
model-index:
  - name: stt_ca-es_conformer_transducer_large
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          config: ca
          split: test
          args:
            language: ca
        metrics:
          - name: Test WER
            type: wer
            value: 2.503
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: Mozilla Common Voice 17.0
          type: mozilla-foundation/common_voice_17_0
          config: ca
          split: test
          args:
            language: es
        metrics:
          - name: Test WER
            type: wer
            value: 3.88

NVIDIA Conformer-Transducer Large (ca-es)

Model Description
Intended Uses and Limitations
How to Get Started with the Model
Training Details
Citation
Additional Information

Summary

The "stt_ca-es_conformer_transducer_large" is an acoustic model based on "nvidia/stt_es_conformer_transducer_large" and is suited to bilingual Catalan-Spanish Automatic Speech Recognition.

Model Description

This model transcribes speech into the lowercase Catalan and Spanish alphabets, including spaces, and was fine-tuned on a bilingual ca-es dataset comprising 7426 hours of audio. It is a "large" variant of Conformer-Transducer, with around 120 million parameters. See the model architecture section and NeMo documentation for complete architecture details.

Intended Uses and Limitations

This model can be used for Automatic Speech Recognition (ASR) in Catalan and Spanish. It is intended to transcribe audio files in Catalan and Spanish to plain text without punctuation.
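Because the output is lowercase text without punctuation, reference transcripts usually need the same treatment before they are compared with the model's output. The following is a minimal sketch of such a normalization together with a plain word error rate (WER) computation. The normalization rules and the helper names `normalize` and `wer` are assumptions made for illustration; this card does not describe the exact procedure used to obtain the reported scores.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace (illustrative rules)."""
    text = unicodedata.normalize("NFC", text).lower()
    # Keep word characters (letters including accented ones, digits, underscore),
    # apostrophes, the Catalan middle dot, hyphens, and whitespace.
    text = re.sub(r"[^\w\s'·-]", " ", text)
    return " ".join(text.split())


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Single-row dynamic programming over the edit-distance table.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev_diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            prev_diag, row[j] = row[j], min(
                row[j] + 1,        # deletion
                row[j - 1] + 1,    # insertion
                prev_diag + cost,  # substitution or match
            )
    return row[-1] / len(ref)


reference = normalize("Bon dia a tothom!")
hypothesis = "bon dia tothom"
print(f"WER: {100 * wer(reference, hypothesis):.1f}%")
```

The function returns a fraction; multiply by 100 to compare it with WER figures expressed as percentages.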

Installation

To use this model, install NVIDIA NeMo. We recommend installing it after you have installed the latest PyTorch version.

pip install "nemo_toolkit[all]"

For Inference

To transcribe Catalan or Spanish audio with this model, you can follow this example:

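import nemo.collections.asr as nemo_asr

# Placeholder paths: replace with your local .nemo checkpoint and audio file.
model = "path/to/model.nemo"
audio_path = "path/to/audio.wav"

nemo_asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(model)
# [0][0] selects the best hypothesis for the first (and only) audio file.
transcription = nemo_asr_model.transcribe([audio_path])[0][0]
print(transcription)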
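The `[0][0]` indexing in the example depends on what `transcribe()` returns, and that shape may differ between NeMo releases. The sketch below shows one way to make the post-processing tolerant of this. It assumes two possible shapes: a `(best_hypotheses, all_hypotheses)` tuple, or a flat list whose items are either strings or objects with a `.text` attribute. The helper name `extract_texts` is hypothetical, and the assumed shapes should be checked against the NeMo version you have installed.

```python
def extract_texts(result):
    """Return plain transcription strings from a transcribe() result.

    Assumed shapes (not confirmed by this model card):
      * a (best_hypotheses, all_hypotheses) tuple, or
      * a flat list of strings or of objects exposing a `.text` attribute.
    """
    if isinstance(result, tuple):
        result = result[0]  # keep only the best hypotheses
    return [item if isinstance(item, str) else item.text for item in result]
```

With the model loaded as in the example, this would be called as `extract_texts(nemo_asr_model.transcribe([audio_path]))`, which yields one string per input file.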

Training Details

Training data

The model was trained on bilingual datasets in Catalan and Spanish, for a total of 7426 hours.

Training procedure

This model is the result of fine-tuning the base model "nvidia/stt_es_conformer_transducer_large" by following this tutorial.
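NeMo ASR training and fine-tuning generally read data from JSON-lines manifest files, where each line holds an `audio_filepath`, a `duration` in seconds, and the transcript `text`. The data preparation actually used for this model is not described in this card, so the following is only a sketch of how such a manifest could be written and its total duration checked. The helper names and the sample entries are hypothetical.

```python
import json
from pathlib import Path


def write_manifest(entries, manifest_path):
    """Write one JSON object per line with audio path, duration, and transcript."""
    path = Path(manifest_path)
    with path.open("w", encoding="utf-8") as f:
        for audio_filepath, duration_s, text in entries:
            record = {
                "audio_filepath": str(audio_filepath),
                "duration": float(duration_s),
                "text": text,
            }
            # ensure_ascii=False keeps accented characters readable in the file.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def total_hours(manifest_path):
    """Sum the `duration` fields (seconds) of a manifest and convert to hours."""
    with Path(manifest_path).open(encoding="utf-8") as f:
        seconds = sum(json.loads(line)["duration"] for line in f if line.strip())
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical sample entries: (audio path, duration in seconds, transcript).
    samples = [
        ("clips/ca_0001.wav", 4.2, "bon dia a tothom"),
        ("clips/es_0001.wav", 3.8, "buenos días a todos"),
    ]
    manifest = write_manifest(samples, "train_manifest.json")
    print(f"{total_hours(manifest):.4f} hours")
```

Transcripts in the manifest would normally be stored in the same lowercase, unpunctuated form that the model is expected to produce.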

Citation

If this model contributes to your research, please cite the work:

@misc{mena2024whisperlarge3catparla,
      title={Bilingual ca-es ASR Model: stt_ca-es_conformer_transducer_large},
      author={Messaoudi, Abir and Külebi, Baybars},
      organization={Barcelona Supercomputing Center},
      url={https://huggingface.co/projecte-aina/stt_ca-es_conformer_transducer_large},
      year={2024}
}

Additional Information

Author

The fine-tuning process was performed during 2024 in the Language Technologies Unit of the Barcelona Supercomputing Center by Abir Messaoudi.

Contact

For further information, please send an email to [email protected].

Copyright

License

CC-BY-4.0

Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project ILENIA with reference 2022/TL22/00215337.

The training of the model was made possible by the computing time provided by the Barcelona Supercomputing Center through MareNostrum 5.
