---
datasets:
- facebook/multilingual_librispeech
- Parlament-Parla-v1
- gttsehu/basque_parliament_1
- facebook/voxpopuli
- johnatanebonilla/coser_lv_full
- collectivat/tv3_parla
- mozilla-foundation/common_voice_16_0
language:
- es
- ca
metrics:
- wer
- cer
tags:
- automatic-speech-recognition
- speech
- multilingual
- nemo
model-index:
- name: Mohammed-Alzahrani-ai/stt_ca-es_conformer_transducer_large_fine_tuned
results:
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
type: automatic-speech-recognition
name: Combined (Parlament-Parla-v1, MLS, Voxpopuli, etc.)
metrics:
- name: WER (Spanish)
type: wer
value: 0.08
- name: CER (Spanish)
type: cer
value: 0.04
- name: WER (Catalan)
type: wer
value: 0.10
- name: CER (Catalan)
type: cer
value: 0.05
---
# NVIDIA Conformer-Transducer Large (ca-es)
## Table of Contents
<details>
<summary>Click to expand</summary>

- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Training Details](#training-details)
- [Citation](#citation)
- [Additional Information](#additional-information)
</details>
## Summary
The "stt_ca-es_conformer_transducer_large" is an acoustic model based on ["NVIDIA/stt_es_conformer_transducer_large"](https://huggingface.co/nvidia/stt_es_conformer_transducer_large/) suitable for Bilingual Catalan-Spanish Automatic Speech Recognition.
## Model Description
This model transcribes speech and was fine-tuned on a bilingual Catalan-Spanish (ca-es) dataset comprising 4,000 hours of audio. It is a "large" variant of Conformer-Transducer, with around 120 million parameters. We expanded its tokenizer vocabulary to 5.5k tokens to include lowercase, uppercase, and punctuation.
See the [model architecture](#model-architecture) section and [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-transducer) for complete architecture details.
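As a quick sanity check on the size and vocabulary figures above, the model can be loaded with NeMo and inspected. The sketch below assumes the `.nemo` checkpoint has already been downloaded locally; the filename is a placeholder:
```python
import nemo.collections.asr as nemo_asr

# Placeholder path to the downloaded .nemo checkpoint
MODEL_PATH = "stt_ca-es_conformer_transducer_large.nemo"

asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(MODEL_PATH)

# Expanded BPE vocabulary (~5.5k tokens) and total parameter count (~120M)
print("Tokenizer vocabulary size:", asr_model.tokenizer.vocab_size)
print("Parameters:", sum(p.numel() for p in asr_model.parameters()))
```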
## Intended Uses and Limitations
This model can be used for Automatic Speech Recognition (ASR) in Catalan and Spanish. It is intended to transcribe audio files in Catalan and Spanish to plain text with punctuation.
### Installation
To use this model, install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed the latest PyTorch version.
```bash
pip install nemo_toolkit['all']
```
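To verify the installation (a minimal check, independent of this model), import the ASR collection and print the installed NeMo version:
```python
import nemo
import nemo.collections.asr as nemo_asr  # import succeeds only if the ASR extras are installed

print("NeMo version:", nemo.__version__)
```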
### For Inference
To transcribe audio in Catalan or in Spanish using this model, you can follow this example (the checkpoint and audio paths are placeholders):
```python
import nemo.collections.asr as nemo_asr

# Placeholder paths to the downloaded .nemo checkpoint and a 16 kHz mono audio file
model_path = "stt_ca-es_conformer_transducer_large.nemo"
audio_path = "audio_sample.wav"

nemo_asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(model_path)
transcription = nemo_asr_model.transcribe([audio_path])[0].text
print(transcription)
```
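For several files, `transcribe` also accepts a list of paths together with a `batch_size` argument. The return type differs between NeMo releases (plain strings in older versions, hypothesis objects with a `.text` field in newer ones), so the sketch below handles both; the file names are placeholders:
```python
# Placeholder local audio files (16 kHz mono WAV is the usual NeMo input)
audio_files = ["sample_ca.wav", "sample_es.wav"]

results = nemo_asr_model.transcribe(audio_files, batch_size=2)

for path, result in zip(audio_files, results):
    # Older NeMo releases return plain strings, newer ones hypothesis objects
    text = result.text if hasattr(result, "text") else result
    print(f"{path}: {text}")
```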
## Training Details
### Training data
The model was fine-tuned on bilingual Catalan and Spanish datasets totaling about 4,000 hours, including:
- [Parlament-Parla-v1](https://openslr.org/59/)
- [multilingual_librispeech](https://huggingface.co/datasets/facebook/multilingual_librispeech)
- [basque_parliament_1](https://huggingface.co/datasets/gttsehu/basque_parliament_1)
- [Voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli) (The datasets will be made accessible shortly.)
- [Coser](https://huggingface.co/datasets/johnatanebonilla/coser)
- [tv3_parla](https://huggingface.co/datasets/collectivat/tv3_parla)
- [common_voice_16_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_16_0)
### Training procedure
This model is the result of fine-tuning ["projecte-aina/stt_ca-es_conformer_transducer_large"](https://huggingface.co/projecte-aina/stt_ca-es_conformer_transducer_large).
### Results
| Language | WER  | CER  |
|----------|------|------|
| Spanish  | 0.08 | 0.04 |
| Catalan  | 0.10 | 0.05 |
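The WER and CER figures above can be reproduced on a custom test set with the `jiwer` package; using `jiwer` is an assumption here, not necessarily the evaluation tooling used for this model, and the transcripts below are placeholders:
```python
import jiwer

# Placeholder reference transcripts and model outputs
references = ["bon dia a tothom", "buenos días a todos"]
hypotheses = ["bon dia a tothom", "buenos dias a todos"]

print("WER:", jiwer.wer(references, hypotheses))
print("CER:", jiwer.cer(references, hypotheses))
```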