---
license: cc-by-4.0
language:
- ca
- es
base_model:
- nvidia/stt_es_conformer_transducer_large
tags:
- automatic-speech-recognition
- NeMo
model-index:
- name: stt_ca-es_conformer_transducer_large
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: CV Benchmark Catalan Accents
      type: projecte-aina/commonvoice_benchmark_catalan_accents
      config: ca
      split: test
      args:
        language: ca
    metrics:
    - name: Test WER
      type: wer
      value: 2.503
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: Mozilla Common Voice 17.0
      type: mozilla-foundation/common_voice_17_0
      config: es
      split: test
      args:
        language: es
    metrics:
    - name: Test WER
      type: wer
      value: 3.88
---
# NVIDIA Conformer-Transducer Large (ca-es)

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Summary](#summary)
- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Training Details](#training-details)
- [Citation](#citation)
- [Additional Information](#additional-information)

</details>

## Summary

The "stt_ca-es_conformer_transducer_large" is an acoustic model based on ["NVIDIA/stt_es_conformer_transducer_large"](https://huggingface.co/nvidia/stt_es_conformer_transducer_large/) suitable for Bilingual Catalan-Spanish Automatic Speech Recognition.

## Model Description

This model transcribes speech using the lowercase Catalan and Spanish alphabets, including spaces. It was fine-tuned on a bilingual ca-es dataset comprising 7,426 hours of speech, and is a "large" variant of Conformer-Transducer with around 120 million parameters.
See the [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-transducer) for complete architecture details.

## Intended Uses and Limitations

This model can be used for Automatic Speech Recognition (ASR) in Catalan and Spanish. It is intended to transcribe audio files in Catalan and Spanish to plain text without punctuation.

### Installation

To use this model, install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend installing it after you have installed the latest PyTorch version. Note that the extras specifier is quoted so the command also works in shells such as zsh:
```
pip install "nemo_toolkit[all]"
```
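
After installation, a quick sanity check that the toolkit and its ASR collection import correctly (this check is just a suggestion, not part of the official instructions):

```python
# Verify that NeMo and its ASR collection are importable
import nemo
import nemo.collections.asr as nemo_asr

print(nemo.__version__)
```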

### For Inference
To transcribe audio in Catalan or Spanish using this model, you can follow this example:

```python
import nemo.collections.asr as nemo_asr

# Path to the downloaded .nemo checkpoint for this model
model_path = "stt_ca-es_conformer_transducer_large.nemo"
nemo_asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(model_path)

# Transcribe a local audio file (16 kHz mono WAV recommended)
audio_path = "audio_in_catalan_or_spanish.wav"
transcription = nemo_asr_model.transcribe([audio_path])[0].text
print(transcription)
```
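
The `transcribe` call also accepts several files at once and returns one hypothesis per file. A minimal sketch, assuming two hypothetical local recordings:

```python
# Hypothetical input files; transcribe() returns one hypothesis per file
audio_paths = ["sample_ca.wav", "sample_es.wav"]
for hypothesis in nemo_asr_model.transcribe(audio_paths):
    print(hypothesis.text)
```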

## Training Details

### Training data

The model was trained on bilingual Catalan and Spanish datasets totalling 7,426 hours, including:
- [Parlament-Parla-v3](https://huggingface.co/datasets/projecte-aina/parlament_parla_v3)
- [Corts Valencianes](https://huggingface.co/datasets/projecte-aina/corts_valencianes_asr_a)
- [3cat](https://www.isca-archive.org/iberspeech_2024/hernandezmena24_iberspeech.pdf) 
- [IB3](https://huggingface.co/datasets/projecte-aina/ib3_ca_asr) (The datasets will be made accessible shortly.)
- [ciempiess light](https://huggingface.co/datasets/ciempiess/ciempiess_light)
- [ciempiess fem](https://huggingface.co/datasets/ciempiess/ciempiess_fem)
- [ciempiess complementary](https://huggingface.co/datasets/ciempiess/ciempiess_complementary)
- [ciempiess balance](https://huggingface.co/datasets/ciempiess/ciempiess_balance)
- [CHM150](https://huggingface.co/datasets/carlosdanielhernandezmena/chm150_asr)
- [Tedx spanish](https://huggingface.co/datasets/ciempiess/tedx_spanish)
- [librivox spanish](https://huggingface.co/datasets/ciempiess/librivox_spanish)
- [Wikipedia spanish](https://huggingface.co/datasets/ciempiess/wikipedia_spanish)
- [voxforge spanish](https://huggingface.co/datasets/ciempiess/voxforge_spanish)
- [Tele con ciencia](https://huggingface.co/datasets/ciempiess/tele_con_ciencia)
- [Argentinian Spanish Speech Dataset](https://openslr.org/61/)
- [Dimex100 light](https://huggingface.co/datasets/carlosdanielhernandezmena/dimex100_light)
- [Glissando Spanish](https://glissando.labfon.uned.es/es)
- [Herico](https://openslr.org/39/)
- [Latino40](https://catalog.ldc.upenn.edu/LDC95S28)
- [Common voice 17 es](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0)
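
For reference, NeMo's standard way of consuming corpora like those above is a JSON-lines manifest, one entry per utterance. The field names below are NeMo's standard manifest keys; the audio paths and transcripts are hypothetical:

```python
import json

# Hypothetical (audio path, duration in seconds, transcript) triples
samples = [
    ("clips/ca_0001.wav", 4.2, "bon dia a tothom"),
    ("clips/es_0001.wav", 3.7, "buenos días a todos"),
]

# NeMo's standard manifest keys: audio_filepath, duration, text
with open("train_manifest.json", "w", encoding="utf-8") as f:
    for path, duration, text in samples:
        entry = {"audio_filepath": path, "duration": duration, "text": text}
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```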

### Training procedure

This model is the result of fine-tuning the base model ["nvidia/stt_es_conformer_transducer_large"](https://huggingface.co/nvidia/stt_es_conformer_transducer_large) following this [tutorial](https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Transducers_with_HF_Datasets.ipynb).
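
A minimal sketch of what such a fine-tuning setup looks like in NeMo. The linked tutorial streams Hugging Face datasets directly; this sketch instead uses NeMo's standard manifest route (as in the example above), and the batch size, epoch count, and paths are illustrative, not the values used to train this model:

```python
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

# Start from the Spanish base model
model = nemo_asr.models.EncDecRNNTBPEModel.from_pretrained(
    "nvidia/stt_es_conformer_transducer_large"
)

# Point the model at bilingual ca-es manifests (illustrative configs)
model.setup_training_data({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
model.setup_validation_data({
    "manifest_filepath": "val_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
})

# Illustrative trainer settings; the real run used its own hyperparameters
trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=50)
trainer.fit(model)
```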

## Citation
If this model contributes to your research, please cite the work:
```bibtex
@misc{conformer-transducer-BSC-2024,
      title={Bilingual ca-es ASR Model: stt_ca-es_conformer_transducer_large.}, 
      author={Messaoudi, Abir and Külebi, Baybars},
      organization={Barcelona Supercomputing Center},
      url={https://huggingface.co/projecte-aina/stt_ca-es_conformer_transducer_large},
      year={2024}
}
```

## Additional Information

### Author

The fine-tuning process was performed during 2024 in the [Language Technologies Unit](https://huggingface.co/BSC-LT) of the [Barcelona Supercomputing Center](https://www.bsc.es/) by [Abir Messaoudi](https://huggingface.co/AbirMessaoudi).

For the Valencian Catalan data, we collaborated with [CENID](https://cenid.es/) within the framework of the [ILENIA](https://proyectoilenia.es) project.

### Contact
For further information, please send an email to <[email protected]>.

### Copyright
Copyright (c) 2024 by the Language Technologies Unit, Barcelona Supercomputing Center.

### License

[CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/)

### Funding
This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU through NextGenerationEU, within the framework of the project ILENIA (reference 2022/TL22/00215337).

The training of the model was possible thanks to the computing time provided by [Barcelona Supercomputing Center](https://www.bsc.es/) through MareNostrum 5.