Mohammed-Alzahrani-ai committed
Commit 0d9e8fd · verified · 1 Parent(s): 2980f9b

Update README.md

Files changed (1): README.md (+115 -72)

README.md CHANGED

---
datasets:
- facebook/multilingual_librispeech
- Parlament-Parla-v1
- gttsehu/basque_parliament_1
- facebook/voxpopuli
- johnatanebonilla/coser_lv_full
- collectivat/tv3_parla
- mozilla-foundation/common_voice_16_0
language:
- es
- ca
metrics:
- wer
- cer
tags:
- automatic-speech-recognition
- speech
- multilingual
- nemo
model-index:
- name: Mohammed-Alzahrani-ai/stt_ca-es_conformer_transducer_large_fine_tuned
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      type: automatic-speech-recognition
      name: Combined (Parlament-Parla-v1, MLS, Voxpopuli, etc.)
    metrics:
    - name: WER (Spanish)
      type: wer
      value: 0.08
    - name: CER (Spanish)
      type: cer
      value: 0.04
    - name: WER (Catalan)
      type: wer
      value: 0.10
    - name: CER (Catalan)
      type: cer
      value: 0.05
---
# NVIDIA Conformer-Transducer Large (ca-es)

## Table of Contents
<details>
<summary>Click to expand</summary>

- [Model Description](#model-description)
- [Intended Uses and Limitations](#intended-uses-and-limitations)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
- [Training Details](#training-details)
- [Citation](#citation)
- [Additional Information](#additional-information)

</details>

## Summary

The "stt_ca-es_conformer_transducer_large" model is an acoustic model based on ["NVIDIA/stt_es_conformer_transducer_large"](https://huggingface.co/nvidia/stt_es_conformer_transducer_large/), suitable for bilingual Catalan-Spanish Automatic Speech Recognition.

## Model Description

This model transcribes speech and was fine-tuned on a bilingual Catalan-Spanish (ca-es) dataset comprising 4,000 hours of audio. It is a "large" variant of Conformer-Transducer, with around 120 million parameters. We expanded its tokenizer vocabulary to 5.5k tokens to include lowercase and uppercase letters as well as punctuation.
See the [NeMo documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/models.html#conformer-transducer) for complete architecture details.
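As a quick sanity check of the expanded tokenizer described above, the vocabulary size of the restored checkpoint can be inspected. This is a minimal sketch; it assumes NeMo is already installed (see [Installation](#installation) below), that the `.nemo` checkpoint from this repository has been downloaded, and the filename used here is only a placeholder.

```python
import nemo.collections.asr as nemo_asr

# Placeholder path to the downloaded .nemo checkpoint from this repository
ckpt_path = "stt_ca-es_conformer_transducer_large.nemo"

# Restore the checkpoint and print the tokenizer vocabulary size
model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(ckpt_path)
print(model.tokenizer.vocab_size)  # expected to be on the order of 5.5k tokens
```
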
## Intended Uses and Limitations

This model can be used for Automatic Speech Recognition (ASR) in Catalan and Spanish: it transcribes audio files in either language to plain text with punctuation.

### Installation

To use this model, install [NVIDIA NeMo](https://github.com/NVIDIA/NeMo). We recommend you install it after you've installed the latest PyTorch version.
```bash
pip install nemo_toolkit['all']
```
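
As an optional follow-up, the snippet below checks that the toolkit and its ASR collection import correctly and prints the installed version; it is only a sanity check and assumes the install above succeeded.

```python
# Sanity check: import NeMo and its ASR collection, then print the version
import nemo
import nemo.collections.asr as nemo_asr

print(nemo.__version__)
```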

### For Inference
To transcribe audio in Catalan or in Spanish using this model, you can follow this example:

```python
import nemo.collections.asr as nemo_asr

# Paths to the downloaded .nemo checkpoint and to the audio file to transcribe
model_path = "stt_ca-es_conformer_transducer_large.nemo"
audio_path = "audio_in_catalan_or_spanish.wav"

# Restore the model and transcribe the audio file
nemo_asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(model_path)
transcription = nemo_asr_model.transcribe([audio_path])[0].text
print(transcription)
```
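
If the checkpoint is not on disk yet, it can also be fetched from the Hugging Face Hub before restoring it. The sketch below uses the repository id from the model-index metadata above; the `.nemo` filename is an assumed placeholder, so replace it with the actual filename listed under "Files and versions".

```python
from huggingface_hub import hf_hub_download
import nemo.collections.asr as nemo_asr

# Download the checkpoint from the Hub (the filename below is an assumed placeholder)
model_path = hf_hub_download(
    repo_id="Mohammed-Alzahrani-ai/stt_ca-es_conformer_transducer_large_fine_tuned",
    filename="stt_ca-es_conformer_transducer_large.nemo",
)

nemo_asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(model_path)
```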

## Training Details

### Training data

The model was fine-tuned on bilingual datasets in Catalan and Spanish, for a total of 4,000 hours, including:
- [Parlament-Parla-v1](https://openslr.org/59/)
- [multilingual_librispeech](https://huggingface.co/datasets/facebook/multilingual_librispeech)
- [basque_parliament_1](https://huggingface.co/datasets/gttsehu/basque_parliament_1)
- [Voxpopuli](https://huggingface.co/datasets/facebook/voxpopuli) (The datasets will be made accessible shortly.)
- [Coser](https://huggingface.co/datasets/johnatanebonilla/coser)
- [tv3_parla](https://huggingface.co/datasets/collectivat/tv3_parla)
- [common_voice_16_0](https://huggingface.co/datasets/mozilla-foundation/common_voice_16_0)

### Training procedure

This model is the result of fine-tuning ["projecte-aina/stt_ca-es_conformer_transducer_large"](https://huggingface.co/projecte-aina/stt_ca-es_conformer_transducer_large) on the data described above.
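For readers who want a sense of what such a fine-tuning setup can look like in NeMo, here is a minimal sketch. It is not the actual training configuration used for this model: the manifest paths, batch size, and trainer settings are illustrative assumptions, while the real run used the 4,000-hour bilingual data described above.

```python
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

# Restore the base checkpoint to fine-tune (path is a placeholder)
model = nemo_asr.models.EncDecRNNTBPEModel.restore_from("stt_ca-es_conformer_transducer_large.nemo")

# Point the model at NeMo-style JSON-lines manifests (paths and sizes are placeholders)
model.setup_training_data({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
})
model.setup_validation_data({
    "manifest_filepath": "dev_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
})

# NeMo models are PyTorch Lightning modules, so a standard Trainer drives training
trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=1)
trainer.fit(model)
```
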
### Results

| Language | WER  | CER  |
|----------|------|------|
| Spanish  | 0.08 | 0.04 |
| Catalan  | 0.10 | 0.05 |
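
As a rough sketch of how figures like these can be reproduced, the snippet below transcribes a NeMo-style test manifest and scores the output with the `jiwer` library. The file paths and the use of `jiwer` are assumptions made for illustration; they are not part of this model card's official evaluation setup.

```python
import json

import jiwer
import nemo.collections.asr as nemo_asr

# Placeholder paths: the downloaded checkpoint and a NeMo-style manifest,
# with one JSON object per line: {"audio_filepath": ..., "text": ...}
model = nemo_asr.models.EncDecRNNTBPEModel.restore_from("stt_ca-es_conformer_transducer_large.nemo")
with open("test_manifest.json") as f:
    entries = [json.loads(line) for line in f]

references = [entry["text"] for entry in entries]
hypotheses = [hyp.text for hyp in model.transcribe([entry["audio_filepath"] for entry in entries])]

# Corpus-level word and character error rates
print("WER:", jiwer.wer(references, hypotheses))
print("CER:", jiwer.cer(references, hypotheses))
```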