Commit a3090ea
Parent(s): 90809fc
Plots and tables updated with 2B model included
- README.md +84 -68
- images/cer.png +0 -0
- images/cer_comparison-conv.png +0 -0
- images/cer_comparison-read-aloud.png +0 -0
- images/comparison-conversation-cer.png +0 -0
- images/comparison-conversation-wer.png +0 -0
- images/comparison-read_aloud-cer.png +0 -0
- images/comparison-read_aloud-wer.png +0 -0
- images/wer.png +0 -0
- images/wer_comparison-conv.png +0 -0
- images/wer_comparison-read-aloud.png +0 -0
README.md
CHANGED
@@ -91,37 +91,42 @@ The results are tentative as the test set only includes 5 unique speakers, of wh
|
|
91 |
|
92 |
Note that the high generalization error on conversation data for models trained on read-aloud data is still being analyzed.
|
93 |
|
|
|
94 |
| Model | Number of parameters | Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
|
95 |
| :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
|
96 |
-
| CoRal-project/roest-wav2vec2-
|
97 |
-
|
|
98 |
-
| [CoRal-project/roest-
|
99 |
-
| [CoRal-project/roest-
|
100 |
-
| [
|
|
|
101 |
| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 46.4% | 57.4% |
|
102 |
|
103 |
-
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/comparison-conversation-cer.png">
|
104 |
|
105 |
-
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/
|
|
|
|
|
106 |
|
107 |
|
108 |
|
109 |
### Read-aloud CoRal Performance
|
110 |
|
111 |
-
|
112 |
-
|
|
113 |
-
|
|
114 |
-
| [CoRal-project/roest-wav2vec2-
|
115 |
-
|
|
116 |
-
| [CoRal-project/roest-wav2vec2-
|
117 |
-
| [
|
118 |
-
| [
|
|
|
|
|
119 |
|
120 |
**Note:** The benchmark for hviske-v2 has been re-evaluated, and the resulting confidence interval is larger than the one reported in its model card.
|
121 |
|
122 |
-
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/
|
123 |
|
124 |
-
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/
|
125 |
|
126 |
|
127 |
<details>
|
@@ -129,24 +134,26 @@ Note that the high generalization error on conversation data for models trained
|
|
129 |
<b>Detailed CER scores in % of evaluation across demographics on the CoRal test data</b>
|
130 |
</summary>
|
131 |
|
132 |
-
|
133 |
-
|
134 |
-
|
135 |
-
|
136 |
-
|
137 |
-
|
138 |
-
|
139 |
-
|
140 |
-
|
141 |
-
|
142 |
-
|
143 |
-
|
144 |
-
|
|
145 |
-
|
146 |
-
|
147 |
-
|
148 |
-
|
|
149 |
-
|
|
|
|
|
150 |
|
151 |
</details>
|
152 |
|
@@ -155,24 +162,26 @@ Note that the high generalization error on conversation data for models trained
|
|
155 |
<b>Detailed WER scores in % of evaluation across demographics on the CoRal test data</b>
|
156 |
</summary>
|
157 |
|
158 |
-
|
159 |
-
|
160 |
-
|
161 |
-
|
162 |
-
|
163 |
-
|
164 |
-
|
165 |
-
|
166 |
-
|
167 |
-
|
168 |
-
|
169 |
-
|
170 |
-
|
|
171 |
-
|
172 |
-
|
173 |
-
|
174 |
-
|
|
175 |
-
|
|
|
|
|
176 |
|
177 |
</details>
|
178 |
|
@@ -183,14 +192,17 @@ Note that the high generalization error on conversation data for models trained
|
|
183 |
|
184 |
Adding a post-processing language model (LM) can affect performance significantly. The Røst-v1 and Røst-v2 models use the same LM, namely the one trained for and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1).
|
185 |
|
186 |
-
|
187 |
-
|
|
188 |
-
|
|
189 |
-
| CoRal-project/roest-wav2vec2-
|
190 |
-
| [CoRal-project/roest-wav2vec2-
|
191 |
-
|
|
192 |
-
|
|
193 |
-
| [CoRal-project/roest-wav2vec2-
|
|
|
|
|
|
|
194 |
|
195 |
</details>
|
196 |
|
@@ -198,13 +210,17 @@ Note that the high generalization error on conversation data for models trained
|
|
198 |
### Performance on Other Datasets
|
199 |
|
200 |
The model was also tested against other datasets to evaluate generalizability:
|
201 |
-
|
202 |
-
|
|
203 |
-
|
|
204 |
-
|
|
205 |
-
| [
|
206 |
-
| [
|
207 |
-
| [
|
|
|
|
|
|
|
|
|
208 |
|
209 |
**Note:** The vocabulary used for training includes numerals (0, 1, 2, ..., 9), which are converted to text in a post-processing step. If the model misses the spaces between numerals, adjacent digits are read as a single number. This especially affects the NST score, as that dataset contains many numerals.
|
210 |
|
|
|
91 |
|
92 |
Note that the high generalization error on conversation data for models trained on read-aloud data is still being analyzed.
|
93 |
|
94 |
+
|
95 |
| Model | Number of parameters | Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
|
96 |
| :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
|
97 |
+
| [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2) | 2B | Read-aloud and conversation | **23.6%** | **34.3%** |
|
98 |
+
| CoRal-project/roest-wav2vec2-1B-v2 (This model) | 1B | Read-aloud and conversation | 23.9% | 36.7% |
|
99 |
+
| [CoRal-project/roest-wav2vec2-315m-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 24.2% | 37.7% |
|
100 |
+
| [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | 138% | 121% |
|
101 |
+
| [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | 123% | 80.5% |
|
102 |
+
| [syvai/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 78.2% | 72.6% |
|
103 |
| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 46.4% | 57.4% |
|
104 |
|
|
|
105 |
|
106 |
+
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/cer_comparison-conv.png">
|
107 |
+
|
108 |
+
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/wer_comparison-conv.png">
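The conversation table above reports scores above 100% for some models. Under the usual definition (assumed here, since the evaluation code is not part of this README), WER and CER divide the edit distance by the length of the reference, and because inserted tokens count as errors the ratio is not capped at 100%. A minimal sketch with made-up toy strings, not CoRal data:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


# Toy example (hypothetical): the hypothesis contains four extra words,
# so there are 4 insertions against a 3-word reference.
reference = "hej med dig"
hypothesis = "hej hej med med dig dig dig"
print(f"WER: {wer(reference, hypothesis):.0%}")
print(f"CER: {cer(reference, hypothesis):.0%}")
```

This only illustrates how the metric behaves; it says nothing about why the read-aloud models score poorly on conversation data, which the note above states is still being analyzed.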
|
109 |
|
110 |
|
111 |
|
112 |
### Read-aloud CoRal Performance
|
113 |
|
114 |
+
|
115 |
+
| Model | Number of parameters | Finetuned on data of type | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) WER |
|
116 |
+
| :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
|
117 |
+
| [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2) | 2B | Read-aloud and conversation | 6.2% ± 0.2% | 16.0% ± 0.4% |
|
118 |
+
| CoRal-project/roest-wav2vec2-1B-v2 (This model) | 1B | Read-aloud and conversation | 6.5% ± 0.2% | 16.4% ± 0.4% |
|
119 |
+
| [CoRal-project/roest-wav2vec2-315m-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 6.5% ± 0.2% | 16.3% ± 0.4% |
|
120 |
+
| [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | **4.3% ± 0.2%** | **10.4% ± 0.3%** |
|
121 |
+
| [CoRal-project/roest-wav2vec2-315M-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315M-v1) | 315M | Read-aloud | 6.6% ± 0.2% | 17.0% ± 0.4% |
|
122 |
+
| [syvai/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 4.7% ± 0.2% | 11.8% ± 0.3% |
|
123 |
+
| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 11.4% ± 0.3% | 28.3% ± 0.6% |
|
124 |
|
125 |
**Note:** The benchmark for hviske-v2 has been re-evaluated, and the resulting confidence interval is larger than the one reported in its model card.
|
126 |
|
127 |
+
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/cer_comparison-read-aloud.png">
|
128 |
|
129 |
+
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/wer_comparison-read-aloud.png">
|
130 |
|
131 |
|
132 |
<details>
|
|
|
134 |
<b>Detailed CER scores in % of evaluation across demographics on the CoRal test data</b>
|
135 |
</summary>
|
136 |
|
137 |
+
|
138 |
+
| Category | whisper-large-v3 | hviske-v2 | røst-whisper-large-v1 | røst-wav2vec2-315m-v1 | røst-wav2vec2-315m-v2 | røst-wav2vec2-1B-v2 | røst-wav2vec2-2B-v2 |
|
139 |
+
| :---------: | :--------------: | :-------: | :-------------------: | :-------------------: | :-------------------: | :-----------------: | :-----------------: |
|
140 |
+
| female | 12.3 | 5.4 | 5.1 | 7.4 | 7.2 | 7.3 | 7.2 |
|
141 |
+
| male | 10.6 | 4.1 | 3.6 | 5.8 | 5.7 | 5.8 | 5.3 |
|
142 |
+
| 0-25 | 9.1 | 3.8 | 3.4 | 5.4 | 5.3 | 5.1 | 4.7 |
|
143 |
+
| 25-50 | 11.4 | 4.7 | 4.0 | 6.2 | 6.0 | 5.7 | 5.3 |
|
144 |
+
| 50+ | 12.4 | 5.2 | 5.0 | 7.5 | 7.4 | 7.8 | 7.7 |
|
145 |
+
| Bornholmsk | 12.1 | 3.8 | 3.8 | 6.8 | 6.1 | 6.2 | 5.7 |
|
146 |
+
| Fynsk | 12.0 | 5.9 | 5.1 | 7.4 | 7.2 | 6.9 | 6.1 |
|
147 |
+
| Københavnsk | 5.6 | 2.1 | 1.9 | 3.3 | 3.2 | 3.0 | 2.6 |
|
148 |
+
| Non-native | 17.4 | 5.9 | 4.8 | 7.8 | 7.5 | 7.3 | 6.6 |
|
149 |
+
| Nordjysk | 4.7 | 1.5 | 1.6 | 2.6 | 2.8 | 2.6 | 2.3 |
|
150 |
+
| Sjællandsk | 8.0 | 3.3 | 3.0 | 4.4 | 4.5 | 3.9 | 3.8 |
|
151 |
+
| Sydømål | 7.7 | 4.3 | 4.1 | 6.4 | 6.4 | 6.5 | 5.8 |
|
152 |
+
| Sønderjysk | 20.0 | 9.4 | 8.8 | 11.9 | 11.6 | 12.6 | 13.3 |
|
153 |
+
| Vestjysk | 17.6 | 7.2 | 6.4 | 10.1 | 9.8 | 10.5 | 10.8 |
|
154 |
+
| Østjysk | 5.9 | 2.9 | 2.6 | 4.0 | 4.1 | 3.8 | 3.5 |
|
155 |
+
| Overall | 11.4 | 4.7 | 4.3 | 6.6 | 6.5 | 6.5 | 6.2 |
|
156 |
+
|
157 |
|
158 |
</details>
|
159 |
|
|
|
162 |
<b>Detailed WER scores in % of evaluation across demographics on the CoRal test data</b>
|
163 |
</summary>
|
164 |
|
165 |
+
|
166 |
+
| Category | whisper-large-v3 | hviske-v2 | røst-whisper-large-v1 | røst-wav2vec2-315m-v1 | røst-wav2vec2-315m-v2 | røst-wav2vec2-1B-v2 | røst-wav2vec2-2B-v2 |
|
167 |
+
| :---------: | :--------------: | :-------: | :-------------------: | :-------------------: | :-------------------: | :-----------------: | :-----------------: |
|
168 |
+
| female | 30.2 | 12.7 | 11.5 | 18.5 | 17.7 | 17.8 | 17.8 |
|
169 |
+
| male | 26.5 | 10.9 | 9.4 | 15.5 | 14.9 | 15.0 | 14.3 |
|
170 |
+
| 0-25 | 24.1 | 10.3 | 9.0 | 14.7 | 14.0 | 13.7 | 12.9 |
|
171 |
+
| 25-50 | 28.4 | 12.2 | 10.1 | 16.6 | 15.8 | 15.3 | 14.5 |
|
172 |
+
| 50+ | 30.0 | 12.1 | 11.3 | 18.2 | 17.7 | 18.5 | 18.7 |
|
173 |
+
| Bornholmsk | 31.6 | 10.4 | 9.8 | 17.7 | 15.7 | 16.4 | 15.3 |
|
174 |
+
| Fynsk | 29.3 | 14.3 | 12.1 | 18.3 | 17.7 | 16.7 | 15.2 |
|
175 |
+
| Københavnsk | 16.8 | 6.7 | 5.9 | 10.2 | 10.0 | 9.5 | 8.4 |
|
176 |
+
| Non-native | 40.9 | 15.4 | 12.2 | 20.9 | 19.4 | 19.4 | 18.1 |
|
177 |
+
| Nordjysk | 13.5 | 4.3 | 4.5 | 7.7 | 7.5 | 7.3 | 6.9 |
|
178 |
+
| Sjællandsk | 21.7 | 8.9 | 7.6 | 12.6 | 12.7 | 11.0 | 10.5 |
|
179 |
+
| Sydømål | 19.2 | 10.4 | 10.0 | 14.9 | 15.3 | 14.4 | 13.7 |
|
180 |
+
| Sønderjysk | 44.3 | 19.0 | 17.5 | 26.0 | 25.4 | 27.8 | 29.6 |
|
181 |
+
| Vestjysk | 42.0 | 17.7 | 15.0 | 26.3 | 25.2 | 26.7 | 28.3 |
|
182 |
+
| Østjysk | 16.9 | 8.2 | 7.5 | 11.7 | 11.3 | 10.8 | 10.1 |
|
183 |
+
| Overall | 28.3 | 11.8 | 10.4 | 17.0 | 16.3 | 16.4 | 16.0 |
|
184 |
+
|
185 |
|
186 |
</details>
|
187 |
|
|
|
192 |
|
193 |
Adding a post-processing language model (LM) can affect performance significantly. The Røst-v1 and Røst-v2 models use the same LM, namely the one trained for and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1).
|
194 |
|
195 |
+
|
196 |
+
| Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
|
197 |
+
| :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | ---------------------------------------------------------------------------------------: |
|
198 |
+
| [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2) | 2B | Read-aloud and conversation | Yes | **6.2% ± 0.2%** | **16.0% ± 0.4%** |
|
199 |
+
| [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2) | 2B | Read-aloud and conversation | No | 7.8% ± 0.2% | 23.0% ± 0.4% |
|
200 |
+
| CoRal-project/roest-wav2vec2-1B-v2 | 1B | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.4% ± 0.4%** |
|
201 |
+
| CoRal-project/roest-wav2vec2-1B-v2 | 1B | Read-aloud and conversation | No | 8.1% ± 0.2% | 23.9% ± 0.4% |
|
202 |
+
| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.3% ± 0.4%** |
|
203 |
+
| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | No | 8.2% ± 0.2% | 25.1% ± 0.4% |
|
204 |
+
| [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | Yes | 6.6% ± 0.2% | 17.0% ± 0.4% |
|
205 |
+
| [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | No | 8.6% ± 0.2% | 26.3% ± 0.5% |
|
206 |
|
207 |
</details>
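To put the language-model comparison in the table above into perspective, the relative WER reduction from adding the LM can be computed directly from the reported numbers. The sketch below copies the WER values from that table; the dictionary layout and the `relative_reduction` helper are illustrative names, not part of the CoRal code base.

```python
# (WER without LM, WER with LM) in %, copied from the table above
wer_scores = {
    "roest-wav2vec2-2B-v2": (23.0, 16.0),
    "roest-wav2vec2-1B-v2": (23.9, 16.4),
    "roest-wav2vec2-315m-v2": (25.1, 16.3),
    "roest-wav2vec2-315m-v1": (26.3, 17.0),
}


def relative_reduction(without_lm, with_lm):
    """Fraction of the no-LM error rate that is removed by adding the LM."""
    return (without_lm - with_lm) / without_lm


for model, (no_lm, lm) in wer_scores.items():
    print(f"{model}: {relative_reduction(no_lm, lm):.1%} relative WER reduction")
```

For all four models the reduction lands between roughly 30% and 35% of the no-LM WER. These figures ignore the reported confidence intervals, so small differences between models should not be over-interpreted.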
|
208 |
|
|
|
210 |
### Performance on Other Datasets
|
211 |
|
212 |
The model was also tested against other datasets to evaluate generalizability:
|
213 |
+
|
214 |
+
| | **Røst-whisper-large-v1** | | **Røst-wav2vec2-315M-v1** | | **Røst-wav2vec2-315M-v2** | | **Røst-wav2vec2-1B-v2** | | **Røst-wav2vec2-2B-v2** | |
|
215 |
+
| ------------------------------------------------------------------------------------- | ------------------------- | --------- | ------------------------- | --------- | ------------------------- | --------- | ----------------------- | --------- | ----------------------- | --------- |
|
216 |
+
| **Evaluation Dataset** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** |
|
217 |
+
| [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) | **10.4** | **4.3** | 17.0 | 6.6 | 16.3 | 6.5 | 16.4 | 6.5 | 16.0 | 6.2 |
|
218 |
+
| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.8 | 14.5 | 29.7 | 13.9 | 28.4 | 12.4 | 27.7 | 11.9 | **27.0** | **11.7** |
|
219 |
+
| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6 | 8.2 | 16.7 | 6.6 | 14.4 | 5.4 | 26.3 | 10.9 | **12.0** | **4.5** |
|
220 |
+
| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | 12.6 | **5.1** | 16.6 | 6.3 | 15.6 | 6.1 | 13.7 | 5.5 | **12.5** | **5.1** |
|
221 |
+
| [AlvenirOss](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 9.2 | 3.9 | 14.8 | 6.0 | 11.3 | 4.4 | 9.1 | 3.6 | **8.1** | **3.1** |
|
222 |
+
| [AlvenirWiki](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 7.5 | 2.8 | 7.9 | 3.0 | 8.0 | 3.0 | 7.2 | 2.7 | **6.5** | **2.4** |
|
223 |
+
|
224 |
|
225 |
**Note:** The vocabulary used for training includes numerals (0, 1, 2, ..., 9), which are converted to text in a post-processing step. If the model misses the spaces between numerals, adjacent digits are read as a single number. This especially affects the NST score, as that dataset contains many numerals.
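The effect of a missed space on numerals can be illustrated with a toy conversion step. This is a hypothetical sketch: the lookup table and the `numerals_to_text` function are made up for illustration, and the actual post-processing used for the Røst models is not shown in this README.

```python
import re

# Hypothetical, deliberately tiny numeral-to-word lookup (illustration only)
DANISH_NUMBER_WORDS = {"2": "to", "5": "fem", "25": "femogtyve"}


def numerals_to_text(transcript):
    """Replace each unbroken run of digits with its word form, if known."""
    return re.sub(
        r"\d+",
        lambda match: DANISH_NUMBER_WORDS.get(match.group(), match.group()),
        transcript,
    )


print(numerals_to_text("nummer 2 5"))  # space kept: digits are converted one by one
print(numerals_to_text("nummer 25"))   # space missed: digits merge into one number
```

The first call yields `nummer to fem` and the second `nummer femogtyve`, so a single missing space turns two correct words into one wrong word.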
|
226 |
|
images/cer.png
DELETED
Binary file (66.5 kB)
|
|
images/cer_comparison-conv.png
ADDED
images/cer_comparison-read-aloud.png
ADDED
images/comparison-conversation-cer.png
DELETED
Binary file (55.7 kB)
|
|
images/comparison-conversation-wer.png
DELETED
Binary file (57.2 kB)
|
|
images/comparison-read_aloud-cer.png
DELETED
Binary file (76.3 kB)
|
|
images/comparison-read_aloud-wer.png
DELETED
Binary file (69.1 kB)
|
|
images/wer.png
DELETED
Binary file (66.4 kB)
|
|
images/wer_comparison-conv.png
ADDED
images/wer_comparison-read-aloud.png
ADDED