Commit 578377c
Parent(s): 05774d3
Plots and tables updated to include 2B model
- README.md +50 -45
- images/cer.png +0 -0
- images/cer_comparison-conv.png +0 -0
- images/cer_comparison-read-aloud.png +0 -0
- images/comparison-conversation-cer.png +0 -0
- images/comparison-conversation-wer.png +0 -0
- images/comparison-read_aloud-cer.png +0 -0
- images/comparison-read_aloud-wer.png +0 -0
- images/wer.png +0 -0
- images/wer_comparison-conv.png +0 -0
- images/wer_comparison-read-aloud.png +0 -0
README.md
CHANGED
@@ -189,7 +189,7 @@ Note that the high generalization error on conversation data for models trained
|
|
189 |
|
190 |
| Model | Number of parameters | Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
|
191 |
| :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
|
192 |
-
| [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2) | 2B | Read-aloud and conversation | **23.6%** | **34.3
|
193 |
| [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | 23.9% | 36.7% |
|
194 |
| CoRal-project/roest-wav2vec2-315M-v2 (This model) | 315M | Read-aloud and conversation | 24.2% | 37.7% |
|
195 |
| [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | 138% | 121% |
|
@@ -198,9 +198,9 @@ Note that the high generalization error on conversation data for models trained
|
|
198 |
| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 46.4% | 57.4% |
|
199 |
|
200 |
|
201 |
-
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/
|
202 |
|
203 |
-
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/
|
204 |
|
205 |
|
206 |
### Read-aloud CoRal Performance
|
@@ -221,9 +221,9 @@ Note that the high generalization error on conversation data for models trained
|
|
221 |
**OBS!** Benchmark for hviske-v2 has been re-evaluated and the confidence interval is larger than reported in the model card.
|
222 |
|
223 |
|
224 |
-
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/
|
225 |
|
226 |
-
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/
|
227 |
|
228 |
|
229 |
<details>
|
@@ -231,24 +231,26 @@ Note that the high generalization error on conversation data for models trained
|
|
231 |
<b>Detailed CER scores in % of evaluation across demographics on the CoRal test data</b>
|
232 |
</summary>
|
233 |
|
234 |
-
|
235 |
-
|
236 |
-
|
237 |
-
|
238 |
-
|
239 |
-
|
240 |
-
|
241 |
-
|
242 |
-
|
243 |
-
|
244 |
-
|
245 |
-
|
246 |
-
|
|
247 |
-
|
248 |
-
|
249 |
-
|
250 |
-
|
|
251 |
-
|
|
|
|
|
252 |
|
253 |
</details>
|
254 |
|
@@ -257,24 +259,25 @@ Note that the high generalization error on conversation data for models trained
|
|
257 |
<b>Detailed WER scores in % of evaluation across demographics on the CoRal test data</b>
|
258 |
</summary>
|
259 |
|
260 |
-
|
|
261 |
-
|
262 |
-
|
263 |
-
|
264 |
-
|
265 |
-
|
266 |
-
|
267 |
-
|
268 |
-
|
269 |
-
|
270 |
-
|
271 |
-
|
272 |
-
|
273 |
-
|
274 |
-
|
275 |
-
|
276 |
-
|
277 |
-
|
|
|
278 |
|
279 |
</details>
|
280 |
|
@@ -289,6 +292,8 @@ Note that the high generalization error on conversation data for models trained
|
|
289 |
|
290 |
| Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
|
291 |
| :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | ---------------------------------------------------------------------------------------: |
|
|
|
|
|
292 |
| [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.4% ± 0.4%** |
|
293 |
| [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | No | 8.1% ± 0.2% | 23.9% ± 0.4% |
|
294 |
| CoRal-project/roest-wav2vec2-315M-v2 (This model) | 315M | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.3% ± 0.4%** |
|
@@ -347,8 +352,8 @@ Comparison of results on different Danish benchmarks:
|
|
347 |
| [CoRal-v2-conv](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main) | 37.7 | 24.2 | 64.7 | 44.6 | 40.0 | 25.3 |
|
348 |
| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 28.4 | 12.4 | 34.9 | 12.6 | 28.5 | 12.3 |
|
349 |
| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 14.4 | 5.4 | 24.1 | 9.1 | 15.1 | 6.0 |
|
350 |
-
| [
|
351 |
-
| [
|
352 |
| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | 15.6 | 6.1 | 23.6 | 8.8 | 16.5 | 6.4 |
|
353 |
|
354 |
</details>
|
@@ -364,8 +369,8 @@ The model was also tested against other datasets to evaluate generalizability:
|
|
364 |
| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.8 | 14.5 | 29.7 | 13.9 | 28.4 | 12.4 | 27.7 | 11.9 | **27.0** | **11.7** |
|
365 |
| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6 | 8.2 | 16.7 | 6.6 | 14.4 | 5.4 | 26.3 | 10.9 | **12.0** | **4.5** |
|
366 |
| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | 12.6 | **5.1** | 16.6 | 6.3 | 15.6 | 6.1 | 13.7 | 5.5 | **12.5** | **5.1** |
|
367 |
-
| [
|
368 |
-
| [
|
369 |
|
370 |
**OBS!** The vocabulary used for training includes numerals (0, 1, 2, ..., 9), which are converted to text in a post-processing step. If the model misses a space between numerals, they are interpreted as a single number, which especially affects the NST score because this dataset contains many numerals.
|
371 |
|
|
|
189 |
|
190 |
| Model | Number of parameters | Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
|
191 |
| :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
|
192 |
+
| [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2) | 2B | Read-aloud and conversation | **23.6%** | **34.3%** |
|
193 |
| [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | 23.9% | 36.7% |
|
194 |
| CoRal-project/roest-wav2vec2-315M-v2 (This model) | 315M | Read-aloud and conversation | 24.2% | 37.7% |
|
195 |
| [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | 138% | 121% |
|
|
|
198 |
| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 46.4% | 57.4% |
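
The CER and WER columns above include values over 100% (138% and 121% for roest-whisper-large-v1). That is possible because both metrics divide the number of edits (substitutions, deletions, and insertions) by the length of the reference, so a hypothesis with many inserted tokens can need more edits than the reference has tokens. The following is a minimal sketch of the standard definitions, not the evaluation code used for this model card; the function names and the example strings are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, deletions and insertions
    needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Four inserted words against a two-word reference give a WER of 200%.
print(wer("ja tak", "ja ja tak tak nej tak"))  # 2.0
```

A model that keeps generating text where the reference is short, or that repeats itself, can therefore score above 100% even though every reference word appears in its output.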
|
199 |
|
200 |
|
201 |
+
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/cer_comparison-conv.png">
|
202 |
|
203 |
+
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/wer_comparison-conv.png">
|
204 |
|
205 |
|
206 |
### Read-aloud CoRal Performance
|
|
|
221 |
**OBS!** Benchmark for hviske-v2 has been re-evaluated and the confidence interval is larger than reported in the model card.
|
222 |
|
223 |
|
224 |
+
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/cer_comparison-read-aloud.png">
|
225 |
|
226 |
+
<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/wer_comparison-read-aloud.png">
|
227 |
|
228 |
|
229 |
<details>
|
|
|
231 |
<b>Detailed CER scores in % of evaluation across demographics on the CoRal test data</b>
|
232 |
</summary>
|
233 |
|
234 |
+
|
235 |
+
| Category | whisper-large-v3 | hviske-v2 | røst-whisper-large-v1 | røst-wav2vec2-315m-v1 | røst-wav2vec2-315m-v2 | røst-wav2vec2-1B-v2 | røst-wav2vec2-2B-v2 |
|
236 |
+
| :---------: | :--------------: | :-------: | :-------------------: | :-------------------: | :-------------------: | :-----------------: | :-----------------: |
|
237 |
+
| female | 12.3 | 5.4 | 5.1 | 7.4 | 7.2 | 7.3 | 7.2 |
|
238 |
+
| male | 10.6 | 4.1 | 3.6 | 5.8 | 5.7 | 5.8 | 5.3 |
|
239 |
+
| 0-25 | 9.1 | 3.8 | 3.4 | 5.4 | 5.3 | 5.1 | 4.7 |
|
240 |
+
| 25-50 | 11.4 | 4.7 | 4.0 | 6.2 | 6.0 | 5.7 | 5.3 |
|
241 |
+
| 50+ | 12.4 | 5.2 | 5.0 | 7.5 | 7.4 | 7.8 | 7.7 |
|
242 |
+
| Bornholmsk | 12.1 | 3.8 | 3.8 | 6.8 | 6.1 | 6.2 | 5.7 |
|
243 |
+
| Fynsk | 12.0 | 5.9 | 5.1 | 7.4 | 7.2 | 6.9 | 6.1 |
|
244 |
+
| Københavnsk | 5.6 | 2.1 | 1.9 | 3.3 | 3.2 | 3.0 | 2.6 |
|
245 |
+
| Non-native | 17.4 | 5.9 | 4.8 | 7.8 | 7.5 | 7.3 | 6.6 |
|
246 |
+
| Nordjysk | 4.7 | 1.5 | 1.6 | 2.6 | 2.8 | 2.6 | 2.3 |
|
247 |
+
| Sjællandsk | 8.0 | 3.3 | 3.0 | 4.4 | 4.5 | 3.9 | 3.8 |
|
248 |
+
| Sydømål | 7.7 | 4.3 | 4.1 | 6.4 | 6.4 | 6.5 | 5.8 |
|
249 |
+
| Sønderjysk | 20.0 | 9.4 | 8.8 | 11.9 | 11.6 | 12.6 | 13.3 |
|
250 |
+
| Vestjysk | 17.6 | 7.2 | 6.4 | 10.1 | 9.8 | 10.5 | 10.8 |
|
251 |
+
| Østjysk | 5.9 | 2.9 | 2.6 | 4.0 | 4.1 | 3.8 | 3.5 |
|
252 |
+
| Overall | 11.4 | 4.7 | 4.3 | 6.6 | 6.5 | 6.5 | 6.2 |
|
253 |
+
|
254 |
|
255 |
</details>
|
256 |
|
|
|
259 |
<b>Detailed WER scores in % of evaluation across demographics on the CoRal test data</b>
|
260 |
</summary>
|
261 |
|
262 |
+
| Category | whisper-large-v3 | hviske-v2 | røst-whisper-large-v1 | røst-wav2vec2-315m-v1 | røst-wav2vec2-315m-v2 | røst-wav2vec2-1B-v2 | røst-wav2vec2-2B-v2 |
|
263 |
+
| :---------: | :--------------: | :-------: | :-------------------: | :-------------------: | :-------------------: | :-----------------: | :-----------------: |
|
264 |
+
| female | 30.2 | 12.7 | 11.5 | 18.5 | 17.7 | 17.8 | 17.8 |
|
265 |
+
| male | 26.5 | 10.9 | 9.4 | 15.5 | 14.9 | 15.0 | 14.3 |
|
266 |
+
| 0-25 | 24.1 | 10.3 | 9.0 | 14.7 | 14.0 | 13.7 | 12.9 |
|
267 |
+
| 25-50 | 28.4 | 12.2 | 10.1 | 16.6 | 15.8 | 15.3 | 14.5 |
|
268 |
+
| 50+ | 30.0 | 12.1 | 11.3 | 18.2 | 17.7 | 18.5 | 18.7 |
|
269 |
+
| Bornholmsk | 31.6 | 10.4 | 9.8 | 17.7 | 15.7 | 16.4 | 15.3 |
|
270 |
+
| Fynsk | 29.3 | 14.3 | 12.1 | 18.3 | 17.7 | 16.7 | 15.2 |
|
271 |
+
| Københavnsk | 16.8 | 6.7 | 5.9 | 10.2 | 10.0 | 9.5 | 8.4 |
|
272 |
+
| Non-native | 40.9 | 15.4 | 12.2 | 20.9 | 19.4 | 19.4 | 18.1 |
|
273 |
+
| Nordjysk | 13.5 | 4.3 | 4.5 | 7.7 | 7.5 | 7.3 | 6.9 |
|
274 |
+
| Sjællandsk | 21.7 | 8.9 | 7.6 | 12.6 | 12.7 | 11.0 | 10.5 |
|
275 |
+
| Sydømål | 19.2 | 10.4 | 10.0 | 14.9 | 15.3 | 14.4 | 13.7 |
|
276 |
+
| Sønderjysk | 44.3 | 19.0 | 17.5 | 26.0 | 25.4 | 27.8 | 29.6 |
|
277 |
+
| Vestjysk | 42.0 | 17.7 | 15.0 | 26.3 | 25.2 | 26.7 | 28.3 |
|
278 |
+
| Østjysk | 16.9 | 8.2 | 7.5 | 11.7 | 11.3 | 10.8 | 10.1 |
|
279 |
+
| Overall | 28.3 | 11.8 | 10.4 | 17.0 | 16.3 | 16.4 | 16.0 |
|
280 |
+
|
281 |
|
282 |
</details>
|
283 |
|
|
|
292 |
|
293 |
| Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
|
294 |
| :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | ---------------------------------------------------------------------------------------: |
|
295 |
+
| [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2) | 2B | Read-aloud and conversation | Yes | **6.2% ± 0.2%** | **16.0% ± 0.4%** |
|
296 |
+
| [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2) | 2B | Read-aloud and conversation | No | 7.8% ± 0.2% | 23.0% ± 0.4% |
|
297 |
| [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.4% ± 0.4%** |
|
298 |
| [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | No | 8.1% ± 0.2% | 23.9% ± 0.4% |
|
299 |
| CoRal-project/roest-wav2vec2-315M-v2 (This model) | 315M | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.3% ± 0.4%** |
|
|
|
352 |
| [CoRal-v2-conv](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main) | 37.7 | 24.2 | 64.7 | 44.6 | 40.0 | 25.3 |
|
353 |
| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 28.4 | 12.4 | 34.9 | 12.6 | 28.5 | 12.3 |
|
354 |
| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 14.4 | 5.4 | 24.1 | 9.1 | 15.1 | 6.0 |
|
355 |
+
| [AlvenirOss](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 11.3 | 4.4 | 20.8 | 8.2 | 11.4 | 4.5 |
|
356 |
+
| [AlvenirWiki](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 8.0 | 3.0 | 13.5 | 4.5 | 8.2 | 3.0 |
|
357 |
| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | 15.6 | 6.1 | 23.6 | 8.8 | 16.5 | 6.4 |
|
358 |
|
359 |
</details>
|
|
|
369 |
| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.8 | 14.5 | 29.7 | 13.9 | 28.4 | 12.4 | 27.7 | 11.9 | **27.0** | **11.7** |
|
370 |
| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6 | 8.2 | 16.7 | 6.6 | 14.4 | 5.4 | 26.3 | 10.9 | **12.0** | **4.5** |
|
371 |
| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | 12.6 | **5.1** | 16.6 | 6.3 | 15.6 | 6.1 | 13.7 | 5.5 | **12.5** | **5.1** |
|
372 |
+
| [AlvenirOss](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 9.2 | 3.9 | 14.8 | 6.0 | 11.3 | 4.4 | 9.1 | 3.6 | **8.1** | **3.1** |
|
373 |
+
| [AlvenirWiki](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 7.5 | 2.8 | 7.9 | 3.0 | 8.0 | 3.0 | 7.2 | 2.7 | **6.5** | **2.4** |
|
374 |
|
375 |
**OBS!** The vocabulary used for training includes numerals (0, 1, 2, ..., 9), which are converted to text in a post-processing step. If the model misses a space between numerals, they are interpreted as a single number, which especially affects the NST score because this dataset contains many numerals.
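
The effect described in the note above can be illustrated with a small sketch. It assumes the post-processing step replaces each maximal run of digits with its word form; the actual post-processing code is not shown here, and `NUMBER_WORDS` and `numerals_to_text` are hypothetical names with a deliberately tiny lookup table.

```python
import re

# Hypothetical lookup table for illustration only; a real converter would
# cover arbitrary numbers.
NUMBER_WORDS = {"1": "en", "2": "to", "12": "tolv"}


def numerals_to_text(transcript):
    """Replace each maximal run of digits with its word form, if known."""
    return re.sub(
        r"\d+",
        lambda m: NUMBER_WORDS.get(m.group(), m.group()),
        transcript,
    )


# With the space, the two digits are converted separately.
print(numerals_to_text("klokken 1 2"))  # klokken en to

# Without the space, the digits are read as a single number.
print(numerals_to_text("klokken 12"))   # klokken tolv
```

In this sketch a single missing space changes two correct words into one wrong word, so a one-character error in the raw output turns into several word-level errors after conversion.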
|
376 |
|
images/cer.png
DELETED
Binary file (66.5 kB)
|
|
images/cer_comparison-conv.png
ADDED
images/cer_comparison-read-aloud.png
ADDED
images/comparison-conversation-cer.png
DELETED
Binary file (55.7 kB)
|
|
images/comparison-conversation-wer.png
DELETED
Binary file (57.2 kB)
|
|
images/comparison-read_aloud-cer.png
DELETED
Binary file (76.3 kB)
|
|
images/comparison-read_aloud-wer.png
DELETED
Binary file (69.1 kB)
|
|
images/wer.png
DELETED
Binary file (66.4 kB)
|
|
images/wer_comparison-conv.png
ADDED
images/wer_comparison-read-aloud.png
ADDED