MarieAlvenir committed
Commit 578377c · 1 Parent(s): 05774d3

Plots and tables updated to include 2B model

README.md CHANGED
@@ -189,7 +189,7 @@ Note that the high generalization error on conversation data for models trained
189
 
190
  | Model | Number of parameters | Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
191
  | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
192
- | [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2) | 2B | Read-aloud and conversation | **23.6%** | **34.3** |
193
  | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | 23.9% | 36.7% |
194
  | CoRal-project/roest-wav2vec2-315M-v2 (This model) | 315M | Read-aloud and conversation | 24.2% | 37.7% |
195
  | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | 138% | 121% |
@@ -198,9 +198,9 @@ Note that the high generalization error on conversation data for models trained
198
  | [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 46.4% | 57.4% |
199
 
200
 
201
- <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/comparison-conversation-cer.png">
202
 
203
- <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/comparison-conversation-wer.png">
204
 
205
 
206
  ### Read-aloud CoRal Performance
@@ -221,9 +221,9 @@ Note that the high generalization error on conversation data for models trained
221
  **Note:** The benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than the one reported in its model card.
222
 
223
 
224
- <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/comparison-read_aloud-cer.png">
225
 
226
- <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/comparison-read_aloud-wer.png">
227
 
228
 
229
  <details>
@@ -231,24 +231,26 @@ Note that the high generalization error on conversation data for models trained
231
  <b>Detailed CER scores in % of evaluation across demographics on the CoRal test data</b>
232
  </summary>
233
 
234
- | Category | Røst-whisper-large-v1 | Røst-wav2vec2-315m-v1 | Røst-wav2vec2-315m-v2 | Røst-wav2vec2-1B-v2 |
235
- |:---:|:---:|:---:|:---:|:---:|
236
- | female | 5.1 | 7.4 | 7.2 | 7.3 |
237
- | male | 3.6 | 5.8 | 5.7 | 5.8 |
238
- | 0-25 | 3.4 | 5.4 | 5.3 | 5.1 |
239
- | 25-50 | 4.0 | 6.2 | 6.0 | 5.7 |
240
- | 50+ | 5.0 | 7.5 | 7.4 | 7.8 |
241
- | Bornholmsk | 3.8 | 6.8 | 6.1 | 6.2 |
242
- | Fynsk | 5.1 | 7.4 | 7.2 | 6.9 |
243
- | Københavnsk | 1.9 | 3.3 | 3.2 | 3.0 |
244
- | Non-native | 4.8 | 7.8 | 7.5 | 7.3 |
245
- | Nordjysk | 1.6 | 2.6 | 2.8 | 2.6 |
246
- | Sjællandsk | 3.0 | 4.4 | 4.5 | 3.9 |
247
- | Sydømål | 4.1 | 6.4 | 6.4 | 6.5 |
248
- | Sønderjysk | 8.8 | 11.9 | 11.6 | 12.6 |
249
- | Vestjysk | 6.4 | 10.1 | 9.8 | 10.5 |
250
- | Østjysk | 2.6 | 4.0 | 4.1 | 3.8 |
251
- | Overall | 4.3 | 6.6 | 6.5 | 6.5 |
 
 
252
 
253
  </details>
254
 
@@ -257,24 +259,25 @@ Note that the high generalization error on conversation data for models trained
257
  <b>Detailed WER scores in % of evaluation across demographics on the CoRal test data</b>
258
  </summary>
259
 
260
- | Category | Røst-whisper-large-v1 | Røst-wav2vec2-315m-v1 | Røst-wav2vec2-315m-v2 | Røst-wav2vec2-1B-v2 |
261
- |:---:|:---:|:---:|:---:|:---:|
262
- | female | 11.5 | 18.5 | 17.7 | 17.8 |
263
- | male | 9.4 | 15.5 | 14.9 | 15.0 |
264
- | 0-25 | 9.0 | 14.7 | 14.0 | 13.7 |
265
- | 25-50 | 10.1 | 16.6 | 15.8 | 15.3 |
266
- | 50+ | 11.3 | 18.2 | 17.7 | 18.5 |
267
- | Bornholmsk | 9.8 | 17.7 | 15.7 | 16.4 |
268
- | Fynsk | 12.1 | 18.3 | 17.7 | 16.7 |
269
- | Københavnsk | 5.9 | 10.2 | 10.0 | 9.5 |
270
- | Non-native | 12.2 | 20.9 | 19.4 | 19.4 |
271
- | Nordjysk | 4.5 | 7.7 | 7.5 | 7.3 |
272
- | Sjællandsk | 7.6 | 12.6 | 12.7 | 11.0 |
273
- | Sydømål | 10.0 | 14.9 | 15.3 | 14.4 |
274
- | Sønderjysk | 17.5 | 26.0 | 25.4 | 27.8 |
275
- | Vestjysk | 15.0 | 26.3 | 25.2 | 26.7 |
276
- | Østjysk | 7.5 | 11.7 | 11.3 | 10.8 |
277
- | Overall | 10.4 | 17.0 | 16.3 | 16.4 |
 
278
 
279
  </details>
280
 
@@ -289,6 +292,8 @@ Note that the high generalization error on conversation data for models trained
289
 
290
  | Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.com/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
291
  | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | ---------------------------------------------------------------------------------------: |
 
 
292
  | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.4% ± 0.4%** |
293
  | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | No | 8.1% ± 0.2% | 23.9% ± 0.4% |
294
  | CoRal-project/roest-wav2vec2-315M-v2 (This model) | 315M | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.3% ± 0.4%** |
@@ -347,8 +352,8 @@ Comparison of results on different Danish benchmarks:
347
  | [CoRal-v2-conv](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main) | 37.7 | 24.2 | 64.7 | 44.6 | 40.0 | 25.3 |
348
  | [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 28.4 | 12.4 | 34.9 | 12.6 | 28.5 | 12.3 |
349
  | [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 14.4 | 5.4 | 24.1 | 9.1 | 15.1 | 6.0 |
350
- | [AppenOss](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 11.3 | 4.4 | 20.8 | 8.2 | 11.4 | 4.5 |
351
- | [AppenWiki](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 8.0 | 3.0 | 13.5 | 4.5 | 8.2 | 3.0 |
352
  | [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | 15.6 | 6.1 | 23.6 | 8.8 | 16.5 | 6.4 |
353
 
354
  </details>
@@ -364,8 +369,8 @@ The model was also tested against other datasets to evaluate generalizability:
364
  | [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.8 | 14.5 | 29.7 | 13.9 | 28.4 | 12.4 | 27.7 | 11.9 | **27.0** | **11.7** |
365
  | [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6 | 8.2 | 16.7 | 6.6 | 14.4 | 5.4 | 26.3 | 10.9 | **12.0** | **4.5** |
366
  | [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | 12.6 | **5.1** | 16.6 | 6.3 | 15.6 | 6.1 | 13.7 | 5.5 | **12.5** | **5.1** |
367
- | [AppenOss](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 9.2 | 3.9 | 14.8 | 6.0 | 11.3 | 4.4 | 9.1 | 3.6 | **8.1** | **3.1** |
368
- | [AppenWiki](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 7.5 | 2.8 | 7.9 | 3.0 | 8.0 | 3.0 | 7.2 | 2.7 | **6.5** | **2.4** |
369
 
370
  **Note:** The vocabulary used for training includes numerals (0, 1, 2, ..., 9), which are converted to text in a post-processing step. If the model misses the space between two spoken numbers, the adjacent digits are read as a single number. This especially affects the NST score, as that dataset contains many numerals.
371
 
 
189
 
190
  | Model | Number of parameters | Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
191
  | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
192
+ | [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2) | 2B | Read-aloud and conversation | **23.6%** | **34.3%** |
193
  | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | 23.9% | 36.7% |
194
  | CoRal-project/roest-wav2vec2-315M-v2 (This model) | 315M | Read-aloud and conversation | 24.2% | 37.7% |
195
  | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | 138% | 121% |
 
198
  | [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 46.4% | 57.4% |
199
 
200
 
201
+ <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/cer_comparison-conv.png">
202
 
203
+ <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/wer_comparison-conv.png">
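
The CER and WER columns above are edit-distance error rates, which is why a value such as the 138% CER reported for roest-whisper-large-v1 is possible: inserted tokens count as errors but do not lengthen the reference. The sketch below is a minimal illustration, assuming the standard Levenshtein-based definitions. The function names are hypothetical, and the project's actual evaluation code is not part of this diff.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Two inserted words against a one-word reference give a WER of 200%.
print(wer("ja", "ja ja ja"))  # 2.0
```

Multiplying the returned fraction by 100 gives the percentages used in the tables.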
204
 
205
 
206
  ### Read-aloud CoRal Performance
 
221
  **Note:** The benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than the one reported in its model card.
222
 
223
 
224
+ <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/cer_comparison-read-aloud.png">
225
 
226
+ <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/wer_comparison-read-aloud.png">
227
 
228
 
229
  <details>
 
231
  <b>Detailed CER scores in % of evaluation across demographics on the CoRal test data</b>
232
  </summary>
233
 
234
+
235
+ | Category | whisper-large-v3 | hviske-v2 | røst-whisper-large-v1 | røst-wav2vec2-315m-v1 | røst-wav2vec2-315m-v2 | røst-wav2vec2-1B-v2 | røst-wav2vec2-2B-v2 |
236
+ | :---------: | :--------------: | :-------: | :-------------------: | :-------------------: | :-------------------: | :-----------------: | :-----------------: |
237
+ | female | 12.3 | 5.4 | 5.1 | 7.4 | 7.2 | 7.3 | 7.2 |
238
+ | male | 10.6 | 4.1 | 3.6 | 5.8 | 5.7 | 5.8 | 5.3 |
239
+ | 0-25 | 9.1 | 3.8 | 3.4 | 5.4 | 5.3 | 5.1 | 4.7 |
240
+ | 25-50 | 11.4 | 4.7 | 4.0 | 6.2 | 6.0 | 5.7 | 5.3 |
241
+ | 50+ | 12.4 | 5.2 | 5.0 | 7.5 | 7.4 | 7.8 | 7.7 |
242
+ | Bornholmsk | 12.1 | 3.8 | 3.8 | 6.8 | 6.1 | 6.2 | 5.7 |
243
+ | Fynsk | 12.0 | 5.9 | 5.1 | 7.4 | 7.2 | 6.9 | 6.1 |
244
+ | Københavnsk | 5.6 | 2.1 | 1.9 | 3.3 | 3.2 | 3.0 | 2.6 |
245
+ | Non-native | 17.4 | 5.9 | 4.8 | 7.8 | 7.5 | 7.3 | 6.6 |
246
+ | Nordjysk | 4.7 | 1.5 | 1.6 | 2.6 | 2.8 | 2.6 | 2.3 |
247
+ | Sjællandsk | 8.0 | 3.3 | 3.0 | 4.4 | 4.5 | 3.9 | 3.8 |
248
+ | Sydømål | 7.7 | 4.3 | 4.1 | 6.4 | 6.4 | 6.5 | 5.8 |
249
+ | Sønderjysk | 20.0 | 9.4 | 8.8 | 11.9 | 11.6 | 12.6 | 13.3 |
250
+ | Vestjysk | 17.6 | 7.2 | 6.4 | 10.1 | 9.8 | 10.5 | 10.8 |
251
+ | Østjysk | 5.9 | 2.9 | 2.6 | 4.0 | 4.1 | 3.8 | 3.5 |
252
+ | Overall | 11.4 | 4.7 | 4.3 | 6.6 | 6.5 | 6.5 | 6.2 |
253
+
254
 
255
  </details>
256
 
 
259
  <b>Detailed WER scores in % of evaluation across demographics on the CoRal test data</b>
260
  </summary>
261
 
262
+ | Category | whisper-large-v3 | hviske-v2 | røst-whisper-large-v1 | røst-wav2vec2-315m-v1 | røst-wav2vec2-315m-v2 | røst-wav2vec2-1B-v2 | røst-wav2vec2-2B-v2 |
263
+ | :---------: | :--------------: | :-------: | :-------------------: | :-------------------: | :-------------------: | :-----------------: | :-----------------: |
264
+ | female | 30.2 | 12.7 | 11.5 | 18.5 | 17.7 | 17.8 | 17.8 |
265
+ | male | 26.5 | 10.9 | 9.4 | 15.5 | 14.9 | 15.0 | 14.3 |
266
+ | 0-25 | 24.1 | 10.3 | 9.0 | 14.7 | 14.0 | 13.7 | 12.9 |
267
+ | 25-50 | 28.4 | 12.2 | 10.1 | 16.6 | 15.8 | 15.3 | 14.5 |
268
+ | 50+ | 30.0 | 12.1 | 11.3 | 18.2 | 17.7 | 18.5 | 18.7 |
269
+ | Bornholmsk | 31.6 | 10.4 | 9.8 | 17.7 | 15.7 | 16.4 | 15.3 |
270
+ | Fynsk | 29.3 | 14.3 | 12.1 | 18.3 | 17.7 | 16.7 | 15.2 |
271
+ | Københavnsk | 16.8 | 6.7 | 5.9 | 10.2 | 10.0 | 9.5 | 8.4 |
272
+ | Non-native | 40.9 | 15.4 | 12.2 | 20.9 | 19.4 | 19.4 | 18.1 |
273
+ | Nordjysk | 13.5 | 4.3 | 4.5 | 7.7 | 7.5 | 7.3 | 6.9 |
274
+ | Sjællandsk | 21.7 | 8.9 | 7.6 | 12.6 | 12.7 | 11.0 | 10.5 |
275
+ | Sydømål | 19.2 | 10.4 | 10.0 | 14.9 | 15.3 | 14.4 | 13.7 |
276
+ | Sønderjysk | 44.3 | 19.0 | 17.5 | 26.0 | 25.4 | 27.8 | 29.6 |
277
+ | Vestjysk | 42.0 | 17.7 | 15.0 | 26.3 | 25.2 | 26.7 | 28.3 |
278
+ | Østjysk | 16.9 | 8.2 | 7.5 | 11.7 | 11.3 | 10.8 | 10.1 |
279
+ | Overall | 28.3 | 11.8 | 10.4 | 17.0 | 16.3 | 16.4 | 16.0 |
280
+
281
 
282
  </details>
283
 
 
292
 
293
  | Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.com/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
294
  | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | ---------------------------------------------------------------------------------------: |
295
+ | [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2) | 2B | Read-aloud and conversation | Yes | **6.2% ± 0.2%** | **16.0% ± 0.4%** |
296
+ | [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2) | 2B | Read-aloud and conversation | No | 7.8% ± 0.2% | 23.0% ± 0.4% |
297
  | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.4% ± 0.4%** |
298
  | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | No | 8.1% ± 0.2% | 23.9% ± 0.4% |
299
  | CoRal-project/roest-wav2vec2-315M-v2 (This model) | 315M | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.3% ± 0.4%** |
 
352
  | [CoRal-v2-conv](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main) | 37.7 | 24.2 | 64.7 | 44.6 | 40.0 | 25.3 |
353
  | [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 28.4 | 12.4 | 34.9 | 12.6 | 28.5 | 12.3 |
354
  | [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 14.4 | 5.4 | 24.1 | 9.1 | 15.1 | 6.0 |
355
+ | [AlvenirOss](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 11.3 | 4.4 | 20.8 | 8.2 | 11.4 | 4.5 |
356
+ | [AlvenirWiki](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 8.0 | 3.0 | 13.5 | 4.5 | 8.2 | 3.0 |
357
  | [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | 15.6 | 6.1 | 23.6 | 8.8 | 16.5 | 6.4 |
358
 
359
  </details>
 
369
  | [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.8 | 14.5 | 29.7 | 13.9 | 28.4 | 12.4 | 27.7 | 11.9 | **27.0** | **11.7** |
370
  | [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6 | 8.2 | 16.7 | 6.6 | 14.4 | 5.4 | 26.3 | 10.9 | **12.0** | **4.5** |
371
  | [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | 12.6 | **5.1** | 16.6 | 6.3 | 15.6 | 6.1 | 13.7 | 5.5 | **12.5** | **5.1** |
372
+ | [AlvenirOss](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 9.2 | 3.9 | 14.8 | 6.0 | 11.3 | 4.4 | 9.1 | 3.6 | **8.1** | **3.1** |
373
+ | [AlvenirWiki](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 7.5 | 2.8 | 7.9 | 3.0 | 8.0 | 3.0 | 7.2 | 2.7 | **6.5** | **2.4** |
374
 
375
  **Note:** The vocabulary used for training includes numerals (0, 1, 2, ..., 9), which are converted to text in a post-processing step. If the model misses the space between two spoken numbers, the adjacent digits are read as a single number. This especially affects the NST score, as that dataset contains many numerals.
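
To make the numeral issue concrete, here is a minimal sketch of how a missed space changes the post-processed text. It assumes the post-processing replaces each run of digits with its Danish spelling. The lookup table and the function name are hypothetical and cover only the numbers used in the example; the project's actual conversion step is not shown in this diff.

```python
import re

# Hypothetical lookup covering only the numbers used below.
DANISH_NUMBERS = {"1": "en", "2": "to", "12": "tolv"}


def numerals_to_text(transcript: str) -> str:
    """Replace each run of digits with its Danish spelling from the lookup."""
    return re.sub(
        r"\d+",
        lambda m: DANISH_NUMBERS.get(m.group(), m.group()),
        transcript,
    )


# With the space, the two digits are converted separately.
print(numerals_to_text("kode 1 2"))  # kode en to
# Without the space, they are read as the single number twelve.
print(numerals_to_text("kode 12"))   # kode tolv
```

A single missing space therefore turns two correct words into one wrong word, which raises the word-level error more than the character-level output of the model would suggest.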
376
 
images/cer.png DELETED
Binary file (66.5 kB)
 
images/cer_comparison-conv.png ADDED
images/cer_comparison-read-aloud.png ADDED
images/comparison-conversation-cer.png DELETED
Binary file (55.7 kB)
 
images/comparison-conversation-wer.png DELETED
Binary file (57.2 kB)
 
images/comparison-read_aloud-cer.png DELETED
Binary file (76.3 kB)
 
images/comparison-read_aloud-wer.png DELETED
Binary file (69.1 kB)
 
images/wer.png DELETED
Binary file (66.4 kB)
 
images/wer_comparison-conv.png ADDED
images/wer_comparison-read-aloud.png ADDED