Plots and tables updated to include 2B model

Files changed (11)

README.md +50 -45
images/cer.png +0 -0
images/cer_comparison-conv.png +0 -0
images/cer_comparison-read-aloud.png +0 -0
images/comparison-conversation-cer.png +0 -0
images/comparison-conversation-wer.png +0 -0
images/comparison-read_aloud-cer.png +0 -0
images/comparison-read_aloud-wer.png +0 -0
images/wer.png +0 -0
images/wer_comparison-conv.png +0 -0
images/wer_comparison-read-aloud.png +0 -0

README.md CHANGED

@@ -189,7 +189,7 @@ Note that the high generalization error on conversation data for models trained
 | Model                                                                                               | Number of parameters |   Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
 | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
-| [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2)     |                   2B | Read-aloud and conversation |                                                                                                     **23.6%** |                                                                                                      **34.3** |
 | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2)     |                   1B | Read-aloud and conversation |                                                                                                         23.9% |                                                                                                         36.7% |
 | CoRal-project/roest-wav2vec2-315M-v2 (This model)                                                   |                 315M | Read-aloud and conversation |                                                                                                         24.2% |                                                                                                         37.7% |
 | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) |                1540M |                  Read-aloud |                                                                                                          138% |                                                                                                          121% |
@@ -198,9 +198,9 @@ Note that the high generalization error on conversation data for models trained
 | [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3)                                    |                1540M |                           - |                                                                                                        46.4 % |                                                                                                         57.4% |
-<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/comparison-conversation-cer.png">
-<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/comparison-conversation-wer.png">
 ### Read-aloud CoRal Performance
@@ -221,9 +221,9 @@ Note that the high generalization error on conversation data for models trained
 **OBS!** Benchmark for hviske-v2 has been re-evaluated and the confidence interval is larger than reported in the model card.
-<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/comparison-read_aloud-cer.png">
-<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/comparison-read_aloud-wer.png">
 <details>
@@ -231,24 +231,26 @@ Note that the high generalization error on conversation data for models trained
     <b>Detailed CER scores in % of evaluation across demographics on the CoRal test data</b>
   </summary>
-  | Category | Røst-whisper-large-v1 | Røst-wav2vec2-315m-v1 | Røst-wav2vec2-315m-v2 | Røst-wav2vec2-1B-v2 |
-  |:---:|:---:|:---:|:---:|:---:|
-  | female | 5.1 | 7.4 | 7.2 | 7.3 |
-  | male | 3.6 | 5.8 | 5.7 | 5.8 |
-  | 0-25 | 3.4 | 5.4 | 5.3 | 5.1 |
-  | 25-50 | 4.0 | 6.2 | 6.0 | 5.7 |
-  | 50+ | 5.0 | 7.5 | 7.4 | 7.8 |
-  | Bornholmsk | 3.8 | 6.8 | 6.1 | 6.2 |
-  | Fynsk | 5.1 | 7.4 | 7.2 | 6.9 |
-  | Københavnsk | 1.9 | 3.3 | 3.2 | 3.0 |
-  | Non-native | 4.8 | 7.8 | 7.5 | 7.3 |
-  | Nordjysk | 1.6 | 2.6 | 2.8 | 2.6 |
-  | Sjællandsk | 3.0 | 4.4 | 4.5 | 3.9 |
-  | Sydømål | 4.1 | 6.4 | 6.4 | 6.5 |
-  | Sønderjysk | 8.8 | 11.9 | 11.6 | 12.6 |
-  | Vestjysk | 6.4 | 10.1 | 9.8 | 10.5 |
-  | Østjysk | 2.6 | 4.0 | 4.1 | 3.8 |
-  | Overall | 4.3 | 6.6 | 6.5 | 6.5 |
 </details>
@@ -257,24 +259,25 @@ Note that the high generalization error on conversation data for models trained
     <b>Detailed WER scores in % of evaluation across demographics on the CoRal test data</b>
   </summary>
-  | Category | Røst-whisper-large-v1 | Røst-wav2vec2-315m-v1 | Røst-wav2vec2-315m-v2 | Røst-wav2vec2-1B-v2 |
-  |:---:|:---:|:---:|:---:|:---:|
-  | female | 11.5 | 18.5 | 17.7 | 17.8 |
-  | male | 9.4 | 15.5 | 14.9 | 15.0 |
-  | 0-25 | 9.0 | 14.7 | 14.0 | 13.7 |
-  | 25-50 | 10.1 | 16.6 | 15.8 | 15.3 |
-  | 50+ | 11.3 | 18.2 | 17.7 | 18.5 |
-  | Bornholmsk | 9.8 | 17.7 | 15.7 | 16.4 |
-  | Fynsk | 12.1 | 18.3 | 17.7 | 16.7 |
-  | Københavnsk | 5.9 | 10.2 | 10.0 | 9.5 |
-  | Non-native | 12.2 | 20.9 | 19.4 | 19.4 |
-  | Nordjysk | 4.5 | 7.7 | 7.5 | 7.3 |
-  | Sjællandsk | 7.6 | 12.6 | 12.7 | 11.0 |
-  | Sydømål | 10.0 | 14.9 | 15.3 | 14.4 |
-  | Sønderjysk | 17.5 | 26.0 | 25.4 | 27.8 |
-  | Vestjysk | 15.0 | 26.3 | 25.2 | 26.7 |
-  | Østjysk | 7.5 | 11.7 | 11.3 | 10.8 |
-  | Overall | 10.4 | 17.0 | 16.3 | 16.4 |
 </details>
@@ -289,6 +292,8 @@ Note that the high generalization error on conversation data for models trained
   | Model                                                                                               | Number of parameters |   Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.com/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
   | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | ---------------------------------------------------------------------------------------: |
   | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2)     |                   1B | Read-aloud and conversation |                               Yes |                                                                         **6.5% ± 0.2%** |                                                                         **16.4% ± 0.4%** |
   | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2)     |                   1B | Read-aloud and conversation |                                No |                                                                             8.1% ± 0.2% |                                                                             23.9% ± 0.4% |
   | CoRal-project/roest-wav2vec2-315M-v2 (This model) |                 315M | Read-aloud and conversation |                               Yes |                                                                         **6.5% ± 0.2%** |                                                                         **16.3% ± 0.4%** |
@@ -347,8 +352,8 @@ Comparison of results on different Danish benchmarks:
 | [CoRal-v2-conv](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main)  | 37.7          | 24.2  | 64.7       | 44.6  | 40.0             | 25.3  |
 | [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da)          | 28.4          | 12.4  | 34.9       | 12.6  | 28.5             | 12.3  |
 | [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 14.4          | 5.4   | 24.1       | 9.1   | 15.1             | 6.0   |
-| [AppenOss](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval)    | 11.3          | 4.4   | 20.8       | 8.2   | 11.4             | 4.5   |
-| [AppenWiki](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval)   | 8.0           | 3.0   | 13.5       | 4.5   | 8.2              | 3.0   |
 | [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs)      | 15.6          | 6.1   | 23.6       | 8.8   | 16.5             | 6.4   |
 </details>
@@ -364,8 +369,8 @@ The model was also tested against other datasets to evaluate generalizability:
 | [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da)                        | 29.8                      | 14.5      | 29.7                      | 13.9      | 28.4                      | 12.4      | 27.7                    | 11.9      | **27.0**                | **11.7**  |
 | [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6                      | 8.2       | 16.7                      | 6.6       | 14.4                      | 5.4       | 26.3                    | 10.9      | **12.0**                | **4.5**   |
 | [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs)                         | 12.6                      | **5.1**   | 16.6                      | 6.3       | 15.6                      | 6.1       | 13.7                    | 5.5       | **12.5**                | **5.1**   |
-| [AppenOss](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval)               | 9.2                       | 3.9       | 14.8                      | 6.0       | 11.3                      | 4.4       | 9.1                     | 3.6       | **8.1**                 | **3.1**   |
-| [AppenWiki](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval)              | 7.5                       | 2.8       | 7.9                       | 3.0       | 8.0                       | 3.0       | 7.2                     | 2.7       | **6.5**                 | **2.4**   |
 **OBS!** The vocab used for training includes numerals (0, 1, 2, ..., 9), which are translated to text in a post-processing step. If the model misses the spaces between numerals, adjacent digits are read as a single number, which especially affects the NST score because this dataset contains many numerals.

 | Model                                                                                               | Number of parameters |   Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
 | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
+| [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2)     |                   2B | Read-aloud and conversation |                                                                                                     **23.6%** |                                                                                                      **34.3%** |
 | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2)     |                   1B | Read-aloud and conversation |                                                                                                         23.9% |                                                                                                         36.7% |
 | CoRal-project/roest-wav2vec2-315M-v2 (This model)                                                   |                 315M | Read-aloud and conversation |                                                                                                         24.2% |                                                                                                         37.7% |
 | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) |                1540M |                  Read-aloud |                                                                                                          138% |                                                                                                          121% |
 | [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3)                                    |                1540M |                           - |                                                                                                        46.4 % |                                                                                                         57.4% |
+<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/cer_comparison-conv.png">
+<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/wer_comparison-conv.png">
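
The evaluation code behind these tables is not part of this change. As a minimal sketch, assuming the standard Levenshtein-based definitions of WER and CER, the snippet below shows how the rates are computed and why a rate above 100% (such as the 138% CER in the table) can occur when the hypothesis contains many inserted tokens. The function names `edit_distance`, `wer`, and `cer` are illustrative and not taken from the repository.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Four inserted words against a three-word reference give a WER above 1.0.
print(wer("hej med dig", "hej hej med med dig dig dig"))
```

Because the denominator is the reference length only, a hypothesis that repeats or hallucinates text is not capped at 100%.
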
 ### Read-aloud CoRal Performance
 **OBS!** Benchmark for hviske-v2 has been re-evaluated and the confidence interval is larger than reported in the model card.
+<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/cer_comparison-read-aloud.png">
+<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/wer_comparison-read-aloud.png">
 <details>
     <b>Detailed CER scores in % of evaluation across demographics on the CoRal test data</b>
   </summary>
+|  Category   | whisper-large-v3 | hviske-v2 | røst-whisper-large-v1 | røst-wav2vec2-315m-v1 | røst-wav2vec2-315m-v2 | røst-wav2vec2-1B-v2 | røst-wav2vec2-2B-v2 |
+| :---------: | :--------------: | :-------: | :-------------------: | :-------------------: | :-------------------: | :-----------------: | :-----------------: |
+|   female    |       12.3       |    5.4    |          5.1          |          7.4          |          7.2          |         7.3         |         7.2         |
+|    male     |       10.6       |    4.1    |          3.6          |          5.8          |          5.7          |         5.8         |         5.3         |
+|    0-25     |       9.1        |    3.8    |          3.4          |          5.4          |          5.3          |         5.1         |         4.7         |
+|    25-50    |       11.4       |    4.7    |          4.0          |          6.2          |          6.0          |         5.7         |         5.3         |
+|     50+     |       12.4       |    5.2    |          5.0          |          7.5          |          7.4          |         7.8         |         7.7         |
+| Bornholmsk  |       12.1       |    3.8    |          3.8          |          6.8          |          6.1          |         6.2         |         5.7         |
+|    Fynsk    |       12.0       |    5.9    |          5.1          |          7.4          |          7.2          |         6.9         |         6.1         |
+| Københavnsk |       5.6        |    2.1    |          1.9          |          3.3          |          3.2          |         3.0         |         2.6         |
+| Non-native  |       17.4       |    5.9    |          4.8          |          7.8          |          7.5          |         7.3         |         6.6         |
+|  Nordjysk   |       4.7        |    1.5    |          1.6          |          2.6          |          2.8          |         2.6         |         2.3         |
+| Sjællandsk  |       8.0        |    3.3    |          3.0          |          4.4          |          4.5          |         3.9         |         3.8         |
+|   Sydømål   |       7.7        |    4.3    |          4.1          |          6.4          |          6.4          |         6.5         |         5.8         |
+| Sønderjysk  |       20.0       |    9.4    |          8.8          |         11.9          |         11.6          |        12.6         |        13.3         |
+|  Vestjysk   |       17.6       |    7.2    |          6.4          |         10.1          |          9.8          |        10.5         |        10.8         |
+|   Østjysk   |       5.9        |    2.9    |          2.6          |          4.0          |          4.1          |         3.8         |         3.5         |
+|   Overall   |       11.4       |    4.7    |          4.3          |          6.6          |          6.5          |         6.5         |         6.2         |
 </details>
     <b>Detailed WER scores in % of evaluation across demographics on the CoRal test data</b>
   </summary>
+|  Category   | whisper-large-v3 | hviske-v2 | røst-whisper-large-v1 | røst-wav2vec2-315m-v1 | røst-wav2vec2-315m-v2 | røst-wav2vec2-1B-v2 | røst-wav2vec2-2B-v2 |
+| :---------: | :--------------: | :-------: | :-------------------: | :-------------------: | :-------------------: | :-----------------: | :-----------------: |
+|   female    |       30.2       |   12.7    |         11.5          |         18.5          |         17.7          |        17.8         |        17.8         |
+|    male     |       26.5       |   10.9    |          9.4          |         15.5          |         14.9          |        15.0         |        14.3         |
+|    0-25     |       24.1       |   10.3    |          9.0          |         14.7          |         14.0          |        13.7         |        12.9         |
+|    25-50    |       28.4       |   12.2    |         10.1          |         16.6          |         15.8          |        15.3         |        14.5         |
+|     50+     |       30.0       |   12.1    |         11.3          |         18.2          |         17.7          |        18.5         |        18.7         |
+| Bornholmsk  |       31.6       |   10.4    |          9.8          |         17.7          |         15.7          |        16.4         |        15.3         |
+|    Fynsk    |       29.3       |   14.3    |         12.1          |         18.3          |         17.7          |        16.7         |        15.2         |
+| Københavnsk |       16.8       |    6.7    |          5.9          |         10.2          |         10.0          |         9.5         |         8.4         |
+| Non-native  |       40.9       |   15.4    |         12.2          |         20.9          |         19.4          |        19.4         |        18.1         |
+|  Nordjysk   |       13.5       |    4.3    |          4.5          |          7.7          |          7.5          |         7.3         |         6.9         |
+| Sjællandsk  |       21.7       |    8.9    |          7.6          |         12.6          |         12.7          |        11.0         |        10.5         |
+|   Sydømål   |       19.2       |   10.4    |         10.0          |         14.9          |         15.3          |        14.4         |        13.7         |
+| Sønderjysk  |       44.3       |   19.0    |         17.5          |         26.0          |         25.4          |        27.8         |        29.6         |
+|  Vestjysk   |       42.0       |   17.7    |         15.0          |         26.3          |         25.2          |        26.7         |        28.3         |
+|   Østjysk   |       16.9       |    8.2    |          7.5          |         11.7          |         11.3          |        10.8         |        10.1         |
+|   Overall   |       28.3       |   11.8    |         10.4          |         17.0          |         16.3          |        16.4         |        16.0         |
 </details>
   | Model                                                                                               | Number of parameters |   Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.com/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
   | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | ---------------------------------------------------------------------------------------: |
+  | [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2)                                                                 |                   2B | Read-aloud and conversation |                               Yes |                                                                         **6.2% ± 0.2%** |                                                                         **16.0% ± 0.4%** |
+| [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2)                                                                  |                   2B | Read-aloud and conversation |                                No |                                                                             7.8% ± 0.2% |                                                                             23.0% ± 0.4% |
   | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2)     |                   1B | Read-aloud and conversation |                               Yes |                                                                         **6.5% ± 0.2%** |                                                                         **16.4% ± 0.4%** |
   | [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2)     |                   1B | Read-aloud and conversation |                                No |                                                                             8.1% ± 0.2% |                                                                             23.9% ± 0.4% |
   | CoRal-project/roest-wav2vec2-315M-v2 (This model) |                 315M | Read-aloud and conversation |                               Yes |                                                                         **6.5% ± 0.2%** |                                                                         **16.3% ± 0.4%** |
 | [CoRal-v2-conv](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main)  | 37.7          | 24.2  | 64.7       | 44.6  | 40.0             | 25.3  |
 | [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da)          | 28.4          | 12.4  | 34.9       | 12.6  | 28.5             | 12.3  |
 | [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 14.4          | 5.4   | 24.1       | 9.1   | 15.1             | 6.0   |
+| [AlvenirOss](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval)    | 11.3          | 4.4   | 20.8       | 8.2   | 11.4             | 4.5   |
+| [AlvenirWiki](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval)   | 8.0           | 3.0   | 13.5       | 4.5   | 8.2              | 3.0   |
 | [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs)      | 15.6          | 6.1   | 23.6       | 8.8   | 16.5             | 6.4   |
 </details>
 | [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da)                        | 29.8                      | 14.5      | 29.7                      | 13.9      | 28.4                      | 12.4      | 27.7                    | 11.9      | **27.0**                | **11.7**  |
 | [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6                      | 8.2       | 16.7                      | 6.6       | 14.4                      | 5.4       | 26.3                    | 10.9      | **12.0**                | **4.5**   |
 | [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs)                         | 12.6                      | **5.1**   | 16.6                      | 6.3       | 15.6                      | 6.1       | 13.7                    | 5.5       | **12.5**                | **5.1**   |
+| [AlvenirOss](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval)               | 9.2                       | 3.9       | 14.8                      | 6.0       | 11.3                      | 4.4       | 9.1                     | 3.6       | **8.1**                 | **3.1**   |
+| [AlvenirWiki](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval)              | 7.5                       | 2.8       | 7.9                       | 3.0       | 8.0                       | 3.0       | 7.2                     | 2.7       | **6.5**                 | **2.4**   |
 **OBS!** The vocab used for training includes numerals (0, 1, 2, ..., 9), which are translated to text in a post-processing step. If the model misses the spaces between numerals, adjacent digits are read as a single number, which especially affects the NST score because this dataset contains many numerals.
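
The numeral effect described in the note can be illustrated with a small sketch. It assumes a hypothetical lookup table `DANISH_NUMBERS` and a hypothetical helper `numerals_to_text`; the actual post-processing step used for the model is not shown in this change and likely covers far more cases.

```python
# Hypothetical lookup; a real post-processing step would cover all numbers.
DANISH_NUMBERS = {"1": "en", "2": "to", "12": "tolv"}


def numerals_to_text(transcript):
    """Replace each whitespace-separated numeral token with its Danish word."""
    return " ".join(DANISH_NUMBERS.get(tok, tok) for tok in transcript.split())


# Space predicted between the digits: two separate numbers.
print(numerals_to_text("tallene 1 2"))  # tallene en to

# Space missed by the model: the digits merge into one number.
print(numerals_to_text("tallene 12"))   # tallene tolv
```

In this example a single missed space turns two correct words into one wrong word, which counts as one substitution and one deletion at the word level.
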

images/cer.png DELETED

Binary file (66.5 kB)

images/cer_comparison-conv.png ADDED

images/cer_comparison-read-aloud.png ADDED

images/comparison-conversation-cer.png DELETED

Binary file (55.7 kB)

images/comparison-conversation-wer.png DELETED

Binary file (57.2 kB)

images/comparison-read_aloud-cer.png DELETED

Binary file (76.3 kB)

images/comparison-read_aloud-wer.png DELETED

Binary file (69.1 kB)

images/wer.png DELETED

Binary file (66.4 kB)

images/wer_comparison-conv.png ADDED

images/wer_comparison-read-aloud.png ADDED