Plots and tables updated with 2B model included

Files changed (11)

README.md +84 -68
images/cer.png +0 -0
images/cer_comparison-conv.png +0 -0
images/cer_comparison-read-aloud.png +0 -0
images/comparison-conversation-cer.png +0 -0
images/comparison-conversation-wer.png +0 -0
images/comparison-read_aloud-cer.png +0 -0
images/comparison-read_aloud-wer.png +0 -0
images/wer.png +0 -0
images/wer_comparison-conv.png +0 -0
images/wer_comparison-read-aloud.png +0 -0

README.md CHANGED

@@ -91,37 +91,42 @@ The results are tentative as the test set only includes 5 unique speakers, of wh
 Note that the high generalization error on conversation data for models trained on read-aloud data is still being analyzed.
 | Model                                                                                               | Number of parameters |   Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
 | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
-| CoRal-project/roest-wav2vec2-1B-v2 (This model)     |                   1B | Read-aloud and conversation |                                                                                                         **23.9%**|                                                                                                         **36.7%** |
-| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) |                 315M | Read-aloud and conversation |                                                                                                         24.2% |                                                                                                         37.7% |
-| [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1)              |                1540M |                  Read-aloud |                                                                                                          138% |                                                                                                          121% |
-| [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1)             |                 315M |                  Read-aloud |                                                                                                          123% |                                                                                                         80.5% |
-| [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2)                                     |                1540M |                  Read-aloud |                                                                                                         78.2% |                                                                                                         72.6% |
 | [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3)                                    |                1540M |                           - |                                                                                                         46.4% |                                                                                                         57.4% |
-<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/comparison-conversation-cer.png">
-<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/comparison-conversation-wer.png">
 ### Read-aloud CoRal Performance
-| Model                                                                                            | Number of parameters |   Finetuned on data of type | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) WER |
-| :----------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
-| CoRal-project/roest-wav2vec2-1B-v2 (This model) |                 1B | Read-aloud and conversation |                                                                             6.5% ± 0.2% |                                                                            16.4% ± 0.4% |
-| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) |                 315M | Read-aloud and conversation |                                                                             6.5% ± 0.2% |                                                                            16.3% ± 0.4% |
-| [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1)            |                1540M |                  Read-aloud |                                                                         **4.3% ± 0.2%** |                                                                        **10.4% ± 0.3%** |
-| [CoRal-project/roest-wav2vec2-315M-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315M-v1)                      |                 315M |                  Read-aloud |                                                                             6.6% ± 0.2% |                                                                            17.0% ± 0.4% |
-| [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2)                                  |                1540M |                  Read-aloud |                                                                             4.7% ± 0.2% |                                                                            11.8% ± 0.3% |
-| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3)                                 |                1540M |                           - |                                                                            11.4% ± 0.3% |                                                                            28.3% ± 0.6% |
 **Note:** The benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than reported in the model card.
-<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/comparison-read_aloud-cer.png">
-<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/comparison-read_aloud-wer.png">
 <details>
@@ -129,24 +134,26 @@ Note that the high generalization error on conversation data for models trained
     <b>Detailed CER scores in % of evaluation across demographics on the CoRal test data</b>
   </summary>
-  | Category | Røst-whisper-large-v1 | Røst-wav2vec2-315m-v1 | Røst-wav2vec2-315m-v2 | Røst-wav2vec2-1B-v2 |
-  |:---:|:---:|:---:|:---:|:---:|
-  | female | 5.1 | 7.4 | 7.2 | 7.3 |
-  | male | 3.6 | 5.8 | 5.7 | 5.8 |
-  | 0-25 | 3.4 | 5.4 | 5.3 | 5.1 |
-  | 25-50 | 4.0 | 6.2 | 6.0 | 5.7 |
-  | 50+ | 5.0 | 7.5 | 7.4 | 7.8 |
-  | Bornholmsk | 3.8 | 6.8 | 6.1 | 6.2 |
-  | Fynsk | 5.1 | 7.4 | 7.2 | 6.9 |
-  | Københavnsk | 1.9 | 3.3 | 3.2 | 3.0 |
-  | Non-native | 4.8 | 7.8 | 7.5 | 7.3 |
-  | Nordjysk | 1.6 | 2.6 | 2.8 | 2.6 |
-  | Sjællandsk | 3.0 | 4.4 | 4.5 | 3.9 |
-  | Sydømål | 4.1 | 6.4 | 6.4 | 6.5 |
-  | Sønderjysk | 8.8 | 11.9 | 11.6 | 12.6 |
-  | Vestjysk | 6.4 | 10.1 | 9.8 | 10.5 |
-  | Østjysk | 2.6 | 4.0 | 4.1 | 3.8 |
-  | Overall | 4.3 | 6.6 | 6.5 | 6.5 |
 </details>
@@ -155,24 +162,26 @@ Note that the high generalization error on conversation data for models trained
     <b>Detailed WER scores in % of evaluation across demographics on the CoRal test data</b>
   </summary>
-  | Category | Røst-whisper-large-v1 | Røst-wav2vec2-315m-v1 | Røst-wav2vec2-315m-v2 | Røst-wav2vec2-1B-v2 |
-  |:---:|:---:|:---:|:---:|:---:|
-  | female | 11.5 | 18.5 | 17.7 | 17.8 |
-  | male | 9.4 | 15.5 | 14.9 | 15.0 |
-  | 0-25 | 9.0 | 14.7 | 14.0 | 13.7 |
-  | 25-50 | 10.1 | 16.6 | 15.8 | 15.3 |
-  | 50+ | 11.3 | 18.2 | 17.7 | 18.5 |
-  | Bornholmsk | 9.8 | 17.7 | 15.7 | 16.4 |
-  | Fynsk | 12.1 | 18.3 | 17.7 | 16.7 |
-  | Københavnsk | 5.9 | 10.2 | 10.0 | 9.5 |
-  | Non-native | 12.2 | 20.9 | 19.4 | 19.4 |
-  | Nordjysk | 4.5 | 7.7 | 7.5 | 7.3 |
-  | Sjællandsk | 7.6 | 12.6 | 12.7 | 11.0 |
-  | Sydømål | 10.0 | 14.9 | 15.3 | 14.4 |
-  | Sønderjysk | 17.5 | 26.0 | 25.4 | 27.8 |
-  | Vestjysk | 15.0 | 26.3 | 25.2 | 26.7 |
-  | Østjysk | 7.5 | 11.7 | 11.3 | 10.8 |
-  | Overall | 10.4 | 17.0 | 16.3 | 16.4 |
 </details>
@@ -183,14 +192,17 @@ Note that the high generalization error on conversation data for models trained
   The inclusion of a post-processing language model can affect the performance significantly. The Røst-v1 and Røst-v2 models are using the same Language Model (LM). The utilized LM is the one trained and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1).
-| Model                                                                                         | Number of parameters |   Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.com/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
-| :-------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
-| CoRal-project/roest-wav2vec2-1B-v2 (This model) |                 1B | Read-aloud and conversation |                               Yes |                                                                         **6.5% ± 0.2%** |                                                                        **16.4% ± 0.4%** |
-| CoRal-project/roest-wav2vec2-1B-v2 |                 1B | Read-aloud and conversation |                                No |                                                                             8.1% ± 0.2% |                                                                            23.9% ± 0.4% |
-| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) |                 315M | Read-aloud and conversation |                               Yes |                                                                         **6.5% ± 0.2%** |                                                                        **16.3% ± 0.4%** |
-| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) |                 315M | Read-aloud and conversation |                                No |                                                                             8.2% ± 0.2% |                                                                            25.1% ± 0.4% |
-| [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1)                   |                 315M |                  Read-aloud |                               Yes |                                                                             6.6% ± 0.2% |                                                                            17.0% ± 0.4% |
-| [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1)                   |                 315M |                  Read-aloud |                                No |                                                                             8.6% ± 0.2% |                                                                            26.3% ± 0.5% |
 </details>
@@ -198,13 +210,17 @@ Note that the high generalization error on conversation data for models trained
 ### Performance on Other Datasets
 The model was also tested against other datasets to evaluate generalizability:
-|                                                                                       | **Røst-whisper-large-v1** |         | **Røst-wav2vec2-315M-v1** |       | **Røst-wav2vec2-315M-v2** |         | **Røst-wav2vec2-1B-v2** |         |
-| ------------------------------------------------------------------------------------- | -------------------------- | ------- | -------------------------- | ----- | -------------------------- | ------- | ------------------------ | ------- |
-| **Evaluation Dataset**                                                                | **WER %**                  | **CER %** | **WER %**                  | **CER %** | **WER %**                  | **CER %**   | **WER %**                | **CER %** |
-| [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test)   | **10.4**                   | **4.3** | 17.0                       | 6.6   | 16.3                  | 6.5 | 16.4                     | **6.5** |
-| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da)                        | 29.8                       | 14.5    | 29.7                       | 13.9  |  28.4                       | 12.4    | **12.4**                 | **4.9** |
-| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6                       | 8.2     | 16.7                       | 6.6   | **14.4**                   | **5.4** | 26.3                     | 10.9    |
-| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs)                         | **12.6**                   | **5.1** | 16.6                       | 6.3   | 15.6                       | 6.1     | 13.7                 | 5.5 |
 **Note:** The vocabulary used for training includes numerals (0, 1, 2, ..., 9), which are translated to text in a post-processing step. If the model misses a space between numerals, adjacent digits are read as a single number. This particularly affects the NST score, because that dataset contains many numerals.

 Note that the high generalization error on conversation data for models trained on read-aloud data is still being analyzed.
 | Model                                                                                               | Number of parameters |   Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
 | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
+| [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2)     |                   2B | Read-aloud and conversation |                                                                                                     **23.6%** |                                                                                                     **34.3%** |
+| CoRal-project/roest-wav2vec2-1B-v2 (This model)    |                   1B | Read-aloud and conversation |                                                                                                         23.9% |                                                                                                         36.7% |
+| [CoRal-project/roest-wav2vec2-315m-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2)                                                 |                 315M | Read-aloud and conversation |                                                                                                         24.2% |                                                                                                         37.7% |
+| [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) |                1540M |                  Read-aloud |                                                                                                          138% |                                                                                                          121% |
+| [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) |                 315M |                  Read-aloud |                                                                                                          123% |                                                                                                         80.5% |
+| [syvai/hviske-v2](https://huggingface.co/syvai/hviske-v2)                                           |                1540M |                  Read-aloud |                                                                                                         78.2% |                                                                                                         72.6% |
 | [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3)                                    |                1540M |                           - |                                                                                                         46.4% |                                                                                                         57.4% |
+<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/cer_comparison-conv.png">
+<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/wer_comparison-conv.png">
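Error rates above 100% in the table are possible because inserted words or characters count as errors while the denominator is the length of the reference only. The following is a minimal sketch, assuming the standard definition (substitutions + deletions + insertions) / reference length; it is not the project's evaluation code, it applies no text normalisation, and the example strings are invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance to an empty hypothesis: i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate as a fraction of the reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate as a fraction of the reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Invented example: a two-word reference and a hypothesis with five extra words.
example_wer = wer("ja tak", "ja ja tak tak det er fint")
print(f"WER: {example_wer:.0%}")
```

Here the hypothesis needs five insertions against a two-word reference, so the WER is 5 / 2 = 250%. Repeated or extra output on short utterances is one way such values can arise; the table alone does not show whether that is what happens for the models above.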
 ### Read-aloud CoRal Performance
+| Model                                                                                               | Number of parameters |   Finetuned on data of type | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) WER |
+| :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
+| [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2)     |                   2B | Read-aloud and conversation |                                                                             6.2% ± 0.2% |                                                                            16.0% ± 0.4% |
+| CoRal-project/roest-wav2vec2-1B-v2 (This model)     |                   1B | Read-aloud and conversation |                                                                             6.5% ± 0.2% |                                                                            16.4% ± 0.4% |
+| [CoRal-project/roest-wav2vec2-315m-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2)                                                   |                 315M | Read-aloud and conversation |                                                                             6.5% ± 0.2% |                                                                            16.3% ± 0.4% |
+| [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) |                1540M |                  Read-aloud |                                                                         **4.3% ± 0.2%** |                                                                        **10.4% ± 0.3%** |
+| [CoRal-project/roest-wav2vec2-315M-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315M-v1) |                 315M |                  Read-aloud |                                                                             6.6% ± 0.2% |                                                                            17.0% ± 0.4% |
+| [syvai/hviske-v2](https://huggingface.co/syvai/hviske-v2)                                           |                1540M |                  Read-aloud |                                                                             4.7% ± 0.2% |                                                                            11.8% ± 0.3% |
+| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3)                                    |                1540M |                           - |                                                                            11.4% ± 0.3% |                                                                            28.3% ± 0.6% |
 **Note:** The benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than reported in the model card.
+<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/cer_comparison-read-aloud.png">
+<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/wer_comparison-read-aloud.png">
 <details>
     <b>Detailed CER scores in % of evaluation across demographics on the CoRal test data</b>
   </summary>
+|  Category   | whisper-large-v3 | hviske-v2 | røst-whisper-large-v1 | røst-wav2vec2-315m-v1 | røst-wav2vec2-315m-v2 | røst-wav2vec2-1B-v2 | røst-wav2vec2-2B-v2 |
+| :---------: | :--------------: | :-------: | :-------------------: | :-------------------: | :-------------------: | :-----------------: | :-----------------: |
+|   female    |       12.3       |    5.4    |          5.1          |          7.4          |          7.2          |         7.3         |         7.2         |
+|    male     |       10.6       |    4.1    |          3.6          |          5.8          |          5.7          |         5.8         |         5.3         |
+|    0-25     |       9.1        |    3.8    |          3.4          |          5.4          |          5.3          |         5.1         |         4.7         |
+|    25-50    |       11.4       |    4.7    |          4.0          |          6.2          |          6.0          |         5.7         |         5.3         |
+|     50+     |       12.4       |    5.2    |          5.0          |          7.5          |          7.4          |         7.8         |         7.7         |
+| Bornholmsk  |       12.1       |    3.8    |          3.8          |          6.8          |          6.1          |         6.2         |         5.7         |
+|    Fynsk    |       12.0       |    5.9    |          5.1          |          7.4          |          7.2          |         6.9         |         6.1         |
+| Københavnsk |       5.6        |    2.1    |          1.9          |          3.3          |          3.2          |         3.0         |         2.6         |
+| Non-native  |       17.4       |    5.9    |          4.8          |          7.8          |          7.5          |         7.3         |         6.6         |
+|  Nordjysk   |       4.7        |    1.5    |          1.6          |          2.6          |          2.8          |         2.6         |         2.3         |
+| Sjællandsk  |       8.0        |    3.3    |          3.0          |          4.4          |          4.5          |         3.9         |         3.8         |
+|   Sydømål   |       7.7        |    4.3    |          4.1          |          6.4          |          6.4          |         6.5         |         5.8         |
+| Sønderjysk  |       20.0       |    9.4    |          8.8          |         11.9          |         11.6          |        12.6         |        13.3         |
+|  Vestjysk   |       17.6       |    7.2    |          6.4          |         10.1          |          9.8          |        10.5         |        10.8         |
+|   Østjysk   |       5.9        |    2.9    |          2.6          |          4.0          |          4.1          |         3.8         |         3.5         |
+|   Overall   |       11.4       |    4.7    |          4.3          |          6.6          |          6.5          |         6.5         |         6.2         |
 </details>
     <b>Detailed WER scores in % of evaluation across demographics on the CoRal test data</b>
   </summary>
+|  Category   | whisper-large-v3 | hviske-v2 | røst-whisper-large-v1 | røst-wav2vec2-315m-v1 | røst-wav2vec2-315m-v2 | røst-wav2vec2-1B-v2 | røst-wav2vec2-2B-v2 |
+| :---------: | :--------------: | :-------: | :-------------------: | :-------------------: | :-------------------: | :-----------------: | :-----------------: |
+|   female    |       30.2       |   12.7    |         11.5          |         18.5          |         17.7          |        17.8         |        17.8         |
+|    male     |       26.5       |   10.9    |          9.4          |         15.5          |         14.9          |        15.0         |        14.3         |
+|    0-25     |       24.1       |   10.3    |          9.0          |         14.7          |         14.0          |        13.7         |        12.9         |
+|    25-50    |       28.4       |   12.2    |         10.1          |         16.6          |         15.8          |        15.3         |        14.5         |
+|     50+     |       30.0       |   12.1    |         11.3          |         18.2          |         17.7          |        18.5         |        18.7         |
+| Bornholmsk  |       31.6       |   10.4    |          9.8          |         17.7          |         15.7          |        16.4         |        15.3         |
+|    Fynsk    |       29.3       |   14.3    |         12.1          |         18.3          |         17.7          |        16.7         |        15.2         |
+| Københavnsk |       16.8       |    6.7    |          5.9          |         10.2          |         10.0          |         9.5         |         8.4         |
+| Non-native  |       40.9       |   15.4    |         12.2          |         20.9          |         19.4          |        19.4         |        18.1         |
+|  Nordjysk   |       13.5       |    4.3    |          4.5          |          7.7          |          7.5          |         7.3         |         6.9         |
+| Sjællandsk  |       21.7       |    8.9    |          7.6          |         12.6          |         12.7          |        11.0         |        10.5         |
+|   Sydømål   |       19.2       |   10.4    |         10.0          |         14.9          |         15.3          |        14.4         |        13.7         |
+| Sønderjysk  |       44.3       |   19.0    |         17.5          |         26.0          |         25.4          |        27.8         |        29.6         |
+|  Vestjysk   |       42.0       |   17.7    |         15.0          |         26.3          |         25.2          |        26.7         |        28.3         |
+|   Østjysk   |       16.9       |    8.2    |          7.5          |         11.7          |         11.3          |        10.8         |        10.1         |
+|   Overall   |       28.3       |   11.8    |         10.4          |         17.0          |         16.3          |        16.4         |        16.0         |
 </details>
   The inclusion of a post-processing language model can affect the performance significantly. The Røst-v1 and Røst-v2 models are using the same Language Model (LM). The utilized LM is the one trained and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1).
+| Model                                                                                               | Number of parameters |   Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER  |
+| :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | ---------------------------------------------------------------------------------------: |
+| [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2)                                                                 |                   2B | Read-aloud and conversation |                               Yes |                                                                         **6.2% ± 0.2%** |                                                                         **16.0% ± 0.4%** |
+| [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2)                                                                  |                   2B | Read-aloud and conversation |                                No |                                                                             7.8% ± 0.2% |                                                                             23.0% ± 0.4% |
+| CoRal-project/roest-wav2vec2-1B-v2     |                   1B | Read-aloud and conversation |                               Yes |                                                                         **6.5% ± 0.2%** |                                                                         **16.4% ± 0.4%** |
+| CoRal-project/roest-wav2vec2-1B-v2     |                   1B | Read-aloud and conversation |                                No |                                                                             8.1% ± 0.2% |                                                                             23.9% ± 0.4% |
+| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) |                 315M | Read-aloud and conversation |                               Yes |                                                                         **6.5% ± 0.2%** |                                                                         **16.3% ± 0.4%** |
+| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) |                 315M | Read-aloud and conversation |                                No |                                                                             8.2% ± 0.2% |                                                                             25.1% ± 0.4% |
+| [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) |                 315M |                  Read-aloud |                               Yes |                                                                             6.6% ± 0.2% |                                                                             17.0% ± 0.4% |
+| [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) |                 315M |                  Read-aloud |                                No |                                                                             8.6% ± 0.2% |                                                                             26.3% ± 0.5% |
 </details>
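To put the effect of the language model in relative terms, the WER values with and without LM post-processing can be compared per model. The snippet below is a small sketch of that arithmetic; the numbers are copied from the language-model table above, and the helper name `relative_reduction` is introduced here for illustration only.

```python
# (WER without LM, WER with LM) in percent, copied from the table above.
wer_without_and_with_lm = {
    "roest-wav2vec2-2B-v2": (23.0, 16.0),
    "roest-wav2vec2-1B-v2": (23.9, 16.4),
    "roest-wav2vec2-315M-v2": (25.1, 16.3),
    "roest-wav2vec2-315m-v1": (26.3, 17.0),
}


def relative_reduction(before, after):
    """Relative reduction in percent when going from `before` to `after`."""
    return 100.0 * (before - after) / before


reductions = {
    name: relative_reduction(before, after)
    for name, (before, after) in wer_without_and_with_lm.items()
}

for name, reduction in reductions.items():
    print(f"{name}: {reduction:.1f}% relative WER reduction with LM")
```

For the 2B model this gives (23.0 - 16.0) / 23.0, roughly a 30% relative reduction. The reported ± intervals are not propagated here, so small differences between models should not be over-interpreted.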
 ### Performance on Other Datasets
 The model was also tested against other datasets to evaluate generalizability:
+|                                                                                       | **Røst-whisper-large-v1** |           | **Røst-wav2vec2-315M-v1** |           | **Røst-wav2vec2-315M-v2** |           | **Røst-wav2vec2-1B-v2** |           | **Røst-wav2vec2-2B-v2** |           |
+| ------------------------------------------------------------------------------------- | ------------------------- | --------- | ------------------------- | --------- | ------------------------- | --------- | ----------------------- | --------- | ----------------------- | --------- |
+| **Evaluation Dataset**                                                                | **WER %**                 | **CER %** | **WER %**                 | **CER %** | **WER %**                 | **CER %** | **WER %**               | **CER %** | **WER %**               | **CER %** |
+| [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test)   | **10.4**                  | **4.3**   | 17.0                      | 6.6       | 16.3                      | 6.5       | 16.4                    | 6.5       | 16.0                    | 6.2       |
+| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da)                        | 29.8                      | 14.5      | 29.7                      | 13.9      | 28.4                      | 12.4      | 27.7                    | 11.9      | **27.0**                | **11.7**  |
+| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6                      | 8.2       | 16.7                      | 6.6       | 14.4                      | 5.4       | 26.3                    | 10.9      | **12.0**                | **4.5**   |
+| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs)                         | 12.6                      | **5.1**   | 16.6                      | 6.3       | 15.6                      | 6.1       | 13.7                    | 5.5       | **12.5**                | **5.1**   |
+| [AlvenirOss](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval)             | 9.2                       | 3.9       | 14.8                      | 6.0       | 11.3                      | 4.4       | 9.1                     | 3.6       | **8.1**                 | **3.1**   |
+| [AlvenirWiki](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval)            | 7.5                       | 2.8       | 7.9                       | 3.0       | 8.0                       | 3.0       | 7.2                     | 2.7       | **6.5**                 | **2.4**   |
 **Note:** The vocabulary used for training includes numerals (0, 1, 2, ..., 9), which are translated to text in a post-processing step. If the model misses a space between numerals, adjacent digits are read as a single number. This particularly affects the NST score, because that dataset contains many numerals.
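The numeral issue can be illustrated with a toy version of the post-processing step. This is a hedged sketch, not the project's implementation: the lookup table `DANISH_NUMBERS` and the function `numerals_to_text` are hypothetical and cover only a handful of numbers.

```python
import re

# Hypothetical lookup table, only large enough for this illustration.
DANISH_NUMBERS = {
    "0": "nul", "1": "en", "2": "to", "3": "tre", "4": "fire",
    "5": "fem", "6": "seks", "7": "syv", "8": "otte", "9": "ni",
    "12": "tolv",
}


def numerals_to_text(transcript):
    """Replace each maximal run of digits with its word form, if known."""
    return re.sub(
        r"\d+",
        lambda match: DANISH_NUMBERS.get(match.group(), match.group()),
        transcript,
    )


with_space = numerals_to_text("kode 1 2")    # the model kept the space
without_space = numerals_to_text("kode 12")  # the model missed the space

print(with_space)
print(without_space)
```

With the space, the digits become "en to"; without it, they are read as the single number "tolv". One missing space therefore turns two correct words into one wrong word, which counts as more than one word error against the reference.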

images/cer.png DELETED

Binary file (66.5 kB)

images/cer_comparison-conv.png ADDED

images/cer_comparison-read-aloud.png ADDED

images/comparison-conversation-cer.png DELETED

Binary file (55.7 kB)

images/comparison-conversation-wer.png DELETED

Binary file (57.2 kB)

images/comparison-read_aloud-cer.png DELETED

Binary file (76.3 kB)

images/comparison-read_aloud-wer.png DELETED

Binary file (69.1 kB)

images/wer.png DELETED

Binary file (66.4 kB)

images/wer_comparison-conv.png ADDED

images/wer_comparison-read-aloud.png ADDED