CoRal-project/roest-wav2vec2-315m-v2

@@ -362,15 +362,17 @@ Comparison of results on different Danish benchmarks:
 The model was also tested against other datasets to evaluate generalizability:
-|                                                                                       | **Røst-whisper-large-v1** |           | **Røst-wav2vec2-315M-v1** |           | **Røst-wav2vec2-315M-v2** |           | **Røst-wav2vec2-1B-v2** |           | **Røst-wav2vec2-2B-v2** |           |
-| ------------------------------------------------------------------------------------- | ------------------------- | --------- | ------------------------- | --------- | ------------------------- | --------- | ----------------------- | --------- | ----------------------- | --------- |
-| **Evaluation Dataset**                                                                | **WER %**                 | **CER %** | **WER %**                 | **CER %** | **WER %**                 | **CER %** | **WER %**               | **CER %** | **WER %**               | **CER %** |
-| [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test)   | **10.4**                  | **4.3**   | 17.0                      | 6.6       | 16.3                      | 6.5       | 16.4                    | 6.5       | 16.0                    | 6.2       |
-| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da)                        | 29.8                      | 14.5      | 29.7                      | 13.9      | 28.4                      | 12.4      | 27.7                    | 11.9      | **27.0**                | **11.7**  |
-| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6                      | 8.2       | 16.7                      | 6.6       | 14.4                      | 5.4       | 26.3                    | 10.9      | **12.0**                | **4.5**   |
-| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs)                         | 12.6                      | **5.1**   | 16.6                      | 6.3       | 15.6                      | 6.1       | 13.7                    | 5.5       | **12.5**                | **5.1**   |
-| [AlvenirOss](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval)               | 9.2                       | 3.9       | 14.8                      | 6.0       | 11.3                      | 4.4       | 9.1                     | 3.6       | **8.1**                 | **3.1**   |
-| [AlvenirWiki](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval)              | 7.5                       | 2.8       | 7.9                       | 3.0       | 8.0                       | 3.0       | 7.2                     | 2.7       | **6.5**                 | **2.4**   |
 **Note:** The vocabulary used for training includes the numerals (0, 1, 2, ..., 9), which are converted to text in a post-processing step. If the model misses a space between two numerals, they are interpreted as a single number, which especially affects the NST score because that dataset contains many numerals.

 The model was also tested against other datasets to evaluate generalizability:
+|                                                                                       | **Røst-wav2vec2-2B-v2** |           | **Røst-wav2vec2-1B-v2** |           | **Røst-wav2vec2-315M-v2** |           | **Røst-wav2vec2-315M-v1** |           | **Røst-whisper-large-v1** |           |
+| ------------------------------------------------------------------------------------- | ----------------------- | --------- | ----------------------- | --------- | ------------------------- | --------- | ------------------------- | --------- | ------------------------- | --------- |
+| **Evaluation Dataset**                                                                | **WER %**               | **CER %** | **WER %**               | **CER %** | **WER %**                 | **CER %** | **WER %**                 | **CER %** | **WER %**                 | **CER %** |
+| [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test)   | 16.0                    | 6.2       | 16.4                    | 6.5       | 16.3                      | 6.5       | 17.0                      | 6.6       | **10.4**                  | **4.3**   |
+| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da)                        | **27.0**                | **11.7**  | 27.7                    | 11.9      | 28.4                      | 12.4      | 29.7                      | 13.9      | 29.8                      | 14.5      |
+| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | **12.0**                | **4.5**   | 26.3                    | 10.9      | 14.4                      | 5.4       | 16.7                      | 6.6       | 15.6                      | 8.2       |
+| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs)                         | **12.5**                | **5.1**   | 13.7                    | 5.5       | 15.6                      | 6.1       | 16.6                      | 6.3       | 12.6                      | **5.1**   |
+| [AlvenirOss](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval)             | **8.1**                 | **3.1**   | 9.1                     | 3.6       | 11.3                      | 4.4       | 14.8                      | 6.0       | 9.2                       | 3.9       |
+| [AlvenirWiki](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval)            | **6.5**                 | **2.4**   | 7.2                     | 2.7       | 8.0                       | 3.0       | 7.9                       | 3.0       | 7.5                       | 2.8       |
 **Note:** The vocabulary used for training includes the numerals (0, 1, 2, ..., 9), which are converted to text in a post-processing step. If the model misses a space between two numerals, they are interpreted as a single number, which especially affects the NST score because that dataset contains many numerals.
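
The effect described in the note can be illustrated with a small sketch. It is not the project's actual post-processing or evaluation code: the `DANISH_NUMBERS` lookup, the `numerals_to_text` helper, and the example sentences are all hypothetical, and WER is computed here with a plain word-level edit distance rather than whatever library the benchmark used.

```python
import re

# Hypothetical lookup for illustration only; a real post-processing
# step would need to cover arbitrary numbers.
DANISH_NUMBERS = {"1": "en", "2": "to", "12": "tolv"}


def numerals_to_text(transcript: str) -> str:
    """Replace each run of digits with its Danish word, if it is in the lookup."""
    return re.sub(
        r"\d+",
        lambda match: DANISH_NUMBERS.get(match.group(), match.group()),
        transcript,
    )


def edit_distance(ref: list, hyp: list) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_token in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_token in enumerate(hyp, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # deletion
                    curr[j - 1] + 1,  # insertion
                    prev[j - 1] + (ref_token != hyp_token),  # substitution
                )
            )
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "ring en to"

# The model emits the digits with a space between them: converted separately.
with_space = numerals_to_text("ring 1 2")
# The model misses the space: the digits are read as the single number 12.
missing_space = numerals_to_text("ring 12")

print(with_space, wer(reference, with_space))
print(missing_space, wer(reference, missing_space))
```

In this toy case one missing space in the raw output turns two correct words into one wrong word, so a single-character slip costs two word errors out of three. This is consistent with the note's point that datasets rich in numerals, such as NST, are hit hardest.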

images/cer_comparison_zero-shot_roest.png ADDED