Commit 255bb59 (parent: c6118ea): Rearrange order and add introduction text

README.md CHANGED
@@ -35,6 +35,8 @@ This is a Danish state-of-the-art speech recognition model, trained as part of t

This repository contains a Wav2Vec2 model trained on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main). The CoRal-v2 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects. The model is designed for automatic speech recognition (ASR).

## Quick Start

Start by installing the required libraries:
@@ -82,6 +84,22 @@ The model was evaluated using the following metrics:

- **Character Error Rate (CER)**: The percentage of characters incorrectly transcribed.
- **Word Error Rate (WER)**: The percentage of words incorrectly transcribed.
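Both metrics are Levenshtein edit distances normalized by the length of the reference, counted over words (WER) or characters (CER). The following is an illustrative sketch, not the project's evaluation code:

```python
# Illustrative sketch of WER/CER: edit distance between a reference and
# a hypothesis, divided by the reference length. NOT the project's
# actual evaluation code.
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

print(wer("det er en test", "det var en test"))  # 0.25 (1 of 4 words wrong)
```

In practice, ASR evaluations commonly compute these with a library such as `jiwer`, which implements the same normalized edit distance.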

### Conversational CoRal Performance
@@ -200,23 +218,6 @@ Note that the high generalization error on conversation data for models trained

</details>

-### Performance on Other Datasets
-
-The model was also tested against other datasets to evaluate generalizability:
-
-| | **Røst-whisper-large-v1** | | **Røst-wav2vec2-315M-v1** | | **Røst-wav2vec2-315M-v2** | | **Røst-wav2vec2-1B-v2** | | **Røst-wav2vec2-2B-v2** | |
-| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| **Evaluation Dataset** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** |
-| [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) | **10.4** | **4.3** | 17.0 | 6.6 | 16.3 | 6.5 | 16.4 | 6.5 | 16.0 | 6.2 |
-| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.8 | 14.5 | 29.7 | 13.9 | 28.4 | 12.4 | 27.7 | 11.9 | **27.0** | **11.7** |
-| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6 | 8.2 | 16.7 | 6.6 | 14.4 | 5.4 | 26.3 | 10.9 | **12.0** | **4.5** |
-| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | 12.6 | **5.1** | 16.6 | 6.3 | 15.6 | 6.1 | 13.7 | 5.5 | **12.5** | **5.1** |
-| [AlvenirOss](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 9.2 | 3.9 | 14.8 | 6.0 | 11.3 | 4.4 | 9.1 | 3.6 | **8.1** | **3.1** |
-| [AlvenirWiki](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 7.5 | 2.8 | 7.9 | 3.0 | 8.0 | 3.0 | 7.2 | 2.7 | **6.5** | **2.4** |
-
-**OBS!** The vocabulary used for training includes the numerals 0-9, which are translated to text in a post-processing step. If the model misses a space, adjacent digits are interpreted as a single number, which especially affects the NST score, as this dataset contains many numerals.
-
---

### Note on comparing whisper and wav2vec2 models
This repository contains a Wav2Vec2 model trained on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main). The CoRal-v2 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects. The model is designed for automatic speech recognition (ASR).

+The model has been evaluated comprehensively, and røst-wav2vec2-2B-v2 demonstrates superior performance across multiple test sets. It achieves the lowest error rates of all the models on the tentative [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) test set. Furthermore, it receives the lowest errors on multiple zero-shot test sets, achieving new state-of-the-art results in Danish ASR.
+
## Quick Start

Start by installing the required libraries:
- **Character Error Rate (CER)**: The percentage of characters incorrectly transcribed.
- **Word Error Rate (WER)**: The percentage of words incorrectly transcribed.

+### Zero-shot performance on open evaluation datasets
+
+To assess generalizability, the models were evaluated against multiple open-source datasets. Each of the røst-wav2vec2-v2 models improves on the previous state of the art (røst-whisper-large-v1), with the 2B model achieving new state-of-the-art results on all the zero-shot test sets. Røst-whisper-large-v1 still achieves lower error rates on the CoRal-v1 test set:
+
+| | **Røst-wav2vec2-2B-v2** | | **Røst-wav2vec2-1B-v2** | | **Røst-wav2vec2-315M-v2** | | **Røst-wav2vec2-315M-v1** | | **Røst-whisper-large-v1** | |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| **Evaluation Dataset** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** |
+| [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) | 16.0 | 6.2 | 16.4 | 6.5 | 16.3 | 6.5 | 17.0 | 6.6 | **10.4** | **4.3** |
+| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | **27.0** | **11.7** | 27.7 | 11.9 | 28.4 | 12.4 | 29.7 | 13.9 | 29.8 | 14.5 |
+| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | **12.0** | **4.5** | 26.3 | 10.9 | 14.4 | 5.4 | 16.7 | 6.6 | 15.6 | 8.2 |
+| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | **12.5** | **5.1** | 13.7 | 5.5 | 15.6 | 6.1 | 16.6 | 6.3 | 12.6 | **5.1** |
+| [AlvenirOss](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | **8.1** | **3.1** | 9.1 | 3.6 | 11.3 | 4.4 | 14.8 | 6.0 | 9.2 | 3.9 |
+| [AlvenirWiki](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | **6.5** | **2.4** | 7.2 | 2.7 | 8.0 | 3.0 | 7.9 | 3.0 | 7.5 | 2.8 |
+
+**OBS!** The vocabulary used for training includes the numerals 0-9, which are translated to text in a post-processing step. If the model misses a space, adjacent digits are interpreted as a single number, which especially affects the NST score, as this dataset contains many numerals.
+

### Conversational CoRal Performance
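The failure mode the numerals note warns about can be illustrated with a small sketch. The digit-to-word table and `postprocess` helper below are illustrative assumptions (limited to 0-99), not the model's actual post-processing code:

```python
import re

# Illustrative sketch of digit-to-text post-processing for Danish,
# handling 0-99 only. NOT the model's actual post-processing code.
ONES = ["nul", "en", "to", "tre", "fire", "fem", "seks", "syv", "otte", "ni"]
TEENS = ["ti", "elleve", "tolv", "tretten", "fjorten",
         "femten", "seksten", "sytten", "atten", "nitten"]
TENS = {2: "tyve", 3: "tredive", 4: "fyrre", 5: "halvtreds",
        6: "tres", 7: "halvfjerds", 8: "firs", 9: "halvfems"}

def number_to_danish(n):
    """Spell out an integer 0-99 in Danish."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{ONES[ones]}og{TENS[tens]}"

def postprocess(text):
    # Each run of digits is spelled out as one number, so a missing
    # space changes the transcript entirely:
    #   "2 5" -> "to fem", but "25" -> "femogtyve".
    return re.sub(r"\d+", lambda m: number_to_danish(int(m.group())), text)

print(postprocess("2 5"))  # to fem
print(postprocess("25"))   # femogtyve
```

This is why datasets with many numerals, such as NST, are hit hardest by missed spaces.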
</details>

---

### Note on comparing whisper and wav2vec2 models