Commit 255bb59 (parent: c6118ea): Rearrange order and add introduction text

README.md CHANGED
@@ -35,6 +35,8 @@ This is a Danish state-of-the-art speech recognition model, trained as part of t

This repository contains a Wav2Vec2 model trained on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main). The CoRal-v2 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects. The model is designed for automatic speech recognition (ASR).

## Quick Start

Start by installing the required libraries:
@@ -82,6 +84,22 @@ The model was evaluated using the following metrics:

- **Character Error Rate (CER)**: The percentage of characters incorrectly transcribed.
- **Word Error Rate (WER)**: The percentage of words incorrectly transcribed.
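Both metrics are Levenshtein edit distances normalized by the length of the reference, counted over words (WER) or characters (CER). The following is an illustrative sketch, not the project's evaluation code:

```python
# Illustrative sketch of WER/CER: edit distance between a reference and
# a hypothesis, divided by the reference length. NOT the project's
# actual evaluation code.
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (free if equal)
            ))
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

print(wer("det er en test", "det var en test"))  # 0.25 (1 of 4 words wrong)
```

In practice, ASR evaluations commonly compute these with a library such as `jiwer`, which implements the same normalized edit distance.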

### Conversational CoRal Performance
@@ -200,23 +218,6 @@ Note that the high generalization error on conversation data for models trained

</details>

-### Performance on Other Datasets
-
-The model was also tested against other datasets to evaluate generalizability:
-
-| | **Røst-whisper-large-v1** | | **Røst-wav2vec2-315M-v1** | | **Røst-wav2vec2-315M-v2** | | **Røst-wav2vec2-1B-v2** | | **Røst-wav2vec2-2B-v2** | |
-| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| **Evaluation Dataset** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** |
-| [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) | **10.4** | **4.3** | 17.0 | 6.6 | 16.3 | 6.5 | 16.4 | 6.5 | 16.0 | 6.2 |
-| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.8 | 14.5 | 29.7 | 13.9 | 28.4 | 12.4 | 27.7 | 11.9 | **27.0** | **11.7** |
-| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6 | 8.2 | 16.7 | 6.6 | 14.4 | 5.4 | 26.3 | 10.9 | **12.0** | **4.5** |
-| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | 12.6 | **5.1** | 16.6 | 6.3 | 15.6 | 6.1 | 13.7 | 5.5 | **12.5** | **5.1** |
-| [AlvenirOss](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 9.2 | 3.9 | 14.8 | 6.0 | 11.3 | 4.4 | 9.1 | 3.6 | **8.1** | **3.1** |
-| [AlvenirWiki](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 7.5 | 2.8 | 7.9 | 3.0 | 8.0 | 3.0 | 7.2 | 2.7 | **6.5** | **2.4** |
-
-**OBS!** The vocabulary used for training includes the numerals 0-9, which are translated to text in a post-processing step. If the model misses a space, adjacent digits are interpreted as a single number, which especially affects the NST score, as this dataset contains many numerals.
-
---

### Note on comparing whisper and wav2vec2 models
This repository contains a Wav2Vec2 model trained on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main). The CoRal-v2 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects. The model is designed for automatic speech recognition (ASR).

+The model has been evaluated comprehensively, and røst-wav2vec2-2B-v2 demonstrates superior performance across multiple test sets. It achieves the lowest error rates of all the models on the tentative [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) test set. Furthermore, it receives the lowest errors on multiple zero-shot test sets, achieving new state-of-the-art results in Danish ASR.
+
## Quick Start

Start by installing the required libraries:
- **Character Error Rate (CER)**: The percentage of characters incorrectly transcribed.
- **Word Error Rate (WER)**: The percentage of words incorrectly transcribed.

+### Zero-shot performance on open evaluation datasets
+
+To assess generalizability, the models were evaluated against multiple open-source datasets. Each of the røst-wav2vec2-v2 models improves on the previous state of the art (røst-whisper-large-v1), with the 2B model achieving new state-of-the-art results on all the zero-shot test sets. Røst-whisper-large-v1 still achieves lower error rates on the CoRal-v1 test set:
+
+| | **Røst-wav2vec2-2B-v2** | | **Røst-wav2vec2-1B-v2** | | **Røst-wav2vec2-315M-v2** | | **Røst-wav2vec2-315M-v1** | | **Røst-whisper-large-v1** | |
+| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
+| **Evaluation Dataset** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** |
+| [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) | 16.0 | 6.2 | 16.4 | 6.5 | 16.3 | 6.5 | 17.0 | 6.6 | **10.4** | **4.3** |
+| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | **27.0** | **11.7** | 27.7 | 11.9 | 28.4 | 12.4 | 29.7 | 13.9 | 29.8 | 14.5 |
+| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | **12.0** | **4.5** | 26.3 | 10.9 | 14.4 | 5.4 | 16.7 | 6.6 | 15.6 | 8.2 |
+| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | **12.5** | **5.1** | 13.7 | 5.5 | 15.6 | 6.1 | 16.6 | 6.3 | 12.6 | **5.1** |
+| [AlvenirOss](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | **8.1** | **3.1** | 9.1 | 3.6 | 11.3 | 4.4 | 14.8 | 6.0 | 9.2 | 3.9 |
+| [AlvenirWiki](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | **6.5** | **2.4** | 7.2 | 2.7 | 8.0 | 3.0 | 7.9 | 3.0 | 7.5 | 2.8 |
+
+**OBS!** The vocabulary used for training includes the numerals 0-9, which are translated to text in a post-processing step. If the model misses a space, adjacent digits are interpreted as a single number, which especially affects the NST score, as this dataset contains many numerals.
+

### Conversational CoRal Performance
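The failure mode the numerals note warns about can be illustrated with a small sketch. The digit-to-word table and `postprocess` helper below are illustrative assumptions (limited to 0-99), not the model's actual post-processing code:

```python
import re

# Illustrative sketch of digit-to-text post-processing for Danish,
# handling 0-99 only. NOT the model's actual post-processing code.
ONES = ["nul", "en", "to", "tre", "fire", "fem", "seks", "syv", "otte", "ni"]
TEENS = ["ti", "elleve", "tolv", "tretten", "fjorten",
         "femten", "seksten", "sytten", "atten", "nitten"]
TENS = {2: "tyve", 3: "tredive", 4: "fyrre", 5: "halvtreds",
        6: "tres", 7: "halvfjerds", 8: "firs", 9: "halvfems"}

def number_to_danish(n):
    """Spell out an integer 0-99 in Danish."""
    if n < 10:
        return ONES[n]
    if n < 20:
        return TEENS[n - 10]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{ONES[ones]}og{TENS[tens]}"

def postprocess(text):
    # Each run of digits is spelled out as one number, so a missing
    # space changes the transcript entirely:
    #   "2 5" -> "to fem", but "25" -> "femogtyve".
    return re.sub(r"\d+", lambda m: number_to_danish(int(m.group())), text)

print(postprocess("2 5"))  # to fem
print(postprocess("25"))   # femogtyve
```

This is why datasets with many numerals, such as NST, are hit hardest by missed spaces.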
</details>

---

### Note on comparing whisper and wav2vec2 models