MarieAlvenir committed on
Commit
a3090ea
·
1 Parent(s): 90809fc

Plots and tables updated with 2B model included

README.md CHANGED
@@ -91,37 +91,42 @@ The results are tentative as the test set only includes 5 unique speakers, of wh
91
 
92
  Note that the high generalization error on conversation data for models trained on read-aloud data is still being analyzed.
93
 
 
94
  | Model | Number of parameters | Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
95
  | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
96
- | CoRal-project/roest-wav2vec2-1B-v2 (This model) | 1B | Read-aloud and conversation | **23.9%**| **36.7%** |
97
- | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 24.2% | 37.7% |
98
- | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | 138% | 121% |
99
- | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | 123% | 80.5% |
100
- | [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 78.2% | 72.6% |
 
101
  | [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 46.4% | 57.4% |
102
 
103
- <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/comparison-conversation-cer.png">
104
 
105
- <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/comparison-conversation-wer.png">
 
 
106
 
107
 
108
 
109
  ### Read-aloud CoRal Performance
110
 
111
- | Model | Number of parameters | Finetuned on data of type | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) WER |
112
- | :----------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
113
- | CoRal-project/roest-wav2vec2-1B-v2 (This model) | 1B | Read-aloud and conversation | 6.5% ± 0.2% | 16.4% ± 0.4% |
114
- | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 6.5% ± 0.2% | 16.3% ± 0.4% |
115
- | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | **4.3% ± 0.2%** | **10.4% ± 0.3%** |
116
- | [CoRal-project/roest-wav2vec2-315M-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315M-v1) | 315M | Read-aloud | 6.6% ± 0.2% | 17.0% ± 0.4% |
117
- | [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 4.7% ± 0.2% | 11.8% ± 0.3% |
118
- | [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 11.4% ± 0.3% | 28.3% ± 0.6% |
 
 
119
 
120
  **Note:** The benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than reported in the model card.
121
 
122
- <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/comparison-read_aloud-cer.png">
123
 
124
- <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/comparison-read_aloud-wer.png">
125
 
126
 
127
  <details>
@@ -129,24 +134,26 @@ Note that the high generalization error on conversation data for models trained
129
  <b>Detailed CER scores in % of evaluation across demographics on the CoRal test data</b>
130
  </summary>
131
 
132
- | Category | Røst-whisper-large-v1 | Røst-wav2vec2-315m-v1 | Røst-wav2vec2-315m-v2 | Røst-wav2vec2-1B-v2 |
133
- |:---:|:---:|:---:|:---:|:---:|
134
- | female | 5.1 | 7.4 | 7.2 | 7.3 |
135
- | male | 3.6 | 5.8 | 5.7 | 5.8 |
136
- | 0-25 | 3.4 | 5.4 | 5.3 | 5.1 |
137
- | 25-50 | 4.0 | 6.2 | 6.0 | 5.7 |
138
- | 50+ | 5.0 | 7.5 | 7.4 | 7.8 |
139
- | Bornholmsk | 3.8 | 6.8 | 6.1 | 6.2 |
140
- | Fynsk | 5.1 | 7.4 | 7.2 | 6.9 |
141
- | Københavnsk | 1.9 | 3.3 | 3.2 | 3.0 |
142
- | Non-native | 4.8 | 7.8 | 7.5 | 7.3 |
143
- | Nordjysk | 1.6 | 2.6 | 2.8 | 2.6 |
144
- | Sjællandsk | 3.0 | 4.4 | 4.5 | 3.9 |
145
- | Sydømål | 4.1 | 6.4 | 6.4 | 6.5 |
146
- | Sønderjysk | 8.8 | 11.9 | 11.6 | 12.6 |
147
- | Vestjysk | 6.4 | 10.1 | 9.8 | 10.5 |
148
- | Østjysk | 2.6 | 4.0 | 4.1 | 3.8 |
149
- | Overall | 4.3 | 6.6 | 6.5 | 6.5 |
 
 
150
 
151
  </details>
152
 
@@ -155,24 +162,26 @@ Note that the high generalization error on conversation data for models trained
155
  <b>Detailed WER scores in % of evaluation across demographics on the CoRal test data</b>
156
  </summary>
157
 
158
- | Category | Røst-whisper-large-v1 | Røst-wav2vec2-315m-v1 | Røst-wav2vec2-315m-v2 | Røst-wav2vec2-1B-v2 |
159
- |:---:|:---:|:---:|:---:|:---:|
160
- | female | 11.5 | 18.5 | 17.7 | 17.8 |
161
- | male | 9.4 | 15.5 | 14.9 | 15.0 |
162
- | 0-25 | 9.0 | 14.7 | 14.0 | 13.7 |
163
- | 25-50 | 10.1 | 16.6 | 15.8 | 15.3 |
164
- | 50+ | 11.3 | 18.2 | 17.7 | 18.5 |
165
- | Bornholmsk | 9.8 | 17.7 | 15.7 | 16.4 |
166
- | Fynsk | 12.1 | 18.3 | 17.7 | 16.7 |
167
- | Københavnsk | 5.9 | 10.2 | 10.0 | 9.5 |
168
- | Non-native | 12.2 | 20.9 | 19.4 | 19.4 |
169
- | Nordjysk | 4.5 | 7.7 | 7.5 | 7.3 |
170
- | Sjællandsk | 7.6 | 12.6 | 12.7 | 11.0 |
171
- | Sydømål | 10.0 | 14.9 | 15.3 | 14.4 |
172
- | Sønderjysk | 17.5 | 26.0 | 25.4 | 27.8 |
173
- | Vestjysk | 15.0 | 26.3 | 25.2 | 26.7 |
174
- | Østjysk | 7.5 | 11.7 | 11.3 | 10.8 |
175
- | Overall | 10.4 | 17.0 | 16.3 | 16.4 |
 
 
176
 
177
  </details>
178
 
@@ -183,14 +192,17 @@ Note that the high generalization error on conversation data for models trained
183
 
184
  Including a post-processing language model (LM) can affect performance significantly. The Røst-v1 and Røst-v2 models use the same LM, namely the one trained for and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1).
185
 
186
- | Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.com/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
187
- | :-------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
188
- | CoRal-project/roest-wav2vec2-1B-v2 (This model) | 1B | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.4% ± 0.4%** |
189
- | CoRal-project/roest-wav2vec2-1B-v2 | 1B | Read-aloud and conversation | No | 8.1% ± 0.2% | 23.9% ± 0.4% |
190
- | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.3% ± 0.4%** |
191
- | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | No | 8.2% ± 0.2% | 25.1% ± 0.4% |
192
- | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | Yes | 6.6% ± 0.2% | 17.0% ± 0.4% |
193
- | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | No | 8.6% ± 0.2% | 26.3% ± 0.5% |
 
 
 
194
 
195
  </details>
196
 
@@ -198,13 +210,17 @@ Note that the high generalization error on conversation data for models trained
198
  ### Performance on Other Datasets
199
 
200
  The model was also tested against other datasets to evaluate generalizability:
201
- | | **Røst-whisper-large-v1** | | **Røst-wav2vec2-315M-v1** | | **Røst-wav2vec2-315M-v2** | | **Røst-wav2vec2-1B-v2** | |
202
- | ------------------------------------------------------------------------------------- | -------------------------- | ------- | -------------------------- | ----- | -------------------------- | ------- | ------------------------ | ------- |
203
- | **Evaluation Dataset** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** |
204
- | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) | **10.4** | **4.3** | 17.0 | 6.6 | 16.3 | 6.5 | 16.4 | **6.5** |
205
- | [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.8 | 14.5 | 29.7 | 13.9 | 28.4 | 12.4 | **12.4** | **4.9** |
206
- | [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6 | 8.2 | 16.7 | 6.6 | **14.4** | **5.4** | 26.3 | 10.9 |
207
- | [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | **12.6** | **5.1** | 16.6 | 6.3 | 15.6 | 6.1 | 13.7 | 5.5 |
 
 
 
 
208
 
209
  **Note:** The vocabulary used for training includes numerals (0, 1, 2, ..., 9), which are converted to text in a post-processing step. If the model misses the spaces between numerals, adjacent digits are interpreted as a single number, which especially affects the NST score because that dataset contains many numerals.
210
 
 
91
 
92
  Note that the high generalization error on conversation data for models trained on read-aloud data is still being analyzed.
93
 
94
+
95
  | Model | Number of parameters | Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
96
  | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | ------------------------------------------------------------------------------------------------------------: | ------------------------------------------------------------------------------------------------------------: |
97
+ | [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2) | 2B | Read-aloud and conversation | **23.6%** | **34.3%** |
98
+ | CoRal-project/roest-wav2vec2-1B-v2 (This model) | 1B | Read-aloud and conversation | 23.9% | 36.7% |
99
+ | [CoRal-project/roest-wav2vec2-315m-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 24.2% | 37.7% |
100
+ | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | 138% | 121% |
101
+ | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | 123% | 80.5% |
102
+ | [syvai/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 78.2% | 72.6% |
103
  | [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 46.4% | 57.4% |
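
Several rows in the conversation table report error rates above 100%. That is possible because both CER and WER divide the number of edits (substitutions, deletions and insertions) by the length of the reference, so a hypothesis with many inserted tokens can need more edits than the reference has tokens. The sketch below illustrates this with a plain Levenshtein distance. The function names and the example strings are hypothetical, and the text normalization and scoring actually used for the tables are not shown in this diff.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Made-up example: the hypothesis repeats words, which adds four insertions.
reference = "ja tak"
hypothesis = "ja ja ja tak tak tak"
print(f"WER: {wer(reference, hypothesis):.0%}")  # WER: 200%
```

In this made-up example, four inserted words against a two-word reference give a WER of 200%, even though every reference word is present in the hypothesis.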
104
 
 
105
 
106
+ <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/cer_comparison-conv.png">
107
+
108
+ <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/wer_comparison-conv.png">
109
 
110
 
111
 
112
  ### Read-aloud CoRal Performance
113
 
114
+
115
+ | Model | Number of parameters | Finetuned on data of type | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) WER |
116
+ | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
117
+ | [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2) | 2B | Read-aloud and conversation | 6.2% ± 0.2% | 16.0% ± 0.4% |
118
+ | CoRal-project/roest-wav2vec2-1B-v2 (This model) | 1B | Read-aloud and conversation | 6.5% ± 0.2% | 16.4% ± 0.4% |
119
+ | [CoRal-project/roest-wav2vec2-315m-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 6.5% ± 0.2% | 16.3% ± 0.4% |
120
+ | [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | **4.3% ± 0.2%** | **10.4% ± 0.3%** |
121
+ | [CoRal-project/roest-wav2vec2-315M-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315M-v1) | 315M | Read-aloud | 6.6% ± 0.2% | 17.0% ± 0.4% |
122
+ | [syvai/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 4.7% ± 0.2% | 11.8% ± 0.3% |
123
+ | [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 11.4% ± 0.3% | 28.3% ± 0.6% |
124
 
125
  **Note:** The benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than reported in the model card.
126
 
127
+ <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/cer_comparison-read-aloud.png">
128
 
129
+ <img src="https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2/resolve/main/images/wer_comparison-read-aloud.png">
130
 
131
 
132
  <details>
 
134
  <b>Detailed CER scores in % of evaluation across demographics on the CoRal test data</b>
135
  </summary>
136
 
137
+
138
+ | Category | whisper-large-v3 | hviske-v2 | røst-whisper-large-v1 | røst-wav2vec2-315m-v1 | røst-wav2vec2-315m-v2 | røst-wav2vec2-1B-v2 | røst-wav2vec2-2B-v2 |
139
+ | :---------: | :--------------: | :-------: | :-------------------: | :-------------------: | :-------------------: | :-----------------: | :-----------------: |
140
+ | female | 12.3 | 5.4 | 5.1 | 7.4 | 7.2 | 7.3 | 7.2 |
141
+ | male | 10.6 | 4.1 | 3.6 | 5.8 | 5.7 | 5.8 | 5.3 |
142
+ | 0-25 | 9.1 | 3.8 | 3.4 | 5.4 | 5.3 | 5.1 | 4.7 |
143
+ | 25-50 | 11.4 | 4.7 | 4.0 | 6.2 | 6.0 | 5.7 | 5.3 |
144
+ | 50+ | 12.4 | 5.2 | 5.0 | 7.5 | 7.4 | 7.8 | 7.7 |
145
+ | Bornholmsk | 12.1 | 3.8 | 3.8 | 6.8 | 6.1 | 6.2 | 5.7 |
146
+ | Fynsk | 12.0 | 5.9 | 5.1 | 7.4 | 7.2 | 6.9 | 6.1 |
147
+ | Københavnsk | 5.6 | 2.1 | 1.9 | 3.3 | 3.2 | 3.0 | 2.6 |
148
+ | Non-native | 17.4 | 5.9 | 4.8 | 7.8 | 7.5 | 7.3 | 6.6 |
149
+ | Nordjysk | 4.7 | 1.5 | 1.6 | 2.6 | 2.8 | 2.6 | 2.3 |
150
+ | Sjællandsk | 8.0 | 3.3 | 3.0 | 4.4 | 4.5 | 3.9 | 3.8 |
151
+ | Sydømål | 7.7 | 4.3 | 4.1 | 6.4 | 6.4 | 6.5 | 5.8 |
152
+ | Sønderjysk | 20.0 | 9.4 | 8.8 | 11.9 | 11.6 | 12.6 | 13.3 |
153
+ | Vestjysk | 17.6 | 7.2 | 6.4 | 10.1 | 9.8 | 10.5 | 10.8 |
154
+ | Østjysk | 5.9 | 2.9 | 2.6 | 4.0 | 4.1 | 3.8 | 3.5 |
155
+ | Overall | 11.4 | 4.7 | 4.3 | 6.6 | 6.5 | 6.5 | 6.2 |
156
+
157
 
158
  </details>
159
 
 
162
  <b>Detailed WER scores in % of evaluation across demographics on the CoRal test data</b>
163
  </summary>
164
 
165
+
166
+ | Category | whisper-large-v3 | hviske-v2 | røst-whisper-large-v1 | røst-wav2vec2-315m-v1 | røst-wav2vec2-315m-v2 | røst-wav2vec2-1B-v2 | røst-wav2vec2-2B-v2 |
167
+ | :---------: | :--------------: | :-------: | :-------------------: | :-------------------: | :-------------------: | :-----------------: | :-----------------: |
168
+ | female | 30.2 | 12.7 | 11.5 | 18.5 | 17.7 | 17.8 | 17.8 |
169
+ | male | 26.5 | 10.9 | 9.4 | 15.5 | 14.9 | 15.0 | 14.3 |
170
+ | 0-25 | 24.1 | 10.3 | 9.0 | 14.7 | 14.0 | 13.7 | 12.9 |
171
+ | 25-50 | 28.4 | 12.2 | 10.1 | 16.6 | 15.8 | 15.3 | 14.5 |
172
+ | 50+ | 30.0 | 12.1 | 11.3 | 18.2 | 17.7 | 18.5 | 18.7 |
173
+ | Bornholmsk | 31.6 | 10.4 | 9.8 | 17.7 | 15.7 | 16.4 | 15.3 |
174
+ | Fynsk | 29.3 | 14.3 | 12.1 | 18.3 | 17.7 | 16.7 | 15.2 |
175
+ | Københavnsk | 16.8 | 6.7 | 5.9 | 10.2 | 10.0 | 9.5 | 8.4 |
176
+ | Non-native | 40.9 | 15.4 | 12.2 | 20.9 | 19.4 | 19.4 | 18.1 |
177
+ | Nordjysk | 13.5 | 4.3 | 4.5 | 7.7 | 7.5 | 7.3 | 6.9 |
178
+ | Sjællandsk | 21.7 | 8.9 | 7.6 | 12.6 | 12.7 | 11.0 | 10.5 |
179
+ | Sydømål | 19.2 | 10.4 | 10.0 | 14.9 | 15.3 | 14.4 | 13.7 |
180
+ | Sønderjysk | 44.3 | 19.0 | 17.5 | 26.0 | 25.4 | 27.8 | 29.6 |
181
+ | Vestjysk | 42.0 | 17.7 | 15.0 | 26.3 | 25.2 | 26.7 | 28.3 |
182
+ | Østjysk | 16.9 | 8.2 | 7.5 | 11.7 | 11.3 | 10.8 | 10.1 |
183
+ | Overall | 28.3 | 11.8 | 10.4 | 17.0 | 16.3 | 16.4 | 16.0 |
184
+
185
 
186
  </details>
187
 
 
192
 
193
  Including a post-processing language model (LM) can affect performance significantly. The Røst-v1 and Røst-v2 models use the same LM, namely the one trained for and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1).
194
 
195
+
196
+ | Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.com/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
197
+ | :-------------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | ---------------------------------------------------------------------------------------: |
198
+ | [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2) | 2B | Read-aloud and conversation | Yes | **6.2% ± 0.2%** | **16.0% ± 0.4%** |
199
+ | [CoRal-project/roest-wav2vec2-2B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2) | 2B | Read-aloud and conversation | No | 7.8% ± 0.2% | 23.0% ± 0.4% |
200
+ | CoRal-project/roest-wav2vec2-1B-v2 | 1B | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.4% ± 0.4%** |
201
+ | CoRal-project/roest-wav2vec2-1B-v2 | 1B | Read-aloud and conversation | No | 8.1% ± 0.2% | 23.9% ± 0.4% |
202
+ | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.3% ± 0.4%** |
203
+ | [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | No | 8.2% ± 0.2% | 25.1% ± 0.4% |
204
+ | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | Yes | 6.6% ± 0.2% | 17.0% ± 0.4% |
205
+ | [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | No | 8.6% ± 0.2% | 26.3% ± 0.5% |
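
To put the size of the LM effect in the table above into one number per model, the relative WER reduction can be computed from the "No" and "Yes" rows. The snippet below is a small arithmetic sketch. The WER values are copied from the table, while the dictionary layout and the function name are illustrative choices.

```python
# (WER without LM, WER with LM) in percent, copied from the table above.
wer_by_model = {
    "roest-wav2vec2-2B-v2": (23.0, 16.0),
    "roest-wav2vec2-1B-v2": (23.9, 16.4),
    "roest-wav2vec2-315m-v2": (25.1, 16.3),
    "roest-wav2vec2-315m-v1": (26.3, 17.0),
}


def relative_reduction(without_lm, with_lm):
    """Fraction of the no-LM error that is removed by adding the LM."""
    return (without_lm - with_lm) / without_lm


for model, (without_lm, with_lm) in wer_by_model.items():
    reduction = relative_reduction(without_lm, with_lm)
    print(f"{model}: {reduction:.1%} relative WER reduction")
```

For these four models the relative reduction lies between roughly 30% and 35%. The figures ignore the reported confidence intervals, so small differences between models should not be over-interpreted.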
206
 
207
  </details>
208
 
 
210
  ### Performance on Other Datasets
211
 
212
  The model was also tested against other datasets to evaluate generalizability:
213
+
214
+ | | **Røst-whisper-large-v1** | | **Røst-wav2vec2-315M-v1** | | **Røst-wav2vec2-315M-v2** | | **Røst-wav2vec2-1B-v2** | | **Røst-wav2vec2-2B-v2** | |
215
+ | ------------------------------------------------------------------------------------- | ------------------------- | --------- | ------------------------- | --------- | ------------------------- | --------- | ----------------------- | --------- | ----------------------- | --------- |
216
+ | **Evaluation Dataset** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** |
217
+ | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) | **10.4** | **4.3** | 17.0 | 6.6 | 16.3 | 6.5 | 16.4 | 6.5 | 16.0 | 6.2 |
218
+ | [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.8 | 14.5 | 29.7 | 13.9 | 28.4 | 12.4 | 27.7 | 11.9 | **27.0** | **11.7** |
219
+ | [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6 | 8.2 | 16.7 | 6.6 | 14.4 | 5.4 | 26.3 | 10.9 | **12.0** | **4.5** |
220
+ | [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | 12.6 | **5.1** | 16.6 | 6.3 | 15.6 | 6.1 | 13.7 | 5.5 | **12.5** | **5.1** |
221
+ | [AlvenirOss](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 9.2 | 3.9 | 14.8 | 6.0 | 11.3 | 4.4 | 9.1 | 3.6 | **8.1** | **3.1** |
222
+ | [AlvenirWiki](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 7.5 | 2.8 | 7.9 | 3.0 | 8.0 | 3.0 | 7.2 | 2.7 | **6.5** | **2.4** |
223
+
224
 
225
  **Note:** The vocabulary used for training includes numerals (0, 1, 2, ..., 9), which are converted to text in a post-processing step. If the model misses the spaces between numerals, adjacent digits are interpreted as a single number, which especially affects the NST score because that dataset contains many numerals.
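
The effect described in the note can be illustrated with a toy post-processing step. This is a minimal sketch under stated assumptions: `NUMBER_WORDS` and `numerals_to_words` are hypothetical stand-ins for a full Danish number-to-words converter, and they are not the post-processing code used for the models.

```python
# Hypothetical lookup standing in for a full Danish number-to-words converter.
NUMBER_WORDS = {"2": "to", "3": "tre", "23": "treogtyve"}


def numerals_to_words(transcript):
    """Replace each whitespace-separated numeral token with its word form."""
    return " ".join(NUMBER_WORDS.get(token, token) for token in transcript.split())


# Space between the digits is kept: each digit becomes its own word.
print(numerals_to_words("nummer 2 3"))  # nummer to tre

# Space is missed: the digits merge into a single, different number.
print(numerals_to_words("nummer 23"))   # nummer treogtyve
```

Against the reference "nummer to tre", the merged output "nummer treogtyve" counts as one substitution and one deletion, so a single missed space makes two of the three reference words wrong.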
226
 
images/cer.png DELETED
Binary file (66.5 kB)
 
images/cer_comparison-conv.png ADDED
images/cer_comparison-read-aloud.png ADDED
images/comparison-conversation-cer.png DELETED
Binary file (55.7 kB)
 
images/comparison-conversation-wer.png DELETED
Binary file (57.2 kB)
 
images/comparison-read_aloud-cer.png DELETED
Binary file (76.3 kB)
 
images/comparison-read_aloud-wer.png DELETED
Binary file (69.1 kB)
 
images/wer.png DELETED
Binary file (66.4 kB)
 
images/wer_comparison-conv.png ADDED
images/wer_comparison-read-aloud.png ADDED