MarieAlvenir committed 4a4a959 (parent: b9df131): Initial commit of model and readme
README.md ADDED
---
datasets:
- CoRal-project/coral-v2
language:
- da
base_model:
- facebook/wav2vec2-xls-r-2b
metrics:
- wer
- cer
license: openrail
pipeline_tag: automatic-speech-recognition
model-index:
- name: roest-wav2vec2-2B-v2
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: CoRal read-aloud
      type: CoRal-project/coral
      split: test
      args: read_aloud
    metrics:
    - type: cer
      value: 6.2% ± 0.2%
      name: CER
    - type: wer
      value: 16.0% ± 0.4%
      name: WER
---

# Røst-wav2vec2-2B-v2

This is a Danish state-of-the-art speech recognition model, trained as part of the CoRal project by [Alvenir](https://www.alvenir.ai/).

This repository contains a Wav2Vec2 model trained on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main). The CoRal-v2 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects. The model is designed for automatic speech recognition (ASR).

## Quick Start

Start by installing the required libraries:

```shell
$ pip install transformers kenlm pyctcdecode
```

Next, you can use the model via the `transformers` Python package as follows:

```python
>>> from transformers import pipeline
>>> audio = get_audio()  # 16 kHz raw audio array
>>> transcriber = pipeline(model="CoRal-project/roest-wav2vec2-2B-v2")
>>> transcriber(audio)
{'text': 'your transcription'}
```
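
The `get_audio()` call above is a placeholder: the pipeline expects a mono waveform sampled at 16 kHz. If your recording uses another sample rate, resample it first. A minimal numpy-only sketch (linear interpolation, illustrative only; a proper polyphase resampler such as `scipy.signal.resample_poly` is preferable in practice):

```python
import numpy as np

def resample_to_16k(audio: np.ndarray, source_rate: int) -> np.ndarray:
    """Linearly interpolate a mono signal to 16 kHz (rough sketch)."""
    target_rate = 16_000
    duration = len(audio) / source_rate
    n_target = int(round(duration * target_rate))
    old_times = np.arange(len(audio)) / source_rate
    new_times = np.arange(n_target) / target_rate
    return np.interp(new_times, old_times, audio)

# Example: a 1-second 440 Hz tone recorded at 44.1 kHz
tone = np.sin(2 * np.pi * 440 * np.arange(44_100) / 44_100)
audio_16k = resample_to_16k(tone, 44_100)
print(len(audio_16k))  # 16000
```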

---

## Model Details

Wav2Vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained [wav2vec2-xls-r-2b](https://huggingface.co/facebook/wav2vec2-xls-r-2b) model has been fine-tuned for automatic speech recognition on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main) to enhance its performance in recognizing Danish speech across different dialects. The model was trained for 30K steps using the training setup in the [CoRal repository](https://github.com/alexandrainst/coral/tree) by running:

```bash
python src/scripts/finetune_asr_model.py \
    model=wav2vec2-large \
    max_steps=30000 \
    datasets.coral_conversation_internal.id=CoRal-project/coral-v2 \
    datasets.coral_readaloud_internal.id=CoRal-project/coral-v2
```

The model is evaluated with a language model (LM) used for post-processing. The LM is the one trained and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1).
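
Without an LM, the raw CTC output of a wav2vec2 model is typically decoded greedily: take the argmax token per frame, collapse consecutive repeats, and drop blanks. A minimal illustrative sketch (the toy vocabulary and one-hot logits are made up for demonstration):

```python
import numpy as np

def greedy_ctc_decode(logits: np.ndarray, vocab: list, blank_id: int = 0) -> str:
    """Greedy CTC decoding: collapse repeated frame predictions, remove blanks."""
    ids = logits.argmax(axis=-1)  # best token per frame
    out, prev = [], None
    for i in ids:
        if i != prev and i != blank_id:
            out.append(vocab[int(i)])
        prev = i
    return "".join(out)

# Toy example: 6 frames over the vocabulary [<blank>, 'h', 'e', 'j']
vocab = ["<blank>", "h", "e", "j"]
frame_ids = [1, 1, 0, 2, 3, 3]
logits = np.eye(len(vocab))[frame_ids]   # one-hot "logits" for clarity
print(greedy_ctc_decode(logits, vocab))  # hej
```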

The model was trained on the [CoRal-v2](https://huggingface.co/datasets/CoRal-project/coral-v2/tree/main) dataset, including both the conversational and the read-aloud subsets. This dataset consists of Danish speech across a variety of dialects, age groups, and genders. Note that the dataset is licensed under a custom license, adapted from OpenRAIL-M, which allows commercial use with a few restrictions (no speech synthesis or biometric identification). See the [license](https://huggingface.co/Alvenir/coral-1-whisper-large/blob/main/LICENSE).

---

## Evaluation

The model was evaluated using the following metrics:

- **Character Error Rate (CER)**: the percentage of characters incorrectly transcribed.
- **Word Error Rate (WER)**: the percentage of words incorrectly transcribed.
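
Both metrics are edit distances normalized by the reference length; WER operates on words, CER on characters. A minimal sketch of the computation (not the evaluation script actually used, which presumably also handles text normalization and casing):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    m, n = len(ref), len(hyp)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def wer(ref: str, hyp: str) -> float:
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

def cer(ref: str, hyp: str) -> float:
    return edit_distance(list(ref), list(hyp)) / len(ref)

ref = "det er en dejlig dag"
hyp = "det er en dejlig dags"
print(f"WER: {wer(ref, hyp):.2f}")  # 1 wrong word out of 5 -> 0.20
print(f"CER: {cer(ref, hyp):.2f}")  # 1 edit over 20 chars  -> 0.05
```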

### Conversational CoRal Performance

The model was first evaluated on a tentative version of the CoRal-v2 conversation test set.

The results are tentative, as the test set only includes 5 unique speakers, of whom 4 are women. The test set covers 2 speakers with the 'Fynsk' dialect, 1 with 'Sønderjysk', 1 with 'Non-native' and 1 with 'Nordjysk'.

Note that the high generalization error on conversational data for models trained on read-aloud data is still being analyzed.

| Model | Number of parameters | Finetuned on data of type | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) CER | [CoRal-v2::conversation](https://huggingface.co/datasets/CoRal-project/coral-v2/viewer/conversation/test) WER |
| :--- | ---: | ---: | ---: | ---: |
| CoRal-project/roest-wav2vec2-2B-v2 (this model) | 2B | Read-aloud and conversation | **23.6%** | **34.3%** |
| [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | 23.9% | 36.7% |
| [CoRal-project/roest-wav2vec2-315m-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 24.2% | 37.7% |
| [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | 138% | 121% |
| [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | 123% | 80.5% |
| [syvai/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 78.2% | 72.6% |
| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 46.4% | 57.4% |

<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2/resolve/main/images/cer_comparison-conv.png">

<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2/resolve/main/images/wer_comparison-conv.png">

### Read-aloud CoRal Performance

| Model | Number of parameters | Finetuned on data of type | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) WER |
| :--- | ---: | ---: | ---: | ---: |
| CoRal-project/roest-wav2vec2-2B-v2 (this model) | 2B | Read-aloud and conversation | 6.2% ± 0.2% | 16.0% ± 0.4% |
| [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | 6.5% ± 0.2% | 16.4% ± 0.4% |
| [CoRal-project/roest-wav2vec2-315m-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 6.5% ± 0.2% | 16.3% ± 0.4% |
| [CoRal-project/roest-whisper-large-v1](https://huggingface.co/CoRal-project/roest-whisper-large-v1) | 1540M | Read-aloud | **4.3% ± 0.2%** | **10.4% ± 0.3%** |
| [CoRal-project/roest-wav2vec2-315M-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315M-v1) | 315M | Read-aloud | 6.6% ± 0.2% | 17.0% ± 0.4% |
| [syvai/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 4.7% ± 0.2% | 11.8% ± 0.3% |
| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 11.4% ± 0.3% | 28.3% ± 0.6% |

**OBS!** The hviske-v2 benchmark has been re-evaluated, and the confidence interval is larger than reported in its model card.
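
The ± figures above are confidence intervals. The exact procedure used to compute them is not described here, but a simple 95% normal-approximation interval over per-utterance error rates illustrates the idea (the data below is synthetic, for illustration only):

```python
import math
import random

def mean_ci95(values):
    """Mean with a 95% normal-approximation confidence half-width."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    half_width = 1.96 * math.sqrt(var / n)
    return mean, half_width

# Synthetic per-utterance WERs centred near the reported 16.0%
random.seed(0)
wers = [max(0.0, random.gauss(0.16, 0.10)) for _ in range(1000)]
mean, hw = mean_ci95(wers)
print(f"WER: {mean:.1%} ± {hw:.1%}")
```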

<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2/resolve/main/images/cer_comparison-read-aloud.png">

<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2/resolve/main/images/wer_comparison-read-aloud.png">

<details>
<summary>
<b>Detailed CER scores in % of evaluation across demographics on the CoRal test data</b>
</summary>

| Category | whisper-large-v3 | hviske-v2 | røst-whisper-large-v1 | røst-wav2vec2-315m-v1 | røst-wav2vec2-315m-v2 | røst-wav2vec2-1B-v2 | røst-wav2vec2-2B-v2 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| female | 12.3 | 5.4 | 5.1 | 7.4 | 7.2 | 7.3 | 7.2 |
| male | 10.6 | 4.1 | 3.6 | 5.8 | 5.7 | 5.8 | 5.3 |
| 0-25 | 9.1 | 3.8 | 3.4 | 5.4 | 5.3 | 5.1 | 4.7 |
| 25-50 | 11.4 | 4.7 | 4.0 | 6.2 | 6.0 | 5.7 | 5.3 |
| 50+ | 12.4 | 5.2 | 5.0 | 7.5 | 7.4 | 7.8 | 7.7 |
| Bornholmsk | 12.1 | 3.8 | 3.8 | 6.8 | 6.1 | 6.2 | 5.7 |
| Fynsk | 12.0 | 5.9 | 5.1 | 7.4 | 7.2 | 6.9 | 6.1 |
| Københavnsk | 5.6 | 2.1 | 1.9 | 3.3 | 3.2 | 3.0 | 2.6 |
| Non-native | 17.4 | 5.9 | 4.8 | 7.8 | 7.5 | 7.3 | 6.6 |
| Nordjysk | 4.7 | 1.5 | 1.6 | 2.6 | 2.8 | 2.6 | 2.3 |
| Sjællandsk | 8.0 | 3.3 | 3.0 | 4.4 | 4.5 | 3.9 | 3.8 |
| Sydømål | 7.7 | 4.3 | 4.1 | 6.4 | 6.4 | 6.5 | 5.8 |
| Sønderjysk | 20.0 | 9.4 | 8.8 | 11.9 | 11.6 | 12.6 | 13.3 |
| Vestjysk | 17.6 | 7.2 | 6.4 | 10.1 | 9.8 | 10.5 | 10.8 |
| Østjysk | 5.9 | 2.9 | 2.6 | 4.0 | 4.1 | 3.8 | 3.5 |
| Overall | 11.4 | 4.7 | 4.3 | 6.6 | 6.5 | 6.5 | 6.2 |

</details>

<details>
<summary>
<b>Detailed WER scores in % of evaluation across demographics on the CoRal test data</b>
</summary>

| Category | whisper-large-v3 | hviske-v2 | røst-whisper-large-v1 | røst-wav2vec2-315m-v1 | røst-wav2vec2-315m-v2 | røst-wav2vec2-1B-v2 | røst-wav2vec2-2B-v2 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| female | 30.2 | 12.7 | 11.5 | 18.5 | 17.7 | 17.8 | 17.8 |
| male | 26.5 | 10.9 | 9.4 | 15.5 | 14.9 | 15.0 | 14.3 |
| 0-25 | 24.1 | 10.3 | 9.0 | 14.7 | 14.0 | 13.7 | 12.9 |
| 25-50 | 28.4 | 12.2 | 10.1 | 16.6 | 15.8 | 15.3 | 14.5 |
| 50+ | 30.0 | 12.1 | 11.3 | 18.2 | 17.7 | 18.5 | 18.7 |
| Bornholmsk | 31.6 | 10.4 | 9.8 | 17.7 | 15.7 | 16.4 | 15.3 |
| Fynsk | 29.3 | 14.3 | 12.1 | 18.3 | 17.7 | 16.7 | 15.2 |
| Københavnsk | 16.8 | 6.7 | 5.9 | 10.2 | 10.0 | 9.5 | 8.4 |
| Non-native | 40.9 | 15.4 | 12.2 | 20.9 | 19.4 | 19.4 | 18.1 |
| Nordjysk | 13.5 | 4.3 | 4.5 | 7.7 | 7.5 | 7.3 | 6.9 |
| Sjællandsk | 21.7 | 8.9 | 7.6 | 12.6 | 12.7 | 11.0 | 10.5 |
| Sydømål | 19.2 | 10.4 | 10.0 | 14.9 | 15.3 | 14.4 | 13.7 |
| Sønderjysk | 44.3 | 19.0 | 17.5 | 26.0 | 25.4 | 27.8 | 29.6 |
| Vestjysk | 42.0 | 17.7 | 15.0 | 26.3 | 25.2 | 26.7 | 28.3 |
| Østjysk | 16.9 | 8.2 | 7.5 | 11.7 | 11.3 | 10.8 | 10.1 |
| Overall | 28.3 | 11.8 | 10.4 | 17.0 | 16.3 | 16.4 | 16.0 |

</details>

<details>
<summary>
<b>Experiments with Røst-wav2vec2 with and without language model</b>
</summary>

The inclusion of a post-processing language model can affect performance significantly. The Røst-v1 and Røst-v2 models use the same language model (LM): the one trained and used by [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1).

| Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
| :--- | ---: | ---: | ---: | ---: | ---: |
| CoRal-project/roest-wav2vec2-2B-v2 | 2B | Read-aloud and conversation | Yes | **6.2% ± 0.2%** | **16.0% ± 0.4%** |
| CoRal-project/roest-wav2vec2-2B-v2 | 2B | Read-aloud and conversation | No | 7.8% ± 0.2% | 23.0% ± 0.4% |
| [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.4% ± 0.4%** |
| [CoRal-project/roest-wav2vec2-1B-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-1B-v2) | 1B | Read-aloud and conversation | No | 8.1% ± 0.2% | 23.9% ± 0.4% |
| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.3% ± 0.4%** |
| [CoRal-project/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | No | 8.2% ± 0.2% | 25.1% ± 0.4% |
| [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | Yes | 6.6% ± 0.2% | 17.0% ± 0.4% |
| [CoRal-project/roest-wav2vec2-315m-v1](https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v1) | 315M | Read-aloud | No | 8.6% ± 0.2% | 26.3% ± 0.5% |

</details>

### Performance on Other Datasets

The model was also tested on other datasets to evaluate generalizability:

| | **Røst-whisper-large-v1** | | **Røst-wav2vec2-315M-v1** | | **Røst-wav2vec2-315M-v2** | | **Røst-wav2vec2-1B-v2** | | **Røst-wav2vec2-2B-v2** | |
| :--- | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: |
| **Evaluation Dataset** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** | **WER %** | **CER %** |
| [CoRal](https://huggingface.co/datasets/CoRal-project/coral/viewer/read_aloud/test) | **10.4** | **4.3** | 17.0 | 6.6 | 16.3 | 6.5 | 16.4 | 6.5 | 16.0 | 6.2 |
| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.8 | 14.5 | 29.7 | 13.9 | 28.4 | 12.4 | 27.7 | 11.9 | **27.0** | **11.7** |
| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 15.6 | 8.2 | 16.7 | 6.6 | 14.4 | 5.4 | 26.3 | 10.9 | **12.0** | **4.5** |
| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | 12.6 | **5.1** | 16.6 | 6.3 | 15.6 | 6.1 | 13.7 | 5.5 | **12.5** | **5.1** |
| [AppenOss](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 9.2 | 3.9 | 14.8 | 6.0 | 11.3 | 4.4 | 9.1 | 3.6 | **8.1** | **3.1** |
| [AppenWiki](https://huggingface.co/datasets/Alvenir/alvenir_asr_da_eval) | 7.5 | 2.8 | 7.9 | 3.0 | 8.0 | 3.0 | 7.2 | 2.7 | **6.5** | **2.4** |

**OBS!** The vocabulary used for training includes the numerals 0–9, which are translated to text in a post-processing step. If the model omits a space between digits, they are interpreted as a single number; this especially affects the NST score, as that dataset contains many numerals.
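
The digit-to-text step mentioned above belongs to post-processing, not the model itself. A hypothetical sketch of a digit-wise mapping to Danish number words (the actual post-processing is likely more elaborate, e.g. converting multi-digit numbers as a whole):

```python
# Hypothetical digit-wise mapping; the real post-processing presumably
# converts whole numbers (e.g. "42" -> "toogfyrre") rather than digit by digit.
DANISH_DIGITS = {
    "0": "nul", "1": "en", "2": "to", "3": "tre", "4": "fire",
    "5": "fem", "6": "seks", "7": "syv", "8": "otte", "9": "ni",
}

def spell_out_digits(text: str) -> str:
    """Replace standalone digit tokens with Danish number words."""
    return " ".join(DANISH_DIGITS.get(tok, tok) for tok in text.split())

print(spell_out_digits("klokken er 8"))  # klokken er otte
# A missed space fuses digits into one token, which this mapping cannot handle:
print(spell_out_digits("nummer 42"))     # nummer 42
```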

---

### Note on comparing Whisper and wav2vec2 models

The Whisper models in this model card exhibit significantly lower character and word error rates than the Wav2Vec2 models on read-aloud data. Whisper uses a transformer-based architecture with additional layers that enhance contextual understanding, whereas Wav2Vec2 models employ shorter context windows focused on sound prediction. The Røst-wav2vec2 models incorporate a straightforward language model during post-processing, which corrects errors based on statistical language patterns. Introducing a more complex, contextual post-processing language model might enable a fairer comparison between these model types, which the CoRal project plans to explore in future releases.

The Røst-Whisper model excels on read-aloud data, leveraging its embedded contextual framework for more robust recognition in that setting. However, the Wav2Vec2 models appear to generalize more effectively across various speech recognition tasks, whereas the Whisper models incur higher error rates on conversational data. Note that the CoRal-v2 conversation test set, being tentative and featuring limited speaker diversity, may influence these results.

---

## Training curves

<img src="https://huggingface.co/CoRal-project/roest-wav2vec2-315m-v2/resolve/main/images/training_plots.png">

---

## Creators and Funders

This model was trained, and this model card written, by Marie Juhl Jørgensen at [Alvenir](https://www.alvenir.ai/).

The CoRal project is funded by the [Danish Innovation Fund](https://innovationsfonden.dk/) and consists of the following partners:

- [Alexandra Institute](https://alexandra.dk/)
- [University of Copenhagen](https://www.ku.dk/)
- [Agency for Digital Government](https://digst.dk/)
- [Alvenir](https://www.alvenir.ai/)
- [Corti](https://www.corti.ai/)

We would specifically like to thank [Dan Saattrup Nielsen](https://huggingface.co/saattrupdan) ([Alexandra Institute](https://alexandra.dk/)) for (among other things) the repository work, and [Simon Leminen Madsen](https://huggingface.co/Leminen) ([Alexandra Institute](https://alexandra.dk/)) for the modelling work.

## Citation

```bibtex
@misc{roest-wav2vec2-2B-v2,
  author = {Marie Juhl Jørgensen and Søren Vejlgaard Holm and Martin Carsten Nielsen and Dan Saattrup Nielsen and Sif Bernstorff Lehmann and Simon Leminen Madsen and Torben Blach},
  title  = {Røst-wav2vec-2B-v2: A Danish state-of-the-art speech recognition model trained on varied demographics and dialects},
  year   = {2025},
  url    = {https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2},
}
```
added_tokens.json ADDED

```json
{
  "</s>": 43,
  "<pad>": 45,
  "<s>": 42,
  "<unk>": 44
}
```
alphabet.json ADDED
File without changes
config.json ADDED
File without changes
images/cer_comparison-conv.png ADDED
images/cer_comparison-read-aloud.png ADDED
images/wer_comparison-conv.png ADDED
images/wer_comparison-read-aloud.png ADDED
language_model/3gram.bin ADDED

```
version https://git-lfs.github.com/spec/v1
oid sha256:3ec877a2f9dad4e51bfcbdd0e32884b64a7662f722c7f37c77ea91dc3dea65db
size 750711338
```

language_model/attrs.json ADDED

```
version https://git-lfs.github.com/spec/v1
oid sha256:f5ffd02e1ceef6517476e72ebe7997ddef7e92d27cb5a23d6695d64c4317d6ad
size 78
```

language_model/unigrams.txt ADDED

```
version https://git-lfs.github.com/spec/v1
oid sha256:683060ef402a6d88def5dc3ff15518b4d44e50ccb7ac12aad81a258d88fb5a72
size 29668511
```
model-00001-of-00002.safetensors ADDED
File without changes
model-00002-of-00002.safetensors ADDED
File without changes
model.safetensors.index.json ADDED
File without changes
preprocessor_config.json ADDED
File without changes
special_tokens_map.json ADDED

```json
{
  "bos_token": {
    "content": "<s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "</s>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "<pad>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "<unk>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
```
tokenizer_config.json ADDED

```json
{
  "added_tokens_decoder": {
    "42": {
      "content": "<s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "43": {
      "content": "</s>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "44": {
      "content": "<unk>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "45": {
      "content": "<pad>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "bos_token": "<s>",
  "clean_up_tokenization_spaces": false,
  "do_lower_case": false,
  "eos_token": "</s>",
  "extra_special_tokens": {},
  "model_max_length": 512,
  "pad_token": "<pad>",
  "processor_class": "Wav2Vec2Processor",
  "replace_word_delimiter_char": " ",
  "target_lang": null,
  "tokenizer_class": "Wav2Vec2CTCTokenizer",
  "unk_token": "<unk>",
  "word_delimiter_token": "|"
}
```
vocab.json ADDED

```json
{
  "0": 0,
  "1": 1,
  "2": 2,
  "3": 3,
  "4": 4,
  "5": 5,
  "6": 6,
  "7": 7,
  "8": 8,
  "9": 9,
  "a": 10,
  "b": 11,
  "c": 12,
  "d": 13,
  "e": 14,
  "f": 15,
  "g": 16,
  "h": 17,
  "i": 18,
  "j": 19,
  "k": 20,
  "l": 21,
  "m": 22,
  "n": 23,
  "o": 24,
  "p": 25,
  "q": 26,
  "r": 27,
  "s": 28,
  "t": 29,
  "u": 30,
  "v": 31,
  "w": 32,
  "x": 33,
  "y": 34,
  "z": 35,
  "|": 36,
  "å": 37,
  "æ": 38,
  "é": 39,
  "ø": 40,
  "ü": 41
}
```