---
datasets:
- CoRal-dataset/coral-v2
language:
- da
base_model:
- facebook/wav2vec2-xls-r-300m
metrics:
- wer
- cer
license: openrail
pipeline_tag: automatic-speech-recognition
model-index:
- name: roest-wav2vec2-315m-v2
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: CoRal read-aloud
      type: alexandrainst/coral
      split: test
      args: read_aloud
    metrics:
    - type: cer
      value: 6.5% ± 0.2%
      name: CER
    - type: wer
      value: 16.3% ± 0.4%
      name: WER
---

This is a state-of-the-art Danish speech recognition model, trained as part of the CoRal project by [Alvenir](https://www.alvenir.ai/).

## Overview

This repository contains the Wav2Vec2 model trained on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-dataset/coral-v2/tree/main). The CoRal-v2 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects. The model is designed for automatic speech recognition (ASR).

## Quick Start

Start by installing the required libraries:

```shell
$ pip install transformers kenlm pyctcdecode
```

Next, use the model with the `transformers` Python package as follows:

```python
>>> from transformers import pipeline
>>> audio = get_audio()  # placeholder: supply your own 16 kHz raw audio array
>>> transcriber = pipeline(model="CoRal-dataset/roest-wav2vec2-315m-v2")
>>> transcriber(audio)
{'text': 'your transcription'}
```
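
The snippet above expects a raw audio array sampled at 16 kHz. As a minimal sketch of one way to prepare such an input with only the Python standard library, the helper below reads a 16-bit mono PCM WAV file, checks its sample rate, and returns the samples as floats. The name `load_wav_16k_mono` is hypothetical and not part of `transformers`; the helper assumes the file is already 16 kHz mono and raises an error instead of resampling.

```python
import array
import sys
import wave

EXPECTED_RATE = 16_000  # the quick-start snippet expects 16 kHz audio


def load_wav_16k_mono(path):
    """Read a 16-bit mono PCM WAV file and return samples as floats in [-1, 1).

    Hypothetical helper for illustration only. It raises ValueError instead of
    resampling or down-mixing when the file has a different format.
    """
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE:
            raise ValueError(
                f"expected {EXPECTED_RATE} Hz, got {wav.getframerate()} Hz"
            )
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit mono PCM")
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h", frames)  # signed 16-bit integers
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return [sample / 32768.0 for sample in samples]
```

Before calling `transcriber`, the returned list would typically be wrapped in a NumPy array, for example `numpy.asarray(samples, dtype=numpy.float32)`.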

## Model Details

Wav2Vec2 is a state-of-the-art model architecture for speech recognition that leverages self-supervised learning from raw audio. The pre-trained [Wav2Vec2-XLS-R-300M](https://huggingface.co/facebook/wav2vec2-xls-r-300m) model has been fine-tuned for automatic speech recognition on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-dataset/coral-v2/tree/main) to improve its recognition of Danish speech across different dialects. The model was trained for 30K steps using the training setup in the [CoRal repository](https://github.com/alexandrainst/coral/tree) by running:
```
python src/scripts/finetune_asr_model.py model=wav2vec2-small max_steps=30000 datasets.coral_conversation_internal.id=CoRal-dataset/coral-v2 datasets.coral_readaloud_internal.id=CoRal-dataset/coral-v2
```
The model is evaluated with a language model (LM) applied as post-processing. The LM is the one trained and used by [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m).
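
To illustrate what decoding without an LM looks like: CTC-trained models such as Wav2Vec2 emit one score vector per audio frame, and the simplest decoder picks the best token per frame, merges consecutive repeats, and drops the blank token. The sketch below is a toy, pure-Python version with a hypothetical five-symbol vocabulary and hand-written scores; it is not the decoding code used in the CoRal repository. An LM-based decoder (for example one built with `pyctcdecode`) would instead search over several candidate sequences and rescore them with the LM.

```python
BLANK = "<pad>"  # hypothetical blank symbol for this toy vocabulary
VOCAB = [BLANK, "h", "e", "j", " "]


def greedy_ctc_decode(frame_scores, vocab=VOCAB, blank=BLANK):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks."""
    best = [
        max(range(len(scores)), key=scores.__getitem__)
        for scores in frame_scores
    ]
    decoded = []
    previous = None
    for index in best:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank token.
        if index != previous and vocab[index] != blank:
            decoded.append(vocab[index])
        previous = index
    return "".join(decoded)


# Hand-written scores for 6 frames whose best tokens are: h, h, <pad>, e, j, j
scores = [
    [0.1, 0.8, 0.0, 0.1, 0.0],
    [0.2, 0.7, 0.0, 0.1, 0.0],
    [0.9, 0.0, 0.1, 0.0, 0.0],
    [0.1, 0.0, 0.8, 0.1, 0.0],
    [0.0, 0.0, 0.2, 0.8, 0.0],
    [0.1, 0.0, 0.1, 0.8, 0.0],
]
print(greedy_ctc_decode(scores))  # hej
```
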
## Dataset

### [CoRal-v2](https://huggingface.co/datasets/CoRal-dataset/coral-v2/tree/main)
- **Subsets**: 
	- Conversation
	- Read-aloud
- **Language**: Danish.
- **Variation**: Includes various dialects, age groups, and genders.

### License
The dataset is released under a custom license adapted from OpenRAIL-M. It allows commercial use with a few restrictions (speech synthesis and biometric identification). See the [license](https://huggingface.co/Alvenir/coral-1-whisper-large/blob/main/LICENSE).

## Evaluation

The model was evaluated using the following metrics:
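- **Word Error Rate (WER)**: The share of words transcribed incorrectly, i.e. word-level substitutions, deletions, and insertions divided by the number of words in the reference.
- **Character Error Rate (CER)**: The same measure computed over characters instead of words.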
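
As a minimal sketch of how these two metrics are commonly computed: both divide the Levenshtein (edit) distance between the reference and the model output by the reference length, counted in words for WER and in characters for CER. The function names below are illustrative and assume a non-empty reference. The evaluation in the CoRal repository may apply additional text normalisation, so this is not necessarily how the reported numbers were produced.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate as a fraction of the reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate as a fraction of the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "det er en god dag"
hypothesis = "det er en gud dag"
print(f"WER: {wer(reference, hypothesis):.1%}")  # WER: 20.0%
print(f"CER: {cer(reference, hypothesis):.1%}")  # CER: 5.9%
```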

**Note:** The [CoRal test dataset](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) contains only read-aloud data. The model is trained on both read-aloud and conversation data, but it is evaluated on read-aloud data only.


| Model                                                                                            | Number of parameters |   Finetuned on data of type | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
| :----------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
| [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2) |                 315M | Read-aloud and conversation |                                                                             6.5% ± 0.2% |                                                                            16.3% ± 0.4% |
| [CoRal-dataset/roest-whisper-large-v2](https://huggingface.co/CoRal-dataset/roest-whisper-large) |                1540M | Read-aloud and conversation |                                                                             5.3% ± 0.2% |                                                                            12.0% ± 0.4% |
| [Alvenir/coral-1-whisper-large](https://huggingface.co/Alvenir/coral-1-whisper-large)            |                1540M |                  Read-aloud |                                                                         **4.3% ± 0.2%** |                                                                        **10.4% ± 0.3%** |
| [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m)                      |                 315M |                  Read-aloud |                                                                             6.6% ± 0.2% |                                                                            17.0% ± 0.4% |
| [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2)                                  |                1540M |                  Read-aloud |                                                                             4.7% ± 0.2% |                                                                            11.8% ± 0.3% |
| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3)                                 |                1540M |                           - |                                                                            11.4% ± 0.3% |                                                                            28.3% ± 0.6% |

**Note:** The benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than the one reported in the model card.

### Detailed evaluation across demographics on the CoRal test data
<img src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/images/wer.png">

<img src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/images/cer.png">

### CER scores (%) across demographics on the CoRal test data
| Category | roest-wav2vec2-315m-v2 | roest-315m | roest-whisper-large-v2 | coral-1-whisper-large |
|:---:|:---:|:---:|:---:|:---:|
| female | 7.2 | 7.4 | 6.9 | 5.1 |
| male | 5.7 | 5.8 | 3.7 | 3.6 |
| 0-25 | 5.3 | 5.4 | 3.3 | 3.4 |
| 25-50 | 6.0 | 6.2 | 6.5 | 4.0 |
| 50+ | 7.4 | 7.5 | 5.1 | 5.0 |
| Bornholmsk | 6.1 | 6.8 | 3.4 | 3.8 |
| Fynsk | 7.2 | 7.4 | 13.8 | 5.1 |
| Københavnsk | 3.2 | 3.3 | 2.1 | 1.9 |
| Non-native | 7.5 | 7.8 | 4.9 | 4.8 |
| Nordjysk | 2.8 | 2.6 | 1.7 | 1.6 |
| Sjællandsk | 4.5 | 4.4 | 2.9 | 3.0 |
| Sydømål | 6.4 | 6.4 | 4.1 | 4.1 |
| Sønderjysk | 11.6 | 11.9 | 8.8 | 8.8 |
| Vestjysk | 9.8 | 10.1 | 6.9 | 6.4 |
| Østjysk | 4.1 | 4.0 | 2.8 | 2.6 |
| Overall | 6.5 | 6.6 | 5.3 | 4.3 |

### WER scores (%) across demographics on the CoRal test data
| Category | roest-wav2vec2-315m-v2 | roest-315m | roest-whisper-large-v2 | coral-1-whisper-large |
|:---:|:---:|:---:|:---:|:---:|
| female | 17.7 | 18.5 | 14.2 | 11.5 |
| male | 14.9 | 15.5 | 9.9 | 9.4 |
| 0-25 | 14.0 | 14.7 | 9.0 | 9.0 |
| 25-50 | 15.8 | 16.6 | 14.1 | 10.1 |
| 50+ | 17.7 | 18.2 | 11.5 | 11.3 |
| Bornholmsk | 15.7 | 17.7 | 9.3 | 9.8 |
| Fynsk | 17.7 | 18.3 | 24.9 | 12.1 |
| Københavnsk | 10.0 | 10.2 | 6.7 | 5.9 |
| Non-native | 19.4 | 20.9 | 13.0 | 12.2 |
| Nordjysk | 7.5 | 7.7 | 4.9 | 4.5 |
| Sjællandsk | 12.7 | 12.6 | 7.5 | 7.6 |
| Sydømål | 15.3 | 14.9 | 10.3 | 10.0 |
| Sønderjysk | 25.4 | 26.0 | 17.4 | 17.5 |
| Vestjysk | 25.2 | 26.3 | 16.3 | 15.0 |
| Østjysk | 11.3 | 11.7 | 8.0 | 7.5 |
| Overall | 16.3 | 17.0 | 12.0 | 10.4 |


### Roest-wav2vec2-315M with and without language model
Adding a language model (LM) as post-processing can affect performance significantly: for roest-wav2vec2-315M-v2, the WER on the CoRal test set drops from 25.1% to 16.3%. The Roest-v1 and Roest-v2 models use the same LM, namely the one trained and used by [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m).

| Model                                                                                         | Number of parameters |   Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
| :-------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
| [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2) |                 315M | Read-aloud and conversation |                               Yes |                                                                         **6.5% ± 0.2%** |                                                                        **16.3% ± 0.4%** |
| [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2) |                 315M | Read-aloud and conversation |                                No |                                                                             8.2% ± 0.2% |                                                                            25.1% ± 0.4% |
| [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m)                   |                 315M |                  Read-aloud |                               Yes |                                                                             6.6% ± 0.2% |                                                                            17.0% ± 0.4% |
| [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m)                   |                 315M |                  Read-aloud |                                No |                                                                             8.6% ± 0.2% |                                                                            26.3% ± 0.5% |
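
To make the effect of the LM easier to compare across models, the absolute numbers can be converted into relative error reductions. The short calculation below uses only the point estimates from the table above and ignores the confidence intervals; the helper name is illustrative.

```python
def relative_reduction(without_lm, with_lm):
    """Relative error reduction in percent when adding the LM."""
    return 100.0 * (without_lm - with_lm) / without_lm


# Point estimates (in %) from the table above: (without LM, with LM).
results = {
    "roest-wav2vec2-315M-v2": {"CER": (8.2, 6.5), "WER": (25.1, 16.3)},
    "roest-315m": {"CER": (8.6, 6.6), "WER": (26.3, 17.0)},
}

for model, metrics in results.items():
    for metric, (without_lm, with_lm) in metrics.items():
        reduction = relative_reduction(without_lm, with_lm)
        print(f"{model} {metric}: {reduction:.1f}% lower with LM")
```

On these point estimates, the LM lowers WER by roughly 35% and CER by roughly 21% to 23% in relative terms for both models.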

### Detailed Roest-wav2vec2-315M with and without language model on different dialects
The table below shows the results for the different Danish dialects in the test set:

| Dialect     | Roest-1 CER (%), no LM | Roest-1 WER (%), no LM | Roest-1 CER (%), with LM | Roest-1 WER (%), with LM | Roest-2 CER (%), no LM | Roest-2 WER (%), no LM | Roest-2 CER (%), with LM | Roest-2 WER (%), with LM |
|-------------|------------------------|------------------------|--------------------------|--------------------------|------------------------|------------------------|--------------------------|--------------------------|
| Vestjysk    | 12.7                   | 37.1                   | 10.1                     | 26.3                     | 12.2                   | 36.3                   | 9.82                     | 25.2                     |
| Sønderjysk  | 14.7                   | 37.8                   | 11.9                     | 26.0                     | 14.2                   | 36.2                   | 11.6                     | 25.4                     |
| Bornholmsk  | 9.32                   | 29.9                   | 6.79                     | 17.7                     | 8.08                   | 26.7                   | 6.12                     | 15.7                     |
| Østjysk     | 5.51                   | 18.7                   | 3.97                     | 11.7                     | 5.39                   | 18.0                   | 4.06                     | 11.3                     |
| Nordjysk    | 3.86                   | 13.6                   | 2.57                     | 7.72                     | 3.80                   | 13.5                   | 2.75                     | 7.51                     |
| Københavnsk | 5.27                   | 18.8                   | 3.31                     | 10.2                     | 5.02                   | 17.7                   | 3.20                     | 9.98                     |
| Fynsk       | 9.41                   | 28.6                   | 7.43                     | 18.3                     | 8.86                   | 27.0                   | 7.20                     | 17.7                     |
| Non-native  | 10.6                   | 33.2                   | 7.84                     | 20.9                     | 10.0                   | 31.6                   | 7.46                     | 19.4                     |
| Sjællandsk  | 5.82                   | 19.5                   | 4.44                     | 12.6                     | 5.70                   | 18.6                   | 4.48                     | 12.7                     |
| Sydømål     | 7.09                   | 20.7                   | 6.38                     | 14.9                     | 6.96                   | 20.4                   | 6.44                     | 15.3                     |

### Performance on Other Datasets

The model was also tested against other datasets to evaluate generalizability:

|                                                                                       | **Roest-wav2vec2-315M-v1** |       | **Roest-wav2vec2-315M-v2** |          |
| ------------------------------------------------------------------------------------- | ----------- | ----- | ----------- | -------- |
| Evaluation Dataset                                                                    | WER %       | CER % | WER %       | CER %    |
| [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test)   | 17.0        | 6.6   | **16.3**    | **6.5**  |
| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da)                        | 29.7        | 13.9  | **26.1**    | **11.9** |
| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 16.7        | 6.6   | **14.4**    | **5.4**  |
| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs)                         | 27.3        | 7.9   | **26.4**    | **7.7**  |
| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) Normed                  | 16.6        | 6.3   | **15.6**    | **6.1**  |

## Training curves
<img src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/images/training_plots.png">

## Creators and Funders
This model was trained, and this model card written, by Marie Juhl Jørgensen and Søren Vejlgaard Holm at [Alvenir](https://www.alvenir.ai/).

The CoRal project is funded by the [Danish Innovation Fund](https://innovationsfonden.dk/) and consists of the following partners:

- [Alexandra Institute](https://alexandra.dk/)
- [University of Copenhagen](https://www.ku.dk/)
- [Agency for Digital Government](https://digst.dk/)
- [Alvenir](https://www.alvenir.ai/)
- [Corti](https://www.corti.ai/)

We would especially like to thank Dan Saattrup Nielsen (Alexandra Institute) for, among other things, the repository work, and Simon Leminen Madsen (Alexandra Institute) for the modelling work.