---
datasets:
- CoRal-dataset/coral-v2
language:
- da
base_model:
- facebook/wav2vec2-xls-r-300m
metrics:
- wer
- cer
license: openrail
pipeline_tag: automatic-speech-recognition
model-index:
- name: roest-wav2vec2-315m-v2
results:
- task:
type: automatic-speech-recognition
name: Automatic Speech Recognition
dataset:
name: CoRal read-aloud
type: alexandrainst/coral
split: test
args: read_aloud
metrics:
- type: cer
value: 6.5% ± 0.2%
name: CER
- type: wer
value: 16.3% ± 0.4%
name: WER
---
This is a Danish state-of-the-art speech recognition model, trained as part of the CoRal project by [Alvenir](https://www.alvenir.ai/).
## Overview
This repository contains the Wav2Vec2 model trained on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-dataset/coral-v2/tree/main). The CoRal-v2 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects. The model is designed for automatic speech recognition (ASR).
## Quick Start
Start by installing the required libraries:
```shell
$ pip install transformers kenlm pyctcdecode
```
Next, you can use the model with the `transformers` Python package as follows:
```python
>>> from transformers import pipeline
>>> audio = get_audio() # 16kHz raw audio array
>>> transcriber = pipeline(model="CoRal-dataset/roest-wav2vec2-315m-v2")
>>> transcriber(audio)
{'text': 'your transcription'}
```
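The `get_audio()` call above is just a placeholder. A minimal sketch of one way to produce the expected 16 kHz mono array, assuming a hypothetical local file `audio.wav` and the `soundfile` and `librosa` packages, could look like this:
```python
# Minimal sketch of loading audio for the pipeline; "audio.wav" is a
# hypothetical local file, and soundfile/librosa are assumed to be installed.
import soundfile as sf
import librosa

def get_audio(path: str = "audio.wav", target_sr: int = 16_000):
    """Load an audio file and return a 16 kHz mono float array."""
    audio, sr = sf.read(path, dtype="float32")
    if audio.ndim > 1:              # down-mix stereo to mono
        audio = audio.mean(axis=1)
    if sr != target_sr:             # resample to the model's expected rate
        audio = librosa.resample(audio, orig_sr=sr, target_sr=target_sr)
    return audio
```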
## Model Details
Wav2Vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained [Wav2Vec2-XLS-R-300M](https://huggingface.co/facebook/wav2vec2-xls-r-300m) has been fine-tuned for automatic speech recognition on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-dataset/coral-v2/tree/main) to enhance its performance in recognizing Danish speech across different dialects. The model was trained for 30K steps using the training setup in the [CoRal repository](https://github.com/alexandrainst/coral/tree) by running:
```shell
python src/scripts/finetune_asr_model.py \
    model=wav2vec2-small \
    max_steps=30000 \
    datasets.coral_conversation_internal.id=CoRal-dataset/coral-v2 \
    datasets.coral_readaloud_internal.id=CoRal-dataset/coral-v2
```
The model is evaluated with a language model (LM) applied as post-processing. The LM is the one trained and used by [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m).
## Dataset
### [CoRal-v2](https://huggingface.co/datasets/CoRal-dataset/coral-v2/tree/main)
- **Subsets**:
- Conversation
- Read-aloud
- **Language**: Danish.
- **Variation**: Includes various dialects, age groups, and gender distinctions.
### License
Note that the dataset used is licensed under a custom license, adapted from OpenRAIL-M, which allows commercial use with a few restrictions (speech synthesis and biometric identification). See [license](https://huggingface.co/Alvenir/coral-1-whisper-large/blob/main/LICENSE).
## Evaluation
The model was evaluated using the following metrics:
- **Word Error Rate (WER)**: The percentage of words incorrectly transcribed.
- **Character Error Rate (CER)**: The percentage of characters incorrectly transcribed.
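For illustration, here is a minimal sketch of how these metrics can be computed, assuming the `jiwer` package (the exact evaluation code used in the CoRal project may differ):
```python
# Minimal sketch of WER/CER computation with jiwer (assumed installed);
# the reference/hypothesis strings are illustrative only.
import jiwer

reference = "det er en dejlig dag i danmark"
hypothesis = "det er en dejlig dag i dannmark"

wer = jiwer.wer(reference, hypothesis)  # word-level error rate
cer = jiwer.cer(reference, hypothesis)  # character-level error rate
print(f"WER: {wer:.1%}, CER: {cer:.1%}")
```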
**OBS!** Note that the [CoRal test dataset](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) contains only read-aloud data, so although the model is trained on both read-aloud and conversational speech, it is evaluated on read-aloud speech only.
| Model | Number of parameters | Finetuned on data of type | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
| :----------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
| [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 6.5% ± 0.2% | 16.3% ± 0.4% |
| [CoRal-dataset/roest-whisper-large-v2](https://huggingface.co/CoRal-dataset/roest-whisper-large) | 1540M | Read-aloud and conversation | 5.3% ± 0.2% | 12.0% ± 0.4% |
| [Alvenir/coral-1-whisper-large](https://huggingface.co/Alvenir/coral-1-whisper-large) | 1540M | Read-aloud | **4.3% ± 0.2%** | **10.4% ± 0.3%** |
| [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m) | 315M | Read-aloud | 6.6% ± 0.2% | 17.0% ± 0.4% |
| [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 4.7% ± 0.2% | 11.8% ± 0.3% |
| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 11.4% ± 0.3% | 28.3% ± 0.6% |
**OBS!** The benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than reported in its model card.
### Detailed evaluation across demographics on the CoRal test data
<img src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/images/wer.png">
<img src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/images/cer.png">
### CER scores (%) across demographics on the CoRal test data
| Category | roest-wav2vec2-315m-v2 | roest-315m | roest-whisper-large-v2 | coral-1-whisper-large |
|:---:|:---:|:---:|:---:|:---:|
| female | 7.2 | 7.4 | 6.9 | 5.1 |
| male | 5.7 | 5.8 | 3.7 | 3.6 |
| 0-25 | 5.3 | 5.4 | 3.3 | 3.4 |
| 25-50 | 6.0 | 6.2 | 6.5 | 4.0 |
| 50+ | 7.4 | 7.5 | 5.1 | 5.0 |
| Bornholmsk | 6.1 | 6.8 | 3.4 | 3.8 |
| Fynsk | 7.2 | 7.4 | 13.8 | 5.1 |
| Københavnsk | 3.2 | 3.3 | 2.1 | 1.9 |
| Non-native | 7.5 | 7.8 | 4.9 | 4.8 |
| Nordjysk | 2.8 | 2.6 | 1.7 | 1.6 |
| Sjællandsk | 4.5 | 4.4 | 2.9 | 3.0 |
| Sydømål | 6.4 | 6.4 | 4.1 | 4.1 |
| Sønderjysk | 11.6 | 11.9 | 8.8 | 8.8 |
| Vestjysk | 9.8 | 10.1 | 6.9 | 6.4 |
| Østjysk | 4.1 | 4.0 | 2.8 | 2.6 |
| Overall | 6.5 | 6.6 | 5.3 | 4.3 |
### WER scores (%) across demographics on the CoRal test data
| Category | roest-wav2vec2-315m-v2 | roest-315m | roest-whisper-large-v2 | coral-1-whisper-large |
|:---:|:---:|:---:|:---:|:---:|
| female | 17.7 | 18.5 | 14.2 | 11.5 |
| male | 14.9 | 15.5 | 9.9 | 9.4 |
| 0-25 | 14.0 | 14.7 | 9.0 | 9.0 |
| 25-50 | 15.8 | 16.6 | 14.1 | 10.1 |
| 50+ | 17.7 | 18.2 | 11.5 | 11.3 |
| Bornholmsk | 15.7 | 17.7 | 9.3 | 9.8 |
| Fynsk | 17.7 | 18.3 | 24.9 | 12.1 |
| Københavnsk | 10.0 | 10.2 | 6.7 | 5.9 |
| Non-native | 19.4 | 20.9 | 13.0 | 12.2 |
| Nordjysk | 7.5 | 7.7 | 4.9 | 4.5 |
| Sjællandsk | 12.7 | 12.6 | 7.5 | 7.6 |
| Sydømål | 15.3 | 14.9 | 10.3 | 10.0 |
| Sønderjysk | 25.4 | 26.0 | 17.4 | 17.5 |
| Vestjysk | 25.2 | 26.3 | 16.3 | 15.0 |
| Østjysk | 11.3 | 11.7 | 8.0 | 7.5 |
| Overall | 16.3 | 17.0 | 12.0 | 10.4 |
### Roest-wav2vec2-315M with and without language model
The inclusion of a post-processing language model can affect performance significantly. The Roest-v1 and Roest-v2 models use the same language model (LM), namely the one trained and used by [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m).
| Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
| :-------------------------------------------------------------------------------------------- | -------------------: | --------------------------: | --------------------------------: | --------------------------------------------------------------------------------------: | --------------------------------------------------------------------------------------: |
| [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | Yes | **6.5% ± 0.2%** | **16.3% ± 0.4%** |
| [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | No | 8.2% ± 0.2% | 25.1% ± 0.4% |
| [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m) | 315M | Read-aloud | Yes | 6.6% ± 0.2% | 17.0% ± 0.4% |
| [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m) | 315M | Read-aloud | No | 8.6% ± 0.2% | 26.3% ± 0.5% |
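The "No" rows above correspond to plain greedy CTC decoding. A minimal sketch of such decoding, assuming the checkpoint's tokenizer and feature extractor can be loaded with the plain `Wav2Vec2Processor` (the pipeline in the Quick Start applies the LM automatically when `kenlm` and `pyctcdecode` are installed), could look like this:
```python
# Minimal sketch of greedy CTC decoding without the language model.
# Assumes Wav2Vec2Processor can load the checkpoint's tokenizer and
# feature extractor; audio is a 16 kHz float array as in the Quick Start.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

model_id = "CoRal-dataset/roest-wav2vec2-315m-v2"
processor = Wav2Vec2Processor.from_pretrained(model_id)
model = Wav2Vec2ForCTC.from_pretrained(model_id)

audio = get_audio()  # 16 kHz raw audio array
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
pred_ids = logits.argmax(dim=-1)
print(processor.batch_decode(pred_ids)[0])  # greedy transcription, no LM
```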
### Detailed Roest-wav2vec2-315M with and without language model on different dialects
Here are the results of the model on different Danish dialects in the test set:
| Dialect | Roest-1 (no LM) CER (%) | Roest-1 (no LM) WER (%) | Roest-1 (with LM) CER (%) | Roest-1 (with LM) WER (%) | Roest-2 (no LM) CER (%) | Roest-2 (no LM) WER (%) | Roest-2 (with LM) CER (%) | Roest-2 (with LM) WER (%) |
|:---|---:|---:|---:|---:|---:|---:|---:|---:|
| Vestjysk | 12.7 | 37.1 | 10.1 | 26.3 | 12.2 | 36.3 | 9.82 | 25.2 |
| Sønderjysk | 14.7 | 37.8 | 11.9 | 26.0 | 14.2 | 36.2 | 11.6 | 25.4 |
| Bornholmsk | 9.32 | 29.9 | 6.79 | 17.7 | 8.08 | 26.7 | 6.12 | 15.7 |
| Østjysk | 5.51 | 18.7 | 3.97 | 11.7 | 5.39 | 18.0 | 4.06 | 11.3 |
| Nordjysk | 3.86 | 13.6 | 2.57 | 7.72 | 3.80 | 13.5 | 2.75 | 7.51 |
| Københavnsk | 5.27 | 18.8 | 3.31 | 10.2 | 5.02 | 17.7 | 3.20 | 9.98 |
| Fynsk | 9.41 | 28.6 | 7.43 | 18.3 | 8.86 | 27.0 | 7.20 | 17.7 |
| Non-native | 10.6 | 33.2 | 7.84 | 20.9 | 10.0 | 31.6 | 7.46 | 19.4 |
| Sjællandsk | 5.82 | 19.5 | 4.44 | 12.6 | 5.70 | 18.6 | 4.48 | 12.7 |
| Sydømål | 7.09 | 20.7 | 6.38 | 14.9 | 6.96 | 20.4 | 6.44 | 15.3 |
### Performance on Other Datasets
The model was also tested against other datasets to evaluate generalizability:
| Evaluation Dataset | **Roest-wav2vec2-315M-v1** WER (%) | **Roest-wav2vec2-315M-v1** CER (%) | **Roest-wav2vec2-315M-v2** WER (%) | **Roest-wav2vec2-315M-v2** CER (%) |
|:---|---:|---:|---:|---:|
| [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) | 17.0 | 6.6 | **16.3** | **6.5** |
| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.7 | 13.9 | **26.1** | **11.9** |
| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 16.7 | 6.6 | **14.4** | **5.4** |
| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | 27.3 | 7.9 | **26.4** | **7.7** |
| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) Normed | 16.6 | 6.3 | **15.6** | **6.1** |
## Training curves
<img src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/images/training_plots.png">
## Creators and Funders
This model was trained, and the model card written, by Marie Juhl Jørgensen and Søren Vejlgaard Holm at [Alvenir](https://www.alvenir.ai/).
The CoRal project is funded by the [Danish Innovation Fund](https://innovationsfonden.dk/) and consists of the following partners:
- [Alexandra Institute](https://alexandra.dk/)
- [University of Copenhagen](https://www.ku.dk/)
- [Agency for Digital Government](https://digst.dk/)
- [Alvenir](https://www.alvenir.ai/)
- [Corti](https://www.corti.ai/)
We would specifically like to thank Dan Saattrup Nielsen (Alexandra Institute) for, among other things, the repository work, and Simon Leminen Madsen (Alexandra Institute) for the modelling work.