datasets:
- CoRal-dataset/coral-v2
language:
- da
base_model:
- facebook/wav2vec2-xls-r-300m
metrics:
- wer
- cer
license: openrail
pipeline_tag: automatic-speech-recognition
model-index:
- name: roest-wav2vec2-315m-v2
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: CoRal read-aloud
      type: alexandrainst/coral
      split: test
      args: read_aloud
    metrics:
    - type: cer
      value: 6.5% ± 0.2%
      name: CER
    - type: wer
      value: 16.3% ± 0.4%
      name: WER
This is a state-of-the-art Danish speech recognition model, trained by Alvenir as part of the CoRal project.
Overview
This repository contains the Wav2Vec2 model trained on the CoRal-v2 dataset. The CoRal-v2 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects. The model is designed for automatic speech recognition (ASR).
Quick Start
Start by installing the required libraries:
$ pip install transformers kenlm pyctcdecode
Next, use the model with the transformers Python package as follows:
>>> from transformers import pipeline
>>> audio = get_audio() # 16kHz raw audio array
>>> transcriber = pipeline(model="CoRal-dataset/roest-wav2vec2-315m-v2")
>>> transcriber(audio)
{'text': 'your transcription'}
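The `audio` placeholder above has to hold 16 kHz samples. As a minimal sketch using only the Python standard library, and assuming the input is a 16-bit PCM mono WAV file, a hypothetical helper `load_wav_16k` (not part of transformers or the CoRal code) could read such a file and reject anything that would need resampling:

```python
import array
import sys
import wave

EXPECTED_RATE = 16_000  # the quick-start snippet expects 16 kHz audio


def load_wav_16k(path):
    """Read a 16-bit PCM mono WAV file and return samples as floats in [-1, 1).

    Hypothetical helper for illustration only. It refuses files that would
    need resampling or down-mixing instead of silently converting them.
    """
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != EXPECTED_RATE:
            raise ValueError(
                f"expected {EXPECTED_RATE} Hz, got {wav.getframerate()} Hz"
            )
        if wav.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wav.getsampwidth() != 2:
            raise ValueError("expected 16-bit PCM samples")
        frames = wav.readframes(wav.getnframes())

    samples = array.array("h", frames)  # signed 16-bit integers
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian
    return [sample / 32768.0 for sample in samples]
```

The returned list can be converted with `numpy.asarray(samples, dtype=numpy.float32)` before it is passed to `transcriber`.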
Transcription examples
Example 1
Dialect: Vestjysk
Transcription: det blev til yderlig ti mål i den første sæson på trods af en position som back
Target transcription: det blev til yderligere ti mål i den første sæson på trods af en position som back
CER: 3.7%
WER: 5.9%
Example 2
Dialect: Sønderjysk
Transcription: en arkitektoniske udformning af pladser forslagene iver benzen
Target transcription: den arkitektoniske udformning af pladsen er forestået af ivar bentsen
CER: 20.3%
WER: 60.0%
Example 3
Dialect: Nordsjællandsk
Transcription: østrig og ungarn samarbejder om søen gennem den østrigske og ungarske vandkommission
Target transcription: østrig og ungarn samarbejder om søen gennem den østrigske og ungarske vandkommission
CER: 0.0%
WER: 0.0%
Example 4
Dialect: Lollandsk
Transcription: det er produceret af thomas helme og indspillede i easy sound recording studio i københavn
Target transcription: det er produceret af thomas helmig og indspillet i easy sound recording studio i københavn
CER: 4.4%
WER: 13.3%
Model Details
Wav2Vec2 is a state-of-the-art model architecture for speech recognition that leverages self-supervised learning from raw audio data. The pre-trained Wav2Vec2-XLS-R-300M has been fine-tuned for automatic speech recognition on the CoRal-v2 dataset to improve its recognition of Danish speech across different dialects. The model was trained for 30K steps using the training setup in the CoRal repository by running:
python src/scripts/finetune_asr_model.py model=wav2vec2-small max_steps=30000 datasets.coral_conversation_internal.id=CoRal-dataset/coral-v2 datasets.coral_readaloud_internal.id=CoRal-dataset/coral-v2
The model is evaluated using a Language Model (LM) as post-processing. The utilized LM is the one trained and used by alexandrainst/roest-wav2vec2-315m-v1.
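For context on what the LM post-processing replaces: assuming a CTC output head, as is usual for fine-tuned Wav2Vec2 models, the model emits one label per audio frame, and the simplest decoding without an LM is a greedy collapse of those labels. The sketch below illustrates only that collapse step; the toy vocabulary, the blank index and the name `ctc_greedy_decode` are made up for illustration and are not taken from the CoRal code.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank token in this toy vocabulary
VOCAB = {1: "h", 2: "e", 3: "j", 4: " "}  # illustrative, not the model's vocabulary


def ctc_greedy_decode(frame_ids, blank=BLANK, vocab=VOCAB):
    """Collapse per-frame label ids into text: merge repeats, then drop blanks."""
    collapsed = [label for label, _ in groupby(frame_ids)]
    return "".join(vocab[label] for label in collapsed if label != blank)


# Frames "h h _ e e _ j j" collapse to "hej".
print(ctc_greedy_decode([1, 1, 0, 2, 2, 0, 3, 3]))
# A blank between two identical labels keeps both of them: "ee".
print(ctc_greedy_decode([2, 0, 2]))
```

An LM-based decoder such as the one in pyctcdecode instead keeps several candidate label sequences and rescores them with the language model, which is why it can correct words that the greedy path gets wrong.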
Dataset
CoRal-v2
- Subsets:
- Conversation
- Read-aloud
- Language: Danish.
- Variation: Includes various dialects, age groups, and gender distinctions.
License
Note that the dataset used is licensed under a custom license, adapted from OpenRAIL-M, which allows commercial use with a few restrictions (speech synthesis and biometric identification). See license.
Evaluation
The model was evaluated using the following metrics:
- Word Error Rate (WER): The percentage of words incorrectly transcribed.
- Character Error Rate (CER): The percentage of characters incorrectly transcribed.
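More precisely, both metrics are usually computed as an edit distance (substitutions, deletions and insertions) divided by the length of the reference, so they can exceed 100%. The sketch below is a plain-Python illustration of that definition, not the evaluation code used for the benchmark; it reproduces the figures reported for Example 1 above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character-level edit distance divided by the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


target = "det blev til yderligere ti mål i den første sæson på trods af en position som back"
output = "det blev til yderlig ti mål i den første sæson på trods af en position som back"
print(f"WER: {wer(target, output):.1%}")  # WER: 5.9%
print(f"CER: {cer(target, output):.1%}")  # CER: 3.7%
```

Here one of 17 reference words is wrong, and three of 82 reference characters are missing.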
Note: The CoRal test set does not contain any conversation data. Although the model was trained on both read-aloud and conversation data, it is only evaluated on read-aloud speech.
Model | Number of parameters | Finetuned on data of type | CoRal CER | CoRal WER |
---|---|---|---|---|
CoRal-dataset/roest-wav2vec2-315M-v2 | 315M | Read-aloud and conversation | 6.5% ± 0.2% | 16.3% ± 0.4% |
CoRal-dataset/roest-whisper-large-v2 | 1540M | Read-aloud and conversation | 5.3% ± 0.2% | 12.0% ± 0.4% |
Alvenir/roest-whisper-large-v1 | 1540M | Read-aloud | 4.3% ± 0.2% | 10.4% ± 0.3% |
alexandrainst/roest-wav2vec2-315M-v1 | 315M | Read-aloud | 6.6% ± 0.2% | 17.0% ± 0.4% |
mhenrichsen/hviske-v2 | 1540M | Read-aloud | 4.7% ± 0.2% | 11.8% ± 0.3% |
openai/whisper-large-v3 | 1540M | - | 11.4% ± 0.3% | 28.3% ± 0.6% |
Note: The benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than the one reported in its model card.
Detailed evaluation across demographics on the CoRal test data


Table: CER scores (%) across demographics on the CoRal test data
Category | roest-wav2vec2-315m-v2 | roest-wav2vec2-315m-v1 | roest-whisper-large-v2 | roest-whisper-large-v1 |
---|---|---|---|---|
female | 7.2 | 7.4 | 6.9 | 5.1 |
male | 5.7 | 5.8 | 3.7 | 3.6 |
0-25 | 5.3 | 5.4 | 3.3 | 3.4 |
25-50 | 6.0 | 6.2 | 6.5 | 4.0 |
50+ | 7.4 | 7.5 | 5.1 | 5.0 |
Bornholmsk | 6.1 | 6.8 | 3.4 | 3.8 |
Fynsk | 7.2 | 7.4 | 13.8 | 5.1 |
Københavnsk | 3.2 | 3.3 | 2.1 | 1.9 |
Non-native | 7.5 | 7.8 | 4.9 | 4.8 |
Nordjysk | 2.8 | 2.6 | 1.7 | 1.6 |
Sjællandsk | 4.5 | 4.4 | 2.9 | 3.0 |
Sydømål | 6.4 | 6.4 | 4.1 | 4.1 |
Sønderjysk | 11.6 | 11.9 | 8.8 | 8.8 |
Vestjysk | 9.8 | 10.1 | 6.9 | 6.4 |
Østjysk | 4.1 | 4.0 | 2.8 | 2.6 |
Overall | 6.5 | 6.6 | 5.3 | 4.3 |
Table: WER scores (%) across demographics on the CoRal test data
Category | roest-wav2vec2-315m-v2 | roest-wav2vec2-315m-v1 | roest-whisper-large-v2 | roest-whisper-large-v1 |
---|---|---|---|---|
female | 17.7 | 18.5 | 14.2 | 11.5 |
male | 14.9 | 15.5 | 9.9 | 9.4 |
0-25 | 14.0 | 14.7 | 9.0 | 9.0 |
25-50 | 15.8 | 16.6 | 14.1 | 10.1 |
50+ | 17.7 | 18.2 | 11.5 | 11.3 |
Bornholmsk | 15.7 | 17.7 | 9.3 | 9.8 |
Fynsk | 17.7 | 18.3 | 24.9 | 12.1 |
Københavnsk | 10.0 | 10.2 | 6.7 | 5.9 |
Non-native | 19.4 | 20.9 | 13.0 | 12.2 |
Nordjysk | 7.5 | 7.7 | 4.9 | 4.5 |
Sjællandsk | 12.7 | 12.6 | 7.5 | 7.6 |
Sydømål | 15.3 | 14.9 | 10.3 | 10.0 |
Sønderjysk | 25.4 | 26.0 | 17.4 | 17.5 |
Vestjysk | 25.2 | 26.3 | 16.3 | 15.0 |
Østjysk | 11.3 | 11.7 | 8.0 | 7.5 |
Overall | 16.3 | 17.0 | 12.0 | 10.4 |
Roest-wav2vec2-315M with and without language model
Adding a post-processing language model (LM) lowers the error rates considerably, as the table below shows. The Roest-v1 and Roest-v2 models use the same LM, namely the one trained and used by alexandrainst/roest-wav2vec2-315m-v1.
Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | CoRal CER | CoRal WER |
---|---|---|---|---|---|
CoRal-dataset/roest-wav2vec2-315M-v2 | 315M | Read-aloud and conversation | Yes | 6.5% ± 0.2% | 16.3% ± 0.4% |
CoRal-dataset/roest-wav2vec2-315M-v2 | 315M | Read-aloud and conversation | No | 8.2% ± 0.2% | 25.1% ± 0.4% |
alexandrainst/roest-wav2vec2-315m-v1 | 315M | Read-aloud | Yes | 6.6% ± 0.2% | 17.0% ± 0.4% |
alexandrainst/roest-wav2vec2-315m-v1 | 315M | Read-aloud | No | 8.6% ± 0.2% | 26.3% ± 0.5% |
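To put the table above in relative terms, the short calculation below uses only the point estimates and ignores the ± intervals, so the result is indicative only.

```python
# (WER without LM, WER with LM) in %, copied from the table above
wer_scores = {
    "roest-wav2vec2-315M-v2": (25.1, 16.3),
    "roest-wav2vec2-315m-v1": (26.3, 17.0),
}


def relative_reduction(before, after):
    """Relative error reduction in percent."""
    return 100.0 * (before - after) / before


for model, (no_lm, with_lm) in wer_scores.items():
    reduction = relative_reduction(no_lm, with_lm)
    print(f"{model}: {reduction:.0f}% relative WER reduction")
```

For both models the language model removes roughly 35% of the word errors.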
Detailed Roest-wav2vec2-315M with and without language model on different dialects
Here are the results of the models on the different Danish dialects in the test set:
Dialect | Roest-v1, no LM: CER (%) | Roest-v1, no LM: WER (%) | Roest-v1, with LM: CER (%) | Roest-v1, with LM: WER (%) | Roest-v2, no LM: CER (%) | Roest-v2, no LM: WER (%) | Roest-v2, with LM: CER (%) | Roest-v2, with LM: WER (%) |
---|---|---|---|---|---|---|---|---|
Vestjysk | 12.7 | 37.1 | 10.1 | 26.3 | 12.2 | 36.3 | 9.82 | 25.2 |
Sønderjysk | 14.7 | 37.8 | 11.9 | 26.0 | 14.2 | 36.2 | 11.6 | 25.4 |
Bornholmsk | 9.32 | 29.9 | 6.79 | 17.7 | 8.08 | 26.7 | 6.12 | 15.7 |
Østjysk | 5.51 | 18.7 | 3.97 | 11.7 | 5.39 | 18.0 | 4.06 | 11.3 |
Nordjysk | 3.86 | 13.6 | 2.57 | 7.72 | 3.80 | 13.5 | 2.75 | 7.51 |
Københavnsk | 5.27 | 18.8 | 3.31 | 10.2 | 5.02 | 17.7 | 3.20 | 9.98 |
Fynsk | 9.41 | 28.6 | 7.43 | 18.3 | 8.86 | 27.0 | 7.20 | 17.7 |
Non-native | 10.6 | 33.2 | 7.84 | 20.9 | 10.0 | 31.6 | 7.46 | 19.4 |
Sjællandsk | 5.82 | 19.5 | 4.44 | 12.6 | 5.70 | 18.6 | 4.48 | 12.7 |
Sydømål | 7.09 | 20.7 | 6.38 | 14.9 | 6.96 | 20.4 | 6.44 | 15.3 |
Performance on Other Datasets
The model was also tested against other datasets to evaluate generalizability:
Evaluation Dataset | Roest-wav2vec2-315M-v1 WER (%) | Roest-wav2vec2-315M-v1 CER (%) | Roest-wav2vec2-315M-v2 WER (%) | Roest-wav2vec2-315M-v2 CER (%) |
---|---|---|---|---|
CoRal | 17.0 | 6.6 | 16.3 | 6.5 |
NST-da | 29.7 | 13.9 | 26.1 | 11.9 |
CommonVoice17 | 16.7 | 6.6 | 14.4 | 5.4 |
Fleurs-da_dk | 27.3 | 7.9 | 26.4 | 7.7 |
Fleurs-da_dk Normed | 16.6 | 6.3 | 15.6 | 6.1 |
Training curves

Creators and Funders
This model has been trained and the model card written by Marie Juhl Jørgensen and Søren Vejlgaard Holm at Alvenir.
The CoRal project is funded by the Danish Innovation Fund and consists of the following partners:
We would specifically like to thank Dan Saattrup Nielsen, Alexandra Institute, for (among other things) the repository work, and Simon Leminen Madsen, Alexandra Institute, for the modelling work.