Røst-wav2vec2-2B-v2

This is a Danish state-of-the-art speech recognition model, trained as part of the CoRal project by Alvenir.

This repository contains a Wav2Vec2 model trained on the CoRal-v2 dataset. The CoRal-v2 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects. The model is designed for automatic speech recognition (ASR).

The model has been evaluated comprehensively, and røst-wav2vec2-2B-v2 demonstrates superior performance on multiple test sets. It achieves the lowest error rates of all evaluated models on the tentative CoRal-v2::conversation test set. Furthermore, it achieves the lowest error rates on several zero-shot test sets, setting new state-of-the-art results in Danish ASR.

Quick Start

Start by installing the required libraries:

$ pip install transformers kenlm pyctcdecode

You can then use the model with the transformers Python package as follows:

>>> from transformers import pipeline
>>> audio = get_audio()  # 16kHz raw audio array
>>> transcriber = pipeline(model="CoRal-project/roest-wav2vec2-2B-v2")
>>> transcriber(audio)
{'text': 'your transcription'}
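
The pipeline expects raw mono audio sampled at 16 kHz. A minimal sketch of how get_audio() could be implemented, assuming a local recording and the librosa package (not part of the requirements listed above):

import librosa

def get_audio():
    # Load a local recording and resample to the 16 kHz expected by the model.
    # "speech.wav" is a placeholder path.
    audio, _ = librosa.load("speech.wav", sr=16000, mono=True)
    return audio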

Model Details

Wav2Vec2 is a state-of-the-art model architecture for speech recognition, leveraging self-supervised learning from raw audio data. The pre-trained wav2vec2-xls-r-2b model has been fine-tuned for automatic speech recognition on the CoRal-v2 dataset to improve its recognition of Danish speech across different dialects. The model was trained for 30K steps using the training setup in the CoRal repository by running:

python src/scripts/finetune_asr_model.py  \
model=wav2vec2-large \
max_steps=30000 \
datasets.coral_conversation_internal.id=CoRal-project/coral-v2 \
datasets.coral_readaloud_internal.id=CoRal-project/coral-v2

The model is evaluated with a language model (LM) as a post-processing step. The LM is the one trained and used by CoRal-project/roest-wav2vec2-315m-v1.
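
LM-rescored transcriptions can also be obtained outside the pipeline. A minimal sketch, assuming the repository ships the LM decoder files (which the kenlm and pyctcdecode requirements suggest):

import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2ProcessorWithLM

model_id = "CoRal-project/roest-wav2vec2-2B-v2"
processor = Wav2Vec2ProcessorWithLM.from_pretrained(model_id)  # assumes LM files are bundled with the repo
model = Wav2Vec2ForCTC.from_pretrained(model_id)

audio = get_audio()  # 16 kHz raw audio array, as in the Quick Start
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Beam-search decoding through pyctcdecode, rescored with the KenLM language model
transcription = processor.batch_decode(logits.numpy()).text[0]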

The model was trained on the CoRal-v2 dataset, including both the conversational and read-aloud subsets. This dataset consists of Danish speech across a variety of dialects, age groups, and genders. Note that the dataset is licensed under a custom license, adapted from OpenRAIL-M, which allows commercial use with a few restrictions (speech synthesis and biometric identification are not permitted). See license.


Evaluation

The model was evaluated using the following metrics:

  • Character Error Rate (CER): The percentage of characters incorrectly transcribed.
  • Word Error Rate (WER): The percentage of words incorrectly transcribed.
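
Both metrics can be computed with the jiwer package (not listed in the requirements above); a minimal sketch:

import jiwer

reference = "det er en prøve"   # ground-truth transcript (illustrative)
hypothesis = "det er en prove"  # model output (illustrative)

wer = jiwer.wer(reference, hypothesis)  # fraction of words substituted, inserted or deleted
cer = jiwer.cer(reference, hypothesis)  # same, but at the character level
print(f"WER: {wer:.1%}, CER: {cer:.1%}")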

Zero-shot performance on open evaluation datasets

To evaluate generalizability, the model was evaluated on multiple open-source datasets. The røst-wav2vec2-v2 models generally improve on the previous state of the art (røst-whisper-large-v1), with the 2B model achieving new state-of-the-art results on all of the zero-shot test sets. Røst-whisper-large-v1 still achieves lower error rates on the CoRal-v1 test set:

| Evaluation Dataset | Røst-wav2vec2-2B-v2 WER % / CER % | Røst-wav2vec2-1B-v2 WER % / CER % | Røst-wav2vec2-315M-v2 WER % / CER % | Røst-wav2vec2-315M-v1 WER % / CER % | Røst-whisper-large-v1 WER % / CER % |
|---|---|---|---|---|---|
| CoRal-v1 | 16.0 / 6.2 | 16.4 / 6.5 | 16.3 / 6.5 | 17.0 / 6.6 | 10.4 / 4.3 |
| NST-da | 27.0 / 11.7 | 27.7 / 11.9 | 28.4 / 12.4 | 29.7 / 13.9 | 29.8 / 14.5 |
| CommonVoice17 | 12.0 / 4.5 | 26.3 / 10.9 | 14.4 / 5.4 | 16.7 / 6.6 | 15.6 / 8.2 |
| Fleurs-da_dk | 12.5 / 5.1 | 13.7 / 5.5 | 15.6 / 6.1 | 16.6 / 6.3 | 12.6 / 5.1 |
| AlvenirOss | 8.1 / 3.1 | 9.1 / 3.6 | 11.3 / 4.4 | 14.8 / 6.0 | 9.2 / 3.9 |
| AlvenirWiki | 6.5 / 2.4 | 7.2 / 2.7 | 8.0 / 3.0 | 7.9 / 3.0 | 7.5 / 2.8 |

OBS! The vocabulary used for training includes the numerals 0-9, which are translated to text in a post-processing step. If the model misses a space, consecutive digits are interpreted as a single number, which particularly affects the NST score, as that dataset contains many numerals.
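
As an illustration only (the exact post-processing used in the CoRal pipeline may differ), the digit-to-text step could look roughly like this, which breaks down as soon as a missed space merges several digits into one long number:

# Hypothetical sketch of the digit-to-text post-processing; not the actual CoRal implementation.
DIGIT_WORDS = {
    "0": "nul", "1": "en", "2": "to", "3": "tre", "4": "fire",
    "5": "fem", "6": "seks", "7": "syv", "8": "otte", "9": "ni",
}

def digit_to_word(token: str) -> str:
    if token.isdigit() and len(token) == 1:
        return DIGIT_WORDS[token]
    # Multi-digit tokens (e.g. "12" produced by a missed space) require full
    # number-to-word conversion, which is where the NST errors described above arise.
    return token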

Conversational CoRal-v2 Performance

The model was first evaluated on a tentative version of the CoRal-v2 conversation test set.

The results are tentative, as the test set only includes 5 unique speakers, of whom 4 are women. The test set includes 2 speakers with the 'Fynsk' dialect, 1 with 'Sønderjysk', 1 with a 'Non-native' accent and 1 with 'Nordjysk'.

Note that the high generalization error on conversation data for models trained on read-aloud data is still being analyzed.

| Model | Number of parameters | Finetuned on data of type | CoRal-v2::conversation CER | CoRal-v2::conversation WER |
|---|---|---|---|---|
| CoRal-project/roest-wav2vec2-2B-v2 (This model) | 2B | Read-aloud and conversation | 23.6% | 34.3% |
| CoRal-project/roest-wav2vec2-1B-v2 | 1B | Read-aloud and conversation | 23.9% | 36.7% |
| CoRal-project/roest-wav2vec2-315m-v2 | 315M | Read-aloud and conversation | 24.2% | 37.7% |
| CoRal-project/roest-whisper-large-v1 | 1540M | Read-aloud | 138% | 121% |
| CoRal-project/roest-wav2vec2-315m-v1 | 315M | Read-aloud | 123% | 80.5% |
| syvai/hviske-v2 | 1540M | Read-aloud | 78.2% | 72.6% |
| openai/whisper-large-v3 | 1540M | - | 46.4% | 57.4% |

Read-aloud CoRal-v1 Performance

| Model | Number of parameters | Finetuned on data of type | CoRal-v1 CER | CoRal-v1 WER |
|---|---|---|---|---|
| CoRal-project/roest-wav2vec2-2B-v2 (This model) | 2B | Read-aloud and conversation | 6.2% ± 0.2% | 16.0% ± 0.4% |
| CoRal-project/roest-wav2vec2-1B-v2 | 1B | Read-aloud and conversation | 6.5% ± 0.2% | 16.4% ± 0.4% |
| CoRal-project/roest-wav2vec2-315m-v2 | 315M | Read-aloud and conversation | 6.5% ± 0.2% | 16.3% ± 0.4% |
| CoRal-project/roest-whisper-large-v1 | 1540M | Read-aloud | 4.3% ± 0.2% | 10.4% ± 0.3% |
| CoRal-project/roest-wav2vec2-315M-v1 | 315M | Read-aloud | 6.6% ± 0.2% | 17.0% ± 0.4% |
| syvai/hviske-v2 | 1540M | Read-aloud | 4.7% ± 0.2% | 11.8% ± 0.3% |
| openai/whisper-large-v3 | 1540M | - | 11.4% ± 0.3% | 28.3% ± 0.6% |

OBS! The benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than reported in its model card.

Detailed CER scores in % across demographics on the CoRal-v1 read-aloud test data

| Category | whisper-large-v3 | hviske-v2 | røst-whisper-large-v1 | røst-wav2vec2-315m-v1 | røst-wav2vec2-315m-v2 | røst-wav2vec2-1B-v2 | røst-wav2vec2-2B-v2 |
|---|---|---|---|---|---|---|---|
| female | 12.3 | 5.4 | 5.1 | 7.4 | 7.2 | 7.3 | 7.2 |
| male | 10.6 | 4.1 | 3.6 | 5.8 | 5.7 | 5.8 | 5.3 |
| 0-25 | 9.1 | 3.8 | 3.4 | 5.4 | 5.3 | 5.1 | 4.7 |
| 25-50 | 11.4 | 4.7 | 4.0 | 6.2 | 6.0 | 5.7 | 5.3 |
| 50+ | 12.4 | 5.2 | 5.0 | 7.5 | 7.4 | 7.8 | 7.7 |
| Bornholmsk | 12.1 | 3.8 | 3.8 | 6.8 | 6.1 | 6.2 | 5.7 |
| Fynsk | 12.0 | 5.9 | 5.1 | 7.4 | 7.2 | 6.9 | 6.1 |
| Københavnsk | 5.6 | 2.1 | 1.9 | 3.3 | 3.2 | 3.0 | 2.6 |
| Non-native | 17.4 | 5.9 | 4.8 | 7.8 | 7.5 | 7.3 | 6.6 |
| Nordjysk | 4.7 | 1.5 | 1.6 | 2.6 | 2.8 | 2.6 | 2.3 |
| Sjællandsk | 8.0 | 3.3 | 3.0 | 4.4 | 4.5 | 3.9 | 3.8 |
| Sydømål | 7.7 | 4.3 | 4.1 | 6.4 | 6.4 | 6.5 | 5.8 |
| Sønderjysk | 20.0 | 9.4 | 8.8 | 11.9 | 11.6 | 12.6 | 13.3 |
| Vestjysk | 17.6 | 7.2 | 6.4 | 10.1 | 9.8 | 10.5 | 10.8 |
| Østjysk | 5.9 | 2.9 | 2.6 | 4.0 | 4.1 | 3.8 | 3.5 |
| Overall | 11.4 | 4.7 | 4.3 | 6.6 | 6.5 | 6.5 | 6.2 |

Detailed WER scores in % across demographics on the CoRal-v1 read-aloud test data

| Category | whisper-large-v3 | hviske-v2 | røst-whisper-large-v1 | røst-wav2vec2-315m-v1 | røst-wav2vec2-315m-v2 | røst-wav2vec2-1B-v2 | røst-wav2vec2-2B-v2 |
|---|---|---|---|---|---|---|---|
| female | 30.2 | 12.7 | 11.5 | 18.5 | 17.7 | 17.8 | 17.8 |
| male | 26.5 | 10.9 | 9.4 | 15.5 | 14.9 | 15.0 | 14.3 |
| 0-25 | 24.1 | 10.3 | 9.0 | 14.7 | 14.0 | 13.7 | 12.9 |
| 25-50 | 28.4 | 12.2 | 10.1 | 16.6 | 15.8 | 15.3 | 14.5 |
| 50+ | 30.0 | 12.1 | 11.3 | 18.2 | 17.7 | 18.5 | 18.7 |
| Bornholmsk | 31.6 | 10.4 | 9.8 | 17.7 | 15.7 | 16.4 | 15.3 |
| Fynsk | 29.3 | 14.3 | 12.1 | 18.3 | 17.7 | 16.7 | 15.2 |
| Københavnsk | 16.8 | 6.7 | 5.9 | 10.2 | 10.0 | 9.5 | 8.4 |
| Non-native | 40.9 | 15.4 | 12.2 | 20.9 | 19.4 | 19.4 | 18.1 |
| Nordjysk | 13.5 | 4.3 | 4.5 | 7.7 | 7.5 | 7.3 | 6.9 |
| Sjællandsk | 21.7 | 8.9 | 7.6 | 12.6 | 12.7 | 11.0 | 10.5 |
| Sydømål | 19.2 | 10.4 | 10.0 | 14.9 | 15.3 | 14.4 | 13.7 |
| Sønderjysk | 44.3 | 19.0 | 17.5 | 26.0 | 25.4 | 27.8 | 29.6 |
| Vestjysk | 42.0 | 17.7 | 15.0 | 26.3 | 25.2 | 26.7 | 28.3 |
| Østjysk | 16.9 | 8.2 | 7.5 | 11.7 | 11.3 | 10.8 | 10.1 |
| Overall | 28.3 | 11.8 | 10.4 | 17.0 | 16.3 | 16.4 | 16.0 |

Experiments with Røst-wav2vec2 with and without a language model

The inclusion of a post-processing language model can affect performance significantly. The Røst-v1 and Røst-v2 models use the same language model (LM), namely the one trained and used by CoRal-project/roest-wav2vec2-315m-v1.

| Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | CoRal CER | CoRal WER |
|---|---|---|---|---|---|
| CoRal-project/roest-wav2vec2-2B-v2 | 2B | Read-aloud and conversation | Yes | 6.2% ± 0.2% | 16.0% ± 0.4% |
| CoRal-project/roest-wav2vec2-2B-v2 | 2B | Read-aloud and conversation | No | 7.8% ± 0.2% | 23.0% ± 0.4% |
| CoRal-project/roest-wav2vec2-1B-v2 | 1B | Read-aloud and conversation | Yes | 6.5% ± 0.2% | 16.4% ± 0.4% |
| CoRal-project/roest-wav2vec2-1B-v2 | 1B | Read-aloud and conversation | No | 8.1% ± 0.2% | 23.9% ± 0.4% |
| CoRal-project/roest-wav2vec2-315M-v2 | 315M | Read-aloud and conversation | Yes | 6.5% ± 0.2% | 16.3% ± 0.4% |
| CoRal-project/roest-wav2vec2-315M-v2 | 315M | Read-aloud and conversation | No | 8.2% ± 0.2% | 25.1% ± 0.4% |
| CoRal-project/roest-wav2vec2-315m-v1 | 315M | Read-aloud | Yes | 6.6% ± 0.2% | 17.0% ± 0.4% |
| CoRal-project/roest-wav2vec2-315m-v1 | 315M | Read-aloud | No | 8.6% ± 0.2% | 26.3% ± 0.5% |
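
For reference, the "No" rows correspond to greedy CTC decoding, i.e. skipping the pyctcdecode beam search. A minimal sketch, reusing the processor, model and logits from the example in Model Details (illustrative only; not the exact evaluation harness):

# Without LM: greedy (argmax) CTC decoding of the raw logits
predicted_ids = torch.argmax(logits, dim=-1)
without_lm = processor.tokenizer.batch_decode(predicted_ids)[0]

# With LM: beam-search decoding rescored by the KenLM model, as in the "Yes" rows
with_lm = processor.batch_decode(logits.numpy()).text[0]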

Note on comparing whisper and wav2vec2 models

The Whisper models detailed in this model card exhibit significantly lower Character Error Rates (CER) and Word Error Rates (WER) than the Wav2Vec2 models on the CoRal-v1 read-aloud test data. Whisper utilizes a transformer-based encoder-decoder architecture with additional layers that enhance contextual understanding. In contrast, Wav2Vec2 models employ shorter context windows that focus on sound prediction. The Røst-Wav2Vec2 models incorporate a straightforward language model during post-processing, which corrects errors based on statistical language patterns. Introducing a more complex, contextual post-processing language model might enable a better comparison between these model types, which the CoRal project plans to explore in future releases.

The Røst-Whisper model excels in read-aloud data, leveraging its embedded contextual framework to achieve more robust recognition within this context. However, Wav2Vec2 models appear to generalize more effectively across various speech recognition tasks, whereas Whisper models incur higher error rates in conversational data. It’s important to note that the CoRal-v2 conversation dataset, being tentative and featuring limited speaker diversity, might influence these results.


Training curves


Creators and Funders

This model has been trained and the model card written by Marie Juhl Jørgensen at Alvenir.

The CoRal project is funded by the Danish Innovation Fund and is carried out by a consortium of partners.

We would specifically like to thank Dan Saattrup Nielsen (Alexandra Institute) for, among other things, the repository work, and Simon Leminen Madsen (Alexandra Institute) for the modelling work.

Citation

@misc{roest-wav2vec2-2B-v2,
  author    = {Marie Juhl Jørgensen and Søren Vejlgaard Holm and Martin Carsten Nielsen and Dan Saattrup Nielsen and Sif Bernstorff Lehmann and Simon Leminen Madsen and Torben Blach},
  title     = {Røst-wav2vec2-2B-v2: A Danish state-of-the-art speech recognition model trained on varied demographics and dialects},
  year      = {2025},
  url       = {https://huggingface.co/CoRal-project/roest-wav2vec2-2B-v2},
}