	---
	datasets:
	- CoRal-dataset/coral-v2
	language:
	- da
	base_model:
	- facebook/wav2vec2-xls-r-300m
	metrics:
	- wer
	- cer
	license: openrail
	pipeline_tag: automatic-speech-recognition
	model-index:
	- name: roest-wav2vec2-315m-v2
	  results:
	  - task:
	      type: automatic-speech-recognition
	      name: Automatic Speech Recognition
	    dataset:
	      name: CoRal read-aloud
	      type: alexandrainst/coral
	      split: test
	      args: read_aloud
	    metrics:
	    - type: cer
	      value: 6.5% ± 0.2%
	      name: CER
	    - type: wer
	      value: 16.3% ± 0.4%
	      name: WER
	---

	This is a state-of-the-art Danish speech recognition model, trained by [Alvenir](https://www.alvenir.ai/) as part of the CoRal project.

	## Overview

	This repository contains the Wav2Vec2 model trained on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-dataset/coral-v2/tree/main). The CoRal-v2 dataset includes a rich variety of Danish conversational and read-aloud data, distributed across diverse age groups, genders, and dialects. The model is designed for automatic speech recognition (ASR).

	## Quick Start

	Start by installing the required libraries:

	```shell
	$ pip install transformers kenlm pyctcdecode
	```

	Next, load and run the model with the `transformers` Python package:

	```python
	>>> from transformers import pipeline
	>>> audio = get_audio()  # placeholder: supply your own 16 kHz raw audio array
	>>> transcriber = pipeline(task="automatic-speech-recognition", model="CoRal-dataset/roest-wav2vec2-315m-v2")
	>>> transcriber(audio)
	{'text': 'your transcription'}
	```
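The `get_audio()` call above is a placeholder for your own loading code. As a minimal sketch, assuming the audio is already in memory as a NumPy array at some other sampling rate, the hypothetical helper `to_mono_16k` below downmixes to mono and resamples to 16 kHz with plain linear interpolation. This is only an illustration: linear interpolation applies no anti-aliasing filter, so a dedicated resampler is preferable for real recordings.

```python
import numpy as np

TARGET_RATE = 16_000  # sampling rate the model expects, per the Quick Start comment


def to_mono_16k(samples: np.ndarray, rate: int) -> np.ndarray:
    """Return a float32 mono signal at 16 kHz (hypothetical helper, not part of the model repo)."""
    x = np.asarray(samples, dtype=np.float32)
    if x.ndim == 2:  # assumption: shape is (num_samples, num_channels)
        x = x.mean(axis=1)
    if rate == TARGET_RATE:
        return x
    n_out = int(round(x.shape[0] / rate * TARGET_RATE))
    t_in = np.arange(x.shape[0]) / rate      # timestamps of the input samples
    t_out = np.arange(n_out) / TARGET_RATE   # timestamps of the output samples
    return np.interp(t_out, t_in, x).astype(np.float32)


# Stand-in input: one second of a 440 Hz stereo tone sampled at 44.1 kHz.
t = np.arange(44_100) / 44_100
stereo = np.stack([np.sin(2 * np.pi * 440 * t)] * 2, axis=1)
audio = to_mono_16k(stereo, rate=44_100)
print(audio.shape, audio.dtype)
```

The resulting `audio` array has the shape and sampling rate that the Quick Start snippet assumes for `transcriber(audio)`.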

	## Model Details

	Wav2Vec2 is a state-of-the-art model architecture for speech recognition that leverages self-supervised learning from raw audio data. The pre-trained [Wav2Vec2-XLS-R-300M](https://huggingface.co/facebook/wav2vec2-xls-r-300m) model has been fine-tuned for automatic speech recognition on the [CoRal-v2 dataset](https://huggingface.co/datasets/CoRal-dataset/coral-v2/tree/main) to improve its recognition of Danish speech across different dialects. The model was trained for 30K steps using the training setup in the [CoRal repository](https://github.com/alexandrainst/coral/tree) by running:

	```
	python src/scripts/finetune_asr_model.py model=wav2vec2-small max_steps=30000 datasets.coral_conversation_internal.id=CoRal-dataset/coral-v2 datasets.coral_readaloud_internal.id=CoRal-dataset/coral-v2
	```
	The model is evaluated with a language model (LM) applied as post-processing. The LM is the one trained and used by [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m).

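To make the role of the LM concrete, it helps to see what decoding looks like without one. Wav2Vec2 ASR models emit per-frame scores over a character vocabulary, and the simplest decoder picks the best token per frame, merges repeats, and drops the CTC blank. The sketch below illustrates that greedy baseline on a toy example. The vocabulary, the blank index, and the helper names are assumptions made for illustration, not the model's actual tokenizer. An LM-based decoder (the Quick Start installs `kenlm` and `pyctcdecode` for this purpose) instead searches over several candidate sequences and scores them with the LM.

```python
from itertools import groupby

# Hypothetical toy vocabulary: index 0 plays the role of the CTC blank,
# and "|" marks word boundaries. The real model ships its own tokenizer.
VOCAB = ["<pad>", "|", "h", "e", "j"]
BLANK_ID = 0


def greedy_ctc_decode(logits):
    """Pick the best token per frame, merge consecutive repeats, drop blanks."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    collapsed = [token for token, _ in groupby(best)]
    chars = [VOCAB[t] for t in collapsed if t != BLANK_ID]
    return "".join(chars).replace("|", " ").strip()


def one_hot(index):
    """Build a fake score vector where `index` is the clear winner."""
    return [1.0 if k == index else 0.0 for k in range(len(VOCAB))]


# Frames spelling "h h <blank> e e j <blank> | h e j".
frames = [one_hot(i) for i in [2, 2, 0, 3, 3, 4, 0, 1, 2, 3, 4]]
print(greedy_ctc_decode(frames))  # prints "hej hej"
```

Greedy decoding only ever sees one frame at a time, which is why adding an LM that scores whole word sequences can change the error rates noticeably.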
	## Dataset

	### [CoRal-v2](https://huggingface.co/datasets/CoRal-dataset/coral-v2/tree/main)
	- Subsets:
	  - Conversation
	  - Read-aloud
	- Language: Danish.
	- Variation: Includes various dialects, age groups, and gender distinctions.

	### License
	Note that the dataset used is licensed under a custom license, adapted from OpenRAIL-M, which allows commercial use with a few restrictions (speech synthesis and biometric identification). See [license](https://huggingface.co/Alvenir/coral-1-whisper-large/blob/main/LICENSE).

	## Evaluation

	The model was evaluated using the following metrics:
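	- Word Error Rate (WER): The number of word-level substitutions, deletions, and insertions needed to turn the transcription into the reference, divided by the number of words in the reference.
	- Character Error Rate (CER): The same measure computed over characters instead of words.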
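
As a minimal sketch of how these two metrics can be computed, the snippet below implements the standard edit-distance definition in plain Python. The function names are chosen here for illustration, and the example sentence is made up. The evaluation behind the reported numbers may apply text normalization that this sketch does not, so it is not meant to reproduce the figures in this card.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "jeg hedder anna"
hypothesis = "jeg heder anna"  # one character missing in the second word
print(f"WER: {wer(reference, hypothesis):.1%}")  # 1 of 3 words wrong
print(f"CER: {cer(reference, hypothesis):.1%}")  # 1 of 15 characters wrong
```

A single missing character costs a whole word in WER but only one character in CER, which is why WER is consistently the larger of the two numbers in the tables below.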

	Note: the [CoRal test dataset](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) does not contain any conversation data. The model is trained on both read-aloud and conversation data, but is evaluated only on read-aloud data.


	| Model | Number of parameters | Finetuned on data of type | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
	| :--- | ---: | ---: | ---: | ---: |
	| [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | 6.5% ± 0.2% | 16.3% ± 0.4% |
	| [CoRal-dataset/roest-whisper-large-v2](https://huggingface.co/CoRal-dataset/roest-whisper-large) | 1540M | Read-aloud and conversation | 5.3% ± 0.2% | 12.0% ± 0.4% |
	| [Alvenir/coral-1-whisper-large](https://huggingface.co/Alvenir/coral-1-whisper-large) | 1540M | Read-aloud | 4.3% ± 0.2% | 10.4% ± 0.3% |
	| [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m) | 315M | Read-aloud | 6.6% ± 0.2% | 17.0% ± 0.4% |
	| [mhenrichsen/hviske-v2](https://huggingface.co/syvai/hviske-v2) | 1540M | Read-aloud | 4.7% ± 0.2% | 11.8% ± 0.3% |
	| [openai/whisper-large-v3](https://hf.co/openai/whisper-large-v3) | 1540M | - | 11.4% ± 0.3% | 28.3% ± 0.6% |

	Note: the benchmark for hviske-v2 has been re-evaluated, and the confidence interval is larger than the one reported in its model card.

	### Detailed evaluation across demographics on the CoRal test data
	<img src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/images/wer.png">

	<img src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/images/cer.png">

	### CER scores (%) across demographics on the CoRal test data

	| Category | roest-wav2vec2-315m-v2 | roest-315m | roest-whisper-large-v2 | coral-1-whisper-large |
	|:---:|:---:|:---:|:---:|:---:|
	| female | 7.2 | 7.4 | 6.9 | 5.1 |
	| male | 5.7 | 5.8 | 3.7 | 3.6 |
	| 0-25 | 5.3 | 5.4 | 3.3 | 3.4 |
	| 25-50 | 6.0 | 6.2 | 6.5 | 4.0 |
	| 50+ | 7.4 | 7.5 | 5.1 | 5.0 |
	| Bornholmsk | 6.1 | 6.8 | 3.4 | 3.8 |
	| Fynsk | 7.2 | 7.4 | 13.8 | 5.1 |
	| Københavnsk | 3.2 | 3.3 | 2.1 | 1.9 |
	| Non-native | 7.5 | 7.8 | 4.9 | 4.8 |
	| Nordjysk | 2.8 | 2.6 | 1.7 | 1.6 |
	| Sjællandsk | 4.5 | 4.4 | 2.9 | 3.0 |
	| Sydømål | 6.4 | 6.4 | 4.1 | 4.1 |
	| Sønderjysk | 11.6 | 11.9 | 8.8 | 8.8 |
	| Vestjysk | 9.8 | 10.1 | 6.9 | 6.4 |
	| Østjysk | 4.1 | 4.0 | 2.8 | 2.6 |
	| Overall | 6.5 | 6.6 | 5.3 | 4.3 |

	### WER scores (%) across demographics on the CoRal test data

	| Category | roest-wav2vec2-315m-v2 | roest-315m | roest-whisper-large-v2 | coral-1-whisper-large |
	|:---:|:---:|:---:|:---:|:---:|
	| female | 17.7 | 18.5 | 14.2 | 11.5 |
	| male | 14.9 | 15.5 | 9.9 | 9.4 |
	| 0-25 | 14.0 | 14.7 | 9.0 | 9.0 |
	| 25-50 | 15.8 | 16.6 | 14.1 | 10.1 |
	| 50+ | 17.7 | 18.2 | 11.5 | 11.3 |
	| Bornholmsk | 15.7 | 17.7 | 9.3 | 9.8 |
	| Fynsk | 17.7 | 18.3 | 24.9 | 12.1 |
	| Københavnsk | 10.0 | 10.2 | 6.7 | 5.9 |
	| Non-native | 19.4 | 20.9 | 13.0 | 12.2 |
	| Nordjysk | 7.5 | 7.7 | 4.9 | 4.5 |
	| Sjællandsk | 12.7 | 12.6 | 7.5 | 7.6 |
	| Sydømål | 15.3 | 14.9 | 10.3 | 10.0 |
	| Sønderjysk | 25.4 | 26.0 | 17.4 | 17.5 |
	| Vestjysk | 25.2 | 26.3 | 16.3 | 15.0 |
	| Østjysk | 11.3 | 11.7 | 8.0 | 7.5 |
	| Overall | 16.3 | 17.0 | 12.0 | 10.4 |


	### Roest-wav2vec2-315M with and without language model
	Including a post-processing language model (LM) can affect performance significantly. The Roest-v1 and Roest-v2 models use the same LM, namely the one trained and used by [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m).

	| Model | Number of parameters | Finetuned on data of type | Postprocessed with Language Model | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) CER | [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) WER |
	| :--- | ---: | ---: | ---: | ---: | ---: |
	| [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | Yes | 6.5% ± 0.2% | 16.3% ± 0.4% |
	| [CoRal-dataset/roest-wav2vec2-315M-v2](https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2) | 315M | Read-aloud and conversation | No | 8.2% ± 0.2% | 25.1% ± 0.4% |
	| [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m) | 315M | Read-aloud | Yes | 6.6% ± 0.2% | 17.0% ± 0.4% |
	| [alexandrainst/roest-315m](https://huggingface.co/alexandrainst/roest-315m) | 315M | Read-aloud | No | 8.6% ± 0.2% | 26.3% ± 0.5% |
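
To put the effect of the LM in relative terms, the short sketch below computes the relative WER reduction from the point estimates in the table above. It ignores the confidence intervals, and the dictionary layout and function name are chosen here purely for illustration.

```python
# WER point estimates (%) copied from the table above.
wer_scores = {
    "roest-wav2vec2-315m-v2": {"without_lm": 25.1, "with_lm": 16.3},
    "roest-315m": {"without_lm": 26.3, "with_lm": 17.0},
}


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error rate removed by the change."""
    return (before - after) / before


for model, scores in wer_scores.items():
    gain = relative_reduction(scores["without_lm"], scores["with_lm"])
    print(f"{model}: {gain:.1%} relative WER reduction with LM")
```

For both models the LM removes roughly a third of the word errors (about 35% relative), which matches the statement that post-processing can affect performance significantly.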

	### Detailed Roest-wav2vec2-315M with and without language model on different dialects
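	Results on the different Danish dialects in the test set, with and without the LM. Judging by the matching figures in the tables above, Roest-1 is roest-315m and Roest-2 is roest-wav2vec2-315m-v2: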

	| Dialect | Roest-1, no LM, CER (%) | Roest-1, no LM, WER (%) | Roest-1, with LM, CER (%) | Roest-1, with LM, WER (%) | Roest-2, no LM, CER (%) | Roest-2, no LM, WER (%) | Roest-2, with LM, CER (%) | Roest-2, with LM, WER (%) |
	|---|---|---|---|---|---|---|---|---|
	| Vestjysk | 12.7 | 37.1 | 10.1 | 26.3 | 12.2 | 36.3 | 9.82 | 25.2 |
	| Sønderjysk | 14.7 | 37.8 | 11.9 | 26.0 | 14.2 | 36.2 | 11.6 | 25.4 |
	| Bornholmsk | 9.32 | 29.9 | 6.79 | 17.7 | 8.08 | 26.7 | 6.12 | 15.7 |
	| Østjysk | 5.51 | 18.7 | 3.97 | 11.7 | 5.39 | 18.0 | 4.06 | 11.3 |
	| Nordjysk | 3.86 | 13.6 | 2.57 | 7.72 | 3.80 | 13.5 | 2.75 | 7.51 |
	| Københavnsk | 5.27 | 18.8 | 3.31 | 10.2 | 5.02 | 17.7 | 3.20 | 9.98 |
	| Fynsk | 9.41 | 28.6 | 7.43 | 18.3 | 8.86 | 27.0 | 7.20 | 17.7 |
	| Non-native | 10.6 | 33.2 | 7.84 | 20.9 | 10.0 | 31.6 | 7.46 | 19.4 |
	| Sjællandsk | 5.82 | 19.5 | 4.44 | 12.6 | 5.70 | 18.6 | 4.48 | 12.7 |
	| Sydømål | 7.09 | 20.7 | 6.38 | 14.9 | 6.96 | 20.4 | 6.44 | 15.3 |

	### Performance on Other Datasets

	The model was also tested against other datasets to evaluate generalizability:

	| Evaluation Dataset | Roest-wav2vec2-315M-v1 WER (%) | Roest-wav2vec2-315M-v1 CER (%) | Roest-wav2vec2-315M-v2 WER (%) | Roest-wav2vec2-315M-v2 CER (%) |
	| :--- | ---: | ---: | ---: | ---: |
	| [CoRal](https://huggingface.co/datasets/alexandrainst/coral/viewer/read_aloud/test) | 17.0 | 6.6 | 16.3 | 6.5 |
	| [NST-da](https://huggingface.co/datasets/alexandrainst/nst-da) | 29.7 | 13.9 | 26.1 | 11.9 |
	| [CommonVoice17](https://huggingface.co/datasets/mozilla-foundation/common_voice_17_0) | 16.7 | 6.6 | 14.4 | 5.4 |
	| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) | 27.3 | 7.9 | 26.4 | 7.7 |
	| [Fleurs-da_dk](https://huggingface.co/datasets/google/fleurs) Normed | 16.6 | 6.3 | 15.6 | 6.1 |

	## Training curves
	<img src="https://huggingface.co/CoRal-dataset/roest-wav2vec2-315m-v2/resolve/main/images/training_plots.png">

	## Creators and Funders
	This model was trained, and the model card written, by Marie Juhl Jørgensen and Søren Vejlgaard Holm at [Alvenir](https://www.alvenir.ai/).

	The CoRal project is funded by the [Danish Innovation Fund](https://innovationsfonden.dk/) and consists of the following partners:

	- [Alexandra Institute](https://alexandra.dk/)
	- [University of Copenhagen](https://www.ku.dk/)
	- [Agency for Digital Government](https://digst.dk/)
	- [Alvenir](https://www.alvenir.ai/)
	- [Corti](https://www.corti.ai/)

	We would specifically like to thank Dan Saattrup Nielsen (Alexandra Institute) for, among other things, the repository work, and Simon Leminen Madsen (Alexandra Institute) for the modelling work.