	---
	language:
	- ca
	datasets:
	- projecte-aina/3catparla_asr
	- projecte-aina/corts_valencianes_asr_a
	tags:
	- audio
	- automatic-speech-recognition
	- whisper-large-v3
	- barcelona-supercomputing-center
	license: apache-2.0
	model-index:
	- name: whisper-3cat-cv21-valencian
	  results:
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: CV Benchmark Catalan Accents (Valencian fem)
	      type: projecte-aina/commonvoice_benchmark_catalan_accents
	      split: Valencian female
	      args:
	        language: ca
	    metrics:
	    - name: WER
	      type: wer
	      value:
	  - task:
	      name: Automatic Speech Recognition
	      type: automatic-speech-recognition
	    dataset:
	      name: CV Benchmark Catalan Accents (Valencian male)
	      type: projecte-aina/commonvoice_benchmark_catalan_accents
	      split: Valencian male
	      args:
	        language: ca
	    metrics:
	    - name: WER
	      type: wer
	      value:

	library_name: transformers
	base_model:
	- openai/whisper-large-v3
	metrics:
	- wer
	---
	# whisper-3cat-cv21-valencian

	## Table of Contents
	<details>
	<summary>Click to expand</summary>

	- [Model Description](#model-description)
	- [Intended Uses and Limitations](#intended-uses-and-limitations)
	- [How to Get Started with the Model](#how-to-get-started-with-the-model)
	- [Training Details](#training-details)
	- [Citation](#citation)
	- [Additional Information](#additional-information)

	</details>


	## Model Description

	"BSC-LT/whisper-3cat-cv21-valencian" is an acoustic model for Automatic Speech Recognition (ASR) in Valencian. It was obtained by fine-tuning ["openai/whisper-large-v3"](https://huggingface.co/openai/whisper-large-v3) on 256 hours from the "clean" and "other" training splits of the [Corts Valencianes](https://huggingface.co/datasets/projecte-aina/corts_valencianes_asr_a) dataset and 140 hours from the "train" and "validated" splits of [Common Voice v21](https://commonvoice.mozilla.org/es/datasets). From Common Voice, only the recordings labeled "valencian" (that is, recorded with a Valencian accent) were selected. In total, the training data amounts to 397 hours and 55 minutes.

	## Intended Uses and Limitations

	This model can be used for Automatic Speech Recognition (ASR) in Catalan, especially with the Valencian accent. It transcribes Catalan audio files to plain text without punctuation.
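
Because the output carries no punctuation, reference transcripts usually need the same treatment before they are compared with the model's predictions. The following is a minimal sketch of such a normalization step, not the procedure used by the authors: the helper name `strip_punctuation` and the choice to lowercase and to keep apostrophes, hyphens, and the Catalan middle dot (as in "col·lega") are assumptions made for illustration.

```python
import unicodedata

# Marks kept because they occur inside Catalan words (assumption):
# apostrophe (l'home), hyphen (anar-hi), and middle dot (col·lega).
_KEEP = "'-·"


def strip_punctuation(text: str) -> str:
    """Lowercase the text and replace punctuation with spaces."""
    text = unicodedata.normalize("NFC", text).lower()
    # Map the typographic apostrophe to the plain ASCII one.
    text = text.replace("’", "'")
    cleaned = "".join(
        ch if (ch.isalnum() or ch.isspace() or ch in _KEEP) else " "
        for ch in text
    )
    # Collapse runs of whitespace left behind by removed punctuation.
    return " ".join(cleaned.split())


print(strip_punctuation("Bon dia, col·lega! L'home ha arribat."))
```

Adjust the set of kept characters if your reference transcripts follow a different convention.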

	## How to Get Started with the Model

	<!--
	To see an updated and functional version of this code, please visit our [Notebook](https://colab.research.google.com/drive/1MHiPrffNTwiyWeUyMQvSdSbfkef_8aJC?usp=sharing)
	-->
	### Installation

	To use this model, install [datasets](https://huggingface.co/docs/datasets/installation) and [transformers](https://huggingface.co/docs/transformers/installation):

	Create a virtual environment:
	```bash
	python -m venv /path/to/venv
	```
	Activate the environment:
	```bash
	source /path/to/venv/bin/activate
	```
	Install the modules:
	```bash
	pip install datasets transformers
	```

	### For Inference
	To transcribe audio in Catalan using this model, you can follow this example:

	```bash
	#Install Prerequisites
	pip install torch
	pip install datasets
	pip install 'transformers[torch]'
	pip install evaluate
	pip install jiwer
	```

	```python
	# This code requires a CUDA-capable GPU.

	# Note (November 2024): load_metric is no longer part of datasets,
	# so this example uses evaluate.load instead.

	import torch
	from datasets import Audio, load_dataset
	from evaluate import load
	from transformers import WhisperForConditionalGeneration, WhisperProcessor

	# Load the processor and model.
	MODEL_NAME = "BSC-LT/whisper-3cat-cv21-valencian"
	processor = WhisperProcessor.from_pretrained(MODEL_NAME)
	model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME).to("cuda")

	# Load the dataset.
	ds = load_dataset("projecte-aina/parlament_parla", split="test")

	# Resample the audio to 16 kHz.
	ds = ds.cast_column("audio", Audio(sampling_rate=16_000))


	# Transcribe one example and store the normalized reference and prediction.
	def map_to_pred(batch):
	    audio = batch["audio"]
	    input_features = processor(
	        audio["array"],
	        sampling_rate=audio["sampling_rate"],
	        return_tensors="pt",
	    ).input_features
	    batch["reference"] = processor.tokenizer._normalize(batch["normalized_text"])

	    with torch.no_grad():
	        predicted_ids = model.generate(input_features.to("cuda"))[0]

	    transcription = processor.decode(predicted_ids)
	    batch["prediction"] = processor.tokenizer._normalize(transcription)

	    return batch


	# Run the evaluation.
	result = ds.map(map_to_pred)

	# Compute the overall WER (as a percentage).
	wer = load("wer")
	WER = 100 * wer.compute(references=result["reference"], predictions=result["prediction"])
	print(WER)
	```
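
To make the reported metric concrete, the sketch below computes a corpus-level word error rate in plain Python: the total number of word substitutions, deletions, and insertions divided by the total number of reference words. It is an illustrative reimplementation under that standard definition, not the code behind `evaluate`; the function names and the sample sentences are hypothetical.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(references, predictions):
    """Corpus-level WER: total word edits / total reference words."""
    total_edits = 0
    total_words = 0
    for ref, hyp in zip(references, predictions):
        ref_words, hyp_words = ref.split(), hyp.split()
        total_edits += edit_distance(ref_words, hyp_words)
        total_words += len(ref_words)
    return total_edits / total_words


# Made-up sentences, for illustration only.
references = ["bon dia a tots", "la sessió comença"]
predictions = ["bon dia tots", "la sessió comença ara"]
print(f"{100 * word_error_rate(references, predictions):.2f}")
```

Multiplying by 100, as in the evaluation script, expresses the rate as a percentage.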

	## Training Details

	### Training data

	The specific datasets used to create the model are:
	- Training: [Corts Valencianes](https://huggingface.co/datasets/projecte-aina/corts_valencianes_asr_a) (split clean=131 hours; split other=125 hours) and [Common Voice v21](https://commonvoice.mozilla.org/es/datasets) (split train=65 hours; split validated=75 hours).
	- Validation: [3CatParla](https://huggingface.co/datasets/projecte-aina/3catparla_asr) (split dev=4 hours and 28 minutes; soon to be published).

	### Training procedure

	This model is the result of fine-tuning ["openai/whisper-large-v3"](https://huggingface.co/openai/whisper-large-v3) following this [tutorial](https://huggingface.co/blog/fine-tune-whisper) provided by Hugging Face.

	### Training Hyperparameters

	* language: Catalan (Valencian Accent)
	* hours of training audio: 397 hours and 55 minutes
	* learning rate: 1e-5
	* sample rate: 16000
	* train batch size: 32
	* eval batch size: 32
	* num_train_epochs: 10
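
As a rough illustration, the values above could be collected into keyword arguments for `transformers.Seq2SeqTrainingArguments`, the class used in the tutorial linked in the previous section. This is a sketch, not the authors' training script: the output directory is hypothetical, and treating the listed batch sizes as per-device values is an assumption, since the list does not say how they were distributed across devices.

```python
# Hypothetical mapping of the listed hyperparameters onto keyword names
# accepted by transformers.Seq2SeqTrainingArguments.
training_kwargs = {
    "output_dir": "./whisper-3cat-cv21-valencian",  # assumed path
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 32,  # assumes the batch size is per device
    "per_device_eval_batch_size": 32,   # same assumption
    "num_train_epochs": 10,
}

# Audio is resampled to this rate before feature extraction.
SAMPLE_RATE = 16_000

for name, value in training_kwargs.items():
    print(f"{name}: {value}")
```

With transformers installed, the dictionary can be passed as `Seq2SeqTrainingArguments(**training_kwargs)`.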

	## Citation

	If this model contributes to your research, please cite the work:
	<!--
	```bibtex
	@inproceedings{hernandez20243catparla,
	title={3CatParla: A New Open-Source Corpus of Broadcast TV in Catalan for Automatic Speech Recognition},
	author={Hern{\'a}ndez Mena, Carlos Daniel and Armentano Oller, Carme and Solito, Sarah and K{\"u}lebi, Baybars},
	booktitle={Proc. IberSPEECH 2024},
	pages={176--180},
	year={2024}
	}
	```
	-->
	```bibtex
	@misc{BSC2025-whisper3catcv21valencian,
	  title={Recognition models for adaptation to Catalan variants},
	  author={Hernandez Mena, Carlos Daniel and Messaoudi, Abir and Armentano, Carme and España i Bonet, Cristina},
	  organization={Barcelona Supercomputing Center},
	  url={https://huggingface.co/BSC-LT/whisper-3cat-cv21-valencian},
	  year={2025}
	}
	```

	## Additional Information

	### Author

	The fine-tuning process was carried out in June 2025 at the [Language Technologies Laboratory](https://huggingface.co/BSC-LT) of the [Barcelona Supercomputing Center](https://www.bsc.es/).

	### Contact
	For further information, please email <[email protected]>.

	### Copyright
	Copyright (c) 2025 by Language Technologies Laboratory, Barcelona Supercomputing Center.

	### License

	[Apache-2.0](https://www.apache.org/licenses/LICENSE-2.0)

	### Funding
	This work is funded by the Ministerio para la Transformación Digital y de la Función Pública and by the EU through NextGenerationEU, within the framework of the ILENIA project (reference 2022/TL22/00215337).

	The training of the model was made possible by the computing time provided by the [Barcelona Supercomputing Center](https://www.bsc.es/) through MareNostrum 5.

	We acknowledge EuroHPC Joint Undertaking for awarding us access to MareNostrum 5 at BSC, Spain.