metadata
language:
  - ca
datasets:
  - projecte-aina/3catparla_asr
tags:
  - audio
  - automatic-speech-recognition
  - whisper-large-v3
  - barcelona-supercomputing-center
license: apache-2.0
model-index:
  - name: whisper-3cat-balearic
    results:
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: 3CatParla (Test)
          type: projecte-aina/3catparla_asr
          split: test
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 0.84
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Balearic fem)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Balearic female
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 5.61
      - task:
          name: Automatic Speech Recognition
          type: automatic-speech-recognition
        dataset:
          name: CV Benchmark Catalan Accents (Balearic male)
          type: projecte-aina/commonvoice_benchmark_catalan_accents
          split: Balearic male
          args:
            language: ca
        metrics:
          - name: WER
            type: wer
            value: 4.55
library_name: transformers
base_model:
  - openai/whisper-large-v3
metrics:
  - wer

whisper-3cat-balearic

Model Description

"BSC-LT/whisper-3cat-balearic" is an acoustic model for Automatic Speech Recognition in Balearic Catalan. It is the result of fine-tuning "openai/whisper-large-v3" on the "perfect_matches" split of the 3catparla_asr corpus, a dataset of manually transcribed Catalan TV broadcasts. This split contains 90 hours of speech data.

Intended Uses and Limitations

This model can be used for Automatic Speech Recognition (ASR) in Catalan, especially with Balearic-accented speech. It transcribes Catalan audio files to plain text without punctuation.
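Because the model produces plain text without punctuation, reference transcripts usually need comparable treatment before they are compared with its output. The sketch below shows one way to do that with the standard library only. The helper name `normalize_reference`, the set of characters it keeps, and the lowercasing step are all assumptions made for illustration; this is not the normalization behind the WER figures reported for this model.

```python
import re
import unicodedata

# Hypothetical helper for illustration; not part of this model or of transformers.
# Characters assumed to be meaningful inside Catalan words (apostrophe,
# middle dot as in "col·lecció", hyphen as in "dona-me-la").
KEEP = {"'", "·", "-"}

def normalize_reference(text: str) -> str:
    """Lowercase the text and replace punctuation (except KEEP) with spaces."""
    text = unicodedata.normalize("NFC", text).lower()
    chars = []
    for ch in text:
        is_punct = unicodedata.category(ch).startswith("P")
        chars.append(" " if is_punct and ch not in KEEP else ch)
    # Collapse the runs of whitespace left behind by removed punctuation.
    return re.sub(r"\s+", " ", "".join(chars)).strip()

print(normalize_reference("Bon dia, com estàs?"))
```

Keeping the apostrophe and the middle dot avoids splitting elided forms such as "d'art" or geminated "l·l" into separate words, which would otherwise inflate word-level error counts.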

Installation

To use this model, you may install datasets and transformers:

Create a virtual environment:

python -m venv /path/to/venv

Activate the environment:

source /path/to/venv/bin/activate

Install the modules:

pip install datasets transformers 

For Inference

To transcribe audio in Catalan using this model, you can follow this example:

#Install the prerequisites first (shell commands, run them outside Python):
#  pip install torch
#  pip install datasets
#  pip install 'transformers[torch]'
#  pip install evaluate
#  pip install jiwer
#This code requires a CUDA-capable GPU.

#Note (November 2024): load_metric has been removed from datasets;
#use load from the evaluate library instead.

import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

#Load the processor and model.
MODEL_NAME="BSC-LT/whisper-3cat-balearic"
processor = WhisperProcessor.from_pretrained(MODEL_NAME)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_NAME).to("cuda")

#Load the dataset
from datasets import load_dataset, Audio
ds=load_dataset("projecte-aina/parlament_parla",split='test')

#Downsample to 16 kHz
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))

#Process the dataset
def map_to_pred(batch):
    audio = batch["audio"]
    #Convert the raw waveform into log-Mel input features
    input_features = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt").input_features
    #'normalized_text' must be the name of the transcript column in the dataset you load
    batch["reference"] = processor.tokenizer._normalize(batch['normalized_text'])

    with torch.no_grad():
        predicted_ids = model.generate(input_features.to("cuda"))[0]

    #Decode the token IDs to text, dropping Whisper's special tokens
    transcription = processor.decode(predicted_ids, skip_special_tokens=True)
    batch["prediction"] = processor.tokenizer._normalize(transcription)

    return batch

#Do the evaluation
result = ds.map(map_to_pred)

#Compute the overall WER now.
from evaluate import load

wer = load("wer")
WER=100 * wer.compute(references=result["reference"], predictions=result["prediction"])
print(WER)
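
For readers unfamiliar with the metric: WER is the word-level edit distance between reference and hypothesis (substitutions, deletions, and insertions) divided by the number of reference words. The following is a minimal standard-library sketch for a single reference/hypothesis pair. `word_error_rate` is a hypothetical helper written for illustration; the `evaluate` metric used above aggregates errors over the whole test set instead of scoring one pair at a time.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER (in percent) of one hypothesis against one reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return 100.0 * prev[-1] / len(ref)

# One of four reference words is missing from the hypothesis.
print(word_error_rate("bon dia a tothom", "bon dia tothom"))  # 25.0
```

Because the denominator is the reference length, WER can exceed 100 when the hypothesis contains many inserted words.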

Training Details

Training data

The specific datasets used to create the model are:

  • Training: 3CatParla (soon to be published)
  • Validation: IB3 (soon to be published)

Training procedure

This model is the result of finetuning the model "openai/whisper-large-v3" by following this tutorial provided by Hugging Face.

Training Hyperparameters

  • language: Catalan (Balearic Accent)
  • hours of training audio: 90 hours
  • learning rate: 1e-6
  • sample rate: 16000
  • train batch size: 32
  • eval batch size: 32
  • num_train_epochs: 20
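
As a rough illustration of how the values above could be passed to a Hugging Face trainer, the sketch below maps them onto keyword names accepted by `Seq2SeqTrainingArguments` in transformers (`learning_rate`, `per_device_train_batch_size`, `per_device_eval_batch_size`, `num_train_epochs`). The helper `to_trainer_kwargs`, the `num_devices` parameter, and the reading of the listed batch sizes as totals across devices are assumptions; the card does not state the actual training script or device count.

```python
# Values copied from the hyperparameter list in this model card.
CARD_HPARAMS = {
    "learning_rate": 1e-6,
    "train_batch_size": 32,
    "eval_batch_size": 32,
    "num_train_epochs": 20,
}

def to_trainer_kwargs(hparams: dict, num_devices: int = 1) -> dict:
    """Hypothetical mapping from card values to TrainingArguments keywords.

    Assumes the listed batch sizes are totals split evenly across devices.
    """
    for key in ("train_batch_size", "eval_batch_size"):
        if hparams[key] % num_devices != 0:
            raise ValueError(f"{key} is not divisible by num_devices")
    return {
        "learning_rate": hparams["learning_rate"],
        "per_device_train_batch_size": hparams["train_batch_size"] // num_devices,
        "per_device_eval_batch_size": hparams["eval_batch_size"] // num_devices,
        "num_train_epochs": hparams["num_train_epochs"],
    }

print(to_trainer_kwargs(CARD_HPARAMS))
```

The resulting dictionary could then be unpacked as `Seq2SeqTrainingArguments(output_dir=..., **kwargs)`, with the output directory and any remaining options chosen by the user.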

Citation

If this model contributes to your research, please cite the work:

@misc{BSC2025-whisper3catbalearic,
      title={Recognition models for adaptation to Catalan variants}, 
      author={Hernandez Mena, Carlos Daniel and Messaoudi, Abir and Armentaro, Carme and España i Bonet, Cristina},
      organization={Barcelona Supercomputing Center},
      url={https://huggingface.co/BSC-LT/whisper-3cat-balearic},
      year={2025}
}

Additional Information

Author

The fine-tuning process was performed in June 2025 at the Language Technologies Laboratory of the Barcelona Supercomputing Center.

Contact

For further information, please email [email protected].

Copyright

Copyright (c) 2025 by Language Technologies Laboratory, Barcelona Supercomputing Center.

License

Apache-2.0

Funding

This work is funded by the Ministerio para la Transformación Digital y de la Función Pública - Funded by EU – NextGenerationEU within the framework of the project ILENIA with reference 2022/TL22/00215337.

The training of the model was possible thanks to the computing time provided by Barcelona Supercomputing Center through MareNostrum 5.

We acknowledge the EuroHPC Joint Undertaking for awarding us access to MareNostrum 5 at BSC, Spain.