HiTZ/Aholab's Bilingual Basque Spanish Speech-to-Text model Conformer-Transducer for IBERSPEECH 2024's BBS-S2TC

Model Description

This model was specifically designed for a submission to the BBS-S2TC (Bilingual Basque Spanish Speech to Text Challenge) in the IBERSPEECH 2024 Albayzin evaluation challenges section. Training was tuned for good performance on the challenge's evaluation splits; therefore, performance on other splits is worse.

This model transcribes speech in the lowercase Spanish alphabet, including spaces, and was trained on a composite dataset comprising 1462 hours of Spanish and Basque speech. The model was fine-tuned from a pre-trained Basque stt_eu_conformer_transducer_large model using the NVIDIA NeMo toolkit. It is an autoregressive "large" variant of Conformer, with around 119 million parameters. See the model architecture section and NeMo documentation for complete architecture details.

Usage

To train, fine-tune or play with the model you will need to install NVIDIA NeMo. We recommend installing it after you have installed the latest PyTorch version.

pip install nemo_toolkit['all']

Transcribing using Python

Clone the repository to download the model:

git clone https://huggingface.co/HiTZ/BBS-S2TC_conformer_transducer_large

In the following example, NEMO_MODEL_FILEPATH is the path to the downloaded BBS-S2TC_conformer_transducer_large.nemo file.

import nemo.collections.asr as nemo_asr

# Load the model
asr_model = nemo_asr.models.EncDecRNNTBPEModel.restore_from(NEMO_MODEL_FILEPATH)

# Create a list of paths to the audio files (add as many as you need)
audio = ["audio_1.wav", "audio_2.wav"]

# Set batch_size to whatever number suits your purpose
batch_size = 8

# Transcribe the audio files
transcriptions = asr_model.transcribe(audio=audio, batch_size=batch_size)

# Visualize the transcriptions
print(transcriptions)
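
The value returned by transcribe may not be a plain list of strings in every NeMo release. The helper below is a minimal sketch for turning the result into strings. `extract_texts` is a hypothetical name and not part of NeMo. The three shapes it handles (a list of strings, a list of objects with a `.text` attribute, and a tuple whose first element is the best hypotheses) are assumptions, so check the behaviour of your installed version.

```python
from typing import Any, List


def extract_texts(results: Any) -> List[str]:
    """Return plain strings from a transcribe() result (hypothetical helper)."""
    # Assumption: some releases return a (best, all) tuple for transducer models.
    if isinstance(results, tuple):
        results = results[0]
    texts = []
    for item in results:
        # Assumption: each item is either a string or an object exposing `.text`.
        texts.append(item if isinstance(item, str) else getattr(item, "text", str(item)))
    return texts
```

With the snippet above, `extract_texts(transcriptions)` would give one string per input file, in the same order as the `audio` list.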

Input

This model accepts 16000 Hz (16 kHz) mono-channel audio (WAV files) as input.

Output

This model provides transcribed speech as a string for a given audio sample.

Model Architecture

The Conformer-Transducer model is an autoregressive variant of the Conformer model [1] for Automatic Speech Recognition that uses Transducer loss/decoding instead of CTC loss. You can find more details about this model here: Conformer-Transducer Model.

Training

Data preparation

This model was trained on the bilingual dataset basque_parliament, comprising 1462 hours of Spanish and Basque speech from the Basque Parliament's sessions.

Training procedure

This model was trained starting from the pre-trained Basque model stt_eu_conformer_transducer_large over several hundred epochs on a GPU device, using the NeMo toolkit [3]. The tokenizer for this model was built using the text transcripts of the train dataset with this script, with a total of 128 Spanish and Basque language tokens.

Performance

The performance of the ASR model is reported in terms of Word Error Rate (WER%) with greedy decoding in the following table.

Tokenizer	Vocabulary Size	MCV 18.0 Test ES	MCV 18.1 Test EU	Basque Parliament Test ES	Basque Parliament Test EU	Basque Parliament Test BI	MLS Test ES	VoxPopuli ES	Train Dataset
SentencePiece Unigram	128	14.52	7.22	2.18	3.8	4.51	7.84	10.29	Basque Parliament (1462 h)
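
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions and insertions) between a reference transcript and the model output, divided by the number of reference words. The sketch below is an illustrative pure-Python implementation for a single utterance. `word_error_rate` is a hypothetical name, and this is not necessarily the scoring code used to produce the figures in the table.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER in percent: word-level edit distance divided by reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref)
```

A corpus-level score is usually obtained by summing the edit counts over all utterances and dividing by the total number of reference words, rather than by averaging per-utterance values.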

Limitations

Since this model was trained on publicly available speech datasets, the performance of this model might degrade for speech which includes technical terms, or vernacular that the model has not been trained on. The model might also perform worse for accented speech.

Additional Information

Author

HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory, University of the Basque Country UPV/EHU.

Licensing Information

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Citation

If you use this model please cite:

BibTex

@inproceedings{herranz2024hitz,
  title     = {HiTZ-AhoLab ASR System for the Albayzin Bilingual Basque-Spanish Speech to Text Challenge},
  author    = {Herranz, Asier and Garc{\'i}a-Sebasti{\'a}n, Adri{\'a}n and Souganidis, Christoforos and Garc{\'i}a-Romillo, Victor and Bellanco, Aitor and Navas, Eva and Hern{\'a}ez-Rioja, Inma and Saratxaga, Ibon},
  booktitle = {Proceedings of IberSPEECH 2024},
  year      = {2024},
  address   = {Aveiro, Portugal},
  pages     = {315--318},
  doi       = {10.21437/IberSPEECH.2024-66}
}

Funding

This project, with reference 2022/TL22/00215335, has been partially funded by the Ministerio de Transformación Digital and by the Plan de Recuperación, Transformación y Resiliencia – Funded by the European Union – NextGenerationEU ILENIA and by the project IkerGaitu funded by the Basque Government. This model was trained at Hyperion, one of the high-performance computing (HPC) systems hosted by the DIPC Supercomputing Center.

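References

[1] Conformer: Convolution-augmented Transformer for Speech Recognition (arXiv:2005.08100)

[3] NVIDIA NeMo Toolkit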

Disclaimer

The models published in this repository are intended for a generalist purpose and are available to third parties. These models may have bias and/or any other undesirable distortions.

When third parties deploy or provide systems and/or services to other parties using any of these models (or using systems based on these models), or become users of the models, they should note that it is their responsibility to mitigate the risks arising from their use and, in any event, to comply with applicable regulations, including regulations regarding the use of Artificial Intelligence.

In no event shall the owner and creator of the models (HiTZ Basque Center for Language Technology - Aholab Signal Processing Laboratory, University of the Basque Country UPV/EHU.) be liable for any results arising from the use made by third parties of these models.
