|
--- |
|
library_name: transformers |
|
license: other |
|
license_name: meralion-public-license |
|
license_link: https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf |
|
tags: |
|
- speech |
|
- best-rq |
|
- meralion |
|
language: |
|
- en |
|
--- |
|
|
|
# MERaLiON-SpeechEncoder-v1 |
|
|
|
The MERaLiON-SpeechEncoder is a speech foundation model designed to support a wide range of downstream speech applications, such as speech recognition, intent classification and speaker identification, among others. This version was trained on **200,000 hours of predominantly English data including 10,000 hours of Singapore-based speech** to cater to the speech processing needs of Singapore and beyond. Gradual support for other languages, starting with the major Southeast Asian ones, is planned for subsequent releases.
|
|
|
- **Developed by:** I<sup>2</sup>R, A\*STAR |
|
- **Model type:** Speech Encoder |
|
- **Language(s):** Primarily English (Global & Singapore) |
|
- **License:** [MERaLiON Public License](https://huggingface.co/MERaLiON/MERaLiON-AudioLLM-Whisper-SEA-LION/blob/main/MERaLiON-Public-Licence-v1.pdf) |
|
|
|
For details on background, pre-training, tuning experiments and evaluation, please refer to our [technical report](https://arxiv.org/abs/2412.11538). |
|
|
|
## Acknowledgement |
|
This research is supported by the National Research Foundation, Singapore and Infocomm Media Development Authority, Singapore under its National Large Language Models Funding Initiative. |
|
|
|
## Model Description |
|
|
|
<img src="bestrq_model_training.png" alt="model_architecture" width="400" style="display:block; margin-left:auto; margin-right:auto"/>
|
|
|
MERaLiON-SpeechEncoder was pre-trained from scratch with a self-supervised learning approach using a **BERT-based speech pre-training with random-projection quantizer (BEST-RQ)** objective. Analogous to BERT's masked language modelling criterion for text, this entails predicting the correct discrete label from a codebook over the masked frames of an input speech signal. MERaLiON-SpeechEncoder-v1 contains approximately 630M parameters.
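To make the objective concrete, the sketch below illustrates how a random-projection quantizer assigns discrete labels to speech frames. It is a minimal illustration only; the projection dimension, codebook size and variable names are assumptions for the example and do not reflect the released model's configuration.

```python
import torch
import torch.nn.functional as F

# Illustrative sizes only; these are not the released model's configuration.
feature_dim = 80      # mel-spectrogram frame dimension
proj_dim = 16         # random projection dimension
codebook_size = 8192  # number of discrete labels

# The projection matrix and codebook are randomly initialised and kept frozen.
projection = torch.randn(feature_dim, proj_dim)
codebook = F.normalize(torch.randn(codebook_size, proj_dim), dim=-1)

def quantize(frames: torch.Tensor) -> torch.Tensor:
    """Map (time, feature_dim) speech frames to discrete codebook indices."""
    projected = F.normalize(frames @ projection, dim=-1)
    # Nearest codebook entry on the unit sphere (argmax of cosine similarity).
    return (projected @ codebook.T).argmax(dim=-1)

# During pre-training, frames are masked at the encoder input and the model is
# trained with cross-entropy to predict these indices at the masked positions.
labels = quantize(torch.randn(100, feature_dim))  # 100 frames -> 100 labels
```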
|
|
|
The model takes speech as input in the form of mel-spectrograms and returns compressed latent features, which can then be passed to a task-specific downstream model relevant to the user's application. Note that the model provided here is the base foundation model; the user has to fine-tune it with task-specific data to obtain a complete inference pipeline. We provide some examples below to get started.
|
|
|
## Capabilities |
|
|
|
We have evaluated the MERaLiON-SpeechEncoder extensively on several speech recognition datasets, and fine-tuned the model on ten different tasks encompassing the [SUPERB](https://superbbenchmark.org/) benchmark: `automatic speech recognition` (ASR), `automatic phoneme recognition` (PR), `keyword spotting` (KS), `query by example spoken term detection` (QbE), `intent classification` (IC), `slot filling` (SF), `speaker identification` (SID), `automatic speaker verification` (ASV), `speaker diarization` (SD), and `emotion recognition` (ER). Our evaluation demonstrates improvements on spontaneous and Singapore speech recognition benchmarks, while remaining competitive with other state-of-the-art speech encoders such as WavLM and HuBERT across SUPERB tasks.
|
|
|
This version of the MERaLiON-SpeechEncoder is specifically tailored for English, both global and Singapore-specific, including Singlish. Although the encoder was trained on a portion of multilingual data, its performance on other languages has not been substantially evaluated.
|
|
|
We provide a code snippet below for retrieving latent features directly from the model, followed by an example of how to set up the model for ASR fine-tuning. Speech input should be sampled at 16 kHz.
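If your audio is stored at a different sampling rate, the `datasets` library can resample it on the fly when decoding. A minimal sketch, using the same public dataset as the examples below (substitute your own data and column name as needed):

```python
from datasets import load_dataset, Audio

# ensure decoded audio is returned at 16 kHz
data = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
data = data.cast_column("audio", Audio(sampling_rate=16_000))
```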
|
|
|
### Get Features |
|
|
|
```python |
|
import torch |
|
from datasets import load_dataset |
|
from transformers import AutoModel, AutoFeatureExtractor |
|
|
|
repo_id = 'MERaLiON/MERaLiON-SpeechEncoder-v1' |
|
device = 'cuda' if torch.cuda.is_available() else 'cpu' |
|
|
|
# load model and feature extractor |
|
model = AutoModel.from_pretrained( |
|
repo_id, |
|
trust_remote_code=True, |
|
) |
|
model = model.to(device) |
|
|
|
feature_extractor = AutoFeatureExtractor.from_pretrained( |
|
repo_id, |
|
trust_remote_code=True |
|
) |
|
|
|
# prepare data |
|
data = load_dataset("distil-whisper/librispeech_long", "clean", |
|
split="validation") |
|
|
|
def batch_collater(data): |
|
tensors = [] |
|
    for sample in data:
|
tensors.append(sample['audio']['array']) |
|
return tensors |
|
|
|
audio_array = batch_collater(data) |
|
inputs = feature_extractor(audio_array, sampling_rate=16_000, |
|
return_attention_mask=True, |
|
return_tensors='pt', do_normalize=False) |
|
input_values = inputs['input_values'] |
|
input_lengths = torch.sum(inputs['attention_mask'], dim=-1) |
|
|
|
input_values, input_lengths = input_values.to(device), input_lengths.to(device) |
|
|
|
# model inference to obtain features |
|
with torch.no_grad(): |
|
model.eval() |
|
output = model(input_values=input_values, |
|
input_lengths=input_lengths, |
|
output_hidden_states=True) |
|
``` |
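Assuming the model output follows the standard Hugging Face convention of exposing `last_hidden_state` and, with `output_hidden_states=True`, a `hidden_states` tuple (attribute names may differ slightly for this remote-code model), the latent features can then be read off the output object:

```python
# final-layer latent features: (batch, frames, hidden_dim)
features = output.last_hidden_state

# per-layer features, useful for layer-wise probing on downstream tasks
all_layers = output.hidden_states
print(features.shape, len(all_layers))
```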
|
|
|
### Downstream Use |
|
|
|
```python |
|
import torch |
|
import json |
|
from datasets import load_dataset |
|
from transformers import AutoModelForCTC, AutoFeatureExtractor, Wav2Vec2CTCTokenizer |
|
|
|
repo_id = 'MERaLiON/MERaLiON-SpeechEncoder-v1' |
|
device = 'cuda' if torch.cuda.is_available() else 'cpu' |
|
|
|
# prepare data |
|
def pre_processing(batch): |
|
batch["text"] = batch["text"].lower() |
|
return batch |
|
|
|
def extract_all_chars(batch): |
|
all_text = " ".join(batch["text"]) |
|
vocab = list(set(all_text)) |
|
return {"vocab": [vocab], "all_text": [all_text]} |
|
|
|
librispeech100h_train = load_dataset("openslr/librispeech_asr", split="train.clean.100") |
|
librispeech100h_test = load_dataset("openslr/librispeech_asr", split="validation.clean") |
|
librispeech100h_train = librispeech100h_train.remove_columns( |
|
['file', 'speaker_id', 'chapter_id', 'id']) |
|
librispeech100h_test = librispeech100h_test.remove_columns( |
|
['file', 'speaker_id', 'chapter_id', 'id']) |
|
|
|
librispeech100h_train = librispeech100h_train.map(pre_processing) |
|
librispeech100h_test = librispeech100h_test.map(pre_processing) |
|
|
|
vocab_train = librispeech100h_train.map(extract_all_chars, batched=True, |
|
batch_size=-1, keep_in_memory=True, |
|
remove_columns=librispeech100h_train.column_names) |
|
vocab_test = librispeech100h_test.map(extract_all_chars, batched=True, |
|
batch_size=-1, keep_in_memory=True, |
|
remove_columns=librispeech100h_test.column_names) |
|
vocab_list = list(set(vocab_train["vocab"][0]) | set(vocab_test["vocab"][0])) |
|
vocab_dict = {v: k for k, v in enumerate(sorted(vocab_list))} |
|
|
|
vocab_dict["|"] = vocab_dict[" "] |
|
del vocab_dict[" "] |
|
vocab_dict["[UNK]"] = len(vocab_dict) |
|
vocab_dict["[PAD]"] = len(vocab_dict) |
|
|
|
with open('ls_vocab.json', 'w') as vocab_file: |
|
json.dump(vocab_dict, vocab_file) |
|
|
|
# load model, feature extractor and tokenizer |
|
feature_extractor = AutoFeatureExtractor.from_pretrained( |
|
repo_id, |
|
trust_remote_code = True, |
|
) |
|
|
|
tokenizer = Wav2Vec2CTCTokenizer("./ls_vocab.json", |
|
unk_token="[UNK]", pad_token="[PAD]", |
|
word_delimiter_token="|") |
|
|
|
model = AutoModelForCTC.from_pretrained( |
|
repo_id, |
|
trust_remote_code=True, |
|
vocab_size=len(vocab_dict), |
|
feat_proj_dropout=0.1, |
|
activation_dropout=0.1, |
|
hidden_dropout=0.1, |
|
conformer_conv_dropout=0.1, |
|
ctc_loss_reduction="mean", |
|
pad_token_id=tokenizer.pad_token_id, |
|
attention_dropout=0.1, |
|
) |
|
model = model.to(device) |
|
``` |
|
Please refer to this [blog](https://huggingface.co/blog/fine-tune-w2v2-bert) for a full ASR fine-tuning recipe with the Hugging Face Trainer. Alternatively, the Hugging Face model can be loaded into other frameworks such as PyTorch or ESPnet for custom fine-tuning loops.
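As a rough outline of what such a recipe involves, the sketch below maps the dataset prepared above to `input_values`/`labels` pairs and wires a CTC data collator into the Trainer. It is a minimal sketch adapted from the blog's approach, under the assumption that this model's feature extractor follows the standard `input_values` convention shown in the earlier example; the training arguments are placeholders, not tuned values.

```python
import torch
from transformers import Trainer, TrainingArguments

def prepare_dataset(batch):
    # convert raw audio to model inputs and transcripts to label ids
    audio = batch["audio"]
    batch["input_values"] = feature_extractor(
        audio["array"], sampling_rate=audio["sampling_rate"]).input_values[0]
    batch["labels"] = tokenizer(batch["text"]).input_ids
    return batch

librispeech100h_train = librispeech100h_train.map(
    prepare_dataset, remove_columns=librispeech100h_train.column_names)
librispeech100h_test = librispeech100h_test.map(
    prepare_dataset, remove_columns=librispeech100h_test.column_names)

class DataCollatorCTCWithPadding:
    # pads audio inputs and label sequences independently within a batch
    def __call__(self, features):
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]
        batch = feature_extractor.pad(input_features, padding=True, return_tensors="pt")
        labels_batch = tokenizer.pad(label_features, padding=True, return_tensors="pt")
        # replace label padding with -100 so it is ignored by the CTC loss
        batch["labels"] = labels_batch["input_ids"].masked_fill(
            labels_batch["attention_mask"].ne(1), -100)
        return batch

training_args = TrainingArguments(
    output_dir="meralion-speechencoder-asr-ft",  # placeholder values
    per_device_train_batch_size=8,
    num_train_epochs=3,
    fp16=torch.cuda.is_available(),
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=DataCollatorCTCWithPadding(),
    train_dataset=librispeech100h_train,
    eval_dataset=librispeech100h_test,
)
# trainer.train()
```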
|
|
|
## Technical Specifications |
|
|
|
### Training Data |
|
|
|
MERaLiON-SpeechEncoder has been trained on a diverse set of unsupervised speech data, primarily in English. Our collection is curated from various publicly available datasets and covers a wide range of conditions, encompassing factors such as domain, style, speaker, gender, and accent. The combined dataset comprises around 170,000 hours of English, including 10,000 hours of Singapore-based English that incorporates code-switching, plus an additional 30,000 hours of multilingual speech from 113 languages, totalling 200,000 hours. Consult our technical report for the full breakdown.
|
|
|
### Training Procedure and Compute |
|
|
|
MERaLiON-SpeechEncoder was trained in two phases: initially on a 60,000-hour subset of the data, followed by continued pre-training on the full 200,000-hour dataset using this prior checkpoint as initialisation. The initial model was trained on the **ASPIRE 2A** Supercomputer Cluster provided by the **National Supercomputing Centre (NSCC)** for 325K steps on 12 Nvidia A100 40GB GPUs. The full pre-training run was carried out on the **LUMI** Supercomputer Cluster with 128 AMD MI250x GPUs for a further 382K steps, taking about 25 days of active GPU time.
|
|
|
## Citation |
|
|
|
If you find our work useful, please cite our technical report: |
|
|
|
``` |
|
@misc{huzaifah2024speechfoundationmodelsingapore, |
|
title={MERaLiON-SpeechEncoder: Towards a Speech Foundation Model for Singapore and Beyond}, |
|
author={{MERaLiON Team}}, |
|
year={2024}, |
|
eprint={2412.11538}, |
|
archivePrefix={arXiv}, |
|
primaryClass={cs.CL}, |
|
url={https://arxiv.org/abs/2412.11538}, |
|
} |
|
``` |