WavLM Base+ French Italian Phonemizer
WARNING: this is an early work. The model training is not finished (better performance is to be expected), and everything was made with PyTorch, with little integration into Hugging Face. Pull requests, comments, and discussions are welcome!
This is a phonemization model that works for both French and Italian. Given an audio file, it outputs the phonemes heard, written in IPA. It does not use a language model, so it is unlikely to try to map audio onto existing words.
Model Details
- Developed by: HugoFara
- Funded by: NCCR Evolving Language
The training was conducted as part of the NCCR Evolving Language group, a Swiss research institute on language.
Uses
The model works with French and Italian audio. Currently, everything is managed through PyTorch. Let's transcribe an audio sample with the following code.
"""
Simple demonstration.
See main.py for a more complete demonstration.
"""
import json
import torch
import torchaudio
import transformers
import phoneme_recognizer
# Load the model with weights
with open("vocab.json", "r") as file:
phonemes_dict = json.load(file)
model = phoneme_recognizer.PhonemeRecognizer(phonemes_dict=phonemes_dict)
checkpoint = torch.load("model.pth")
model.load_state_dict(checkpoint)
# Prepare the input data
SAMPLING_RATE = 16_000
audio_array, frequency = torchaudio.load("audio-samples/tsenkher-fr.wav")
if frequency != SAMPLING_RATE:
raise ValueError(f"Input audio frequency should be {SAMPLING_RATE} Hz, it it {frequency} Hz.")
feature_extractor = transformers.AutoFeatureExtractor.from_pretrained(
"microsoft/wavlm-base-plus"
)
inputs = feature_extractor(
audio_array.squeeze(),
sampling_rate=SAMPLING_RATE,
padding=True,
return_tensors="pt",
)
inputs["language"] = "fr" # or "it"
# Do inference
with torch.no_grad():
logits = model(**inputs)
prediction = model.classify_to_phonemes(logits)[0]
print("Final phonemes are:", "".join(prediction))
# Should output: "sakapitalɛtsɑ̃kɛʁ"
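Note that the snippet above rejects audio that is not sampled at 16 kHz. If needed, you can resample beforehand with torchaudio instead of raising an error; a minimal sketch, reusing the sample file above:

import torchaudio
import torchaudio.functional as F

audio_array, frequency = torchaudio.load("audio-samples/tsenkher-fr.wav")
if frequency != 16_000:
    # Resample to the 16 kHz rate expected by WavLM Base+.
    audio_array = F.resample(audio_array, orig_freq=frequency, new_freq=16_000)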
Intended public
This model is mainly intended for clinicians who need audio transcriptions for a great volume of data. As the training was conducted on adult voices, it has the same speech recognition biases as for "normal" adult voices, which means it tends to correct accents as long as they are widespread.
Do not use this model for any harmful purpose.
Training Details
Training Data
The dataset was adapted from Common Voice 17.0, French and Italian versions. To get an IPA representation of the sentences, a text-to-phoneme (grapheme-to-phoneme) model was used: charsiu/g2p_multilingual_byT5_small_100. The language of each sample (either French or Italian) was also saved as a dataset feature.
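For reference, here is a minimal sketch of running that grapheme-to-phoneme model with transformers. The usage pattern follows the Charsiu documentation; the language prefixes (assumed here to be <fra> and <ita>) should be checked against the list of codes supported by the model:

from transformers import AutoTokenizer, T5ForConditionalGeneration

g2p = T5ForConditionalGeneration.from_pretrained(
    "charsiu/g2p_multilingual_byT5_small_100"
)
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

# One word per entry, each prefixed with its (assumed) language code.
words = ["<fra>: bonjour", "<ita>: ciao"]
encoded = tokenizer(words, padding=True, add_special_tokens=False, return_tensors="pt")
predictions = g2p.generate(**encoded, num_beams=1, max_length=50)
print(tokenizer.batch_decode(predictions, skip_special_tokens=True))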
Training Procedure
Only the training split of Common Voice 17.0 is used during training.
First, only the linear classifier is trained: we freeze the weights of both the feature encoder and the transformer. We use a tri-state linear warm-up for simplicity. The loss is a CTC loss, and the evaluation metric is the Phoneme Error Rate (PER). Once the PER decreases below 60%, this initial training stops. Due to the size of the dataset, one epoch is enough.
For the second phase of training, we unfreeze the transformer. We restart the same training procedure, a tri-state linear warm-up, from scratch. At the time of writing, the model has only completed a single epoch of training.
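As an illustration, such a tri-state schedule can be written as a LambdaLR multiplier: linear warm-up, constant plateau, then linear decay. All step counts below are made-up placeholders, not the values used in training:

import torch

def tri_state_schedule(step: int, warmup: int = 1_000, hold: int = 4_000, total: int = 10_000) -> float:
    # Linear warm-up, constant plateau, then linear decay to zero.
    if step < warmup:
        return step / warmup
    if step < warmup + hold:
        return 1.0
    return max(0.0, (total - step) / (total - warmup - hold))

# Dummy module standing in for the real model.
model = torch.nn.Linear(769, 50)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, tri_state_schedule)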
Evaluation
The results are measured in Phoneme Error Rate, PER for short. Using the validation set of Common Voice 17.0, we achieve a PER below 13%.
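The PER is the Levenshtein edit distance between predicted and reference phoneme sequences, normalized by the reference length. A minimal sketch using torchaudio (the example sequences below are made up):

import torchaudio.functional as F

def phoneme_error_rate(reference: list, hypothesis: list) -> float:
    # Edit distance over phoneme sequences, normalized by reference length.
    return F.edit_distance(reference, hypothesis) / len(reference)

# One substitution + one deletion over 8 reference phonemes = 25% PER.
reference = list("sakapita")
hypothesis = list("zakapit")
print(f"PER: {phoneme_error_rate(reference, hypothesis):.0%}")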
Technical Specifications
The model contains WavLM Base+, with a linear classifier on top.
This linear classifier has the following inputs (a hypothetical sketch is given after the list):
- The first input is the language (0 for French, 1 for Italian).
- The next 768 inputs are the raw outputs of WavLM Base+.
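A minimal sketch of what this head could look like, assuming the language flag is simply concatenated to each frame's features (the class name and details are illustrative; the real implementation lives in the phoneme_recognizer module):

import torch
from torch import nn

class PhonemeHead(nn.Module):
    """Hypothetical sketch of the classifier described above."""

    def __init__(self, vocab_size: int, hidden_size: int = 768):
        super().__init__()
        # 1 language flag + 768 WavLM Base+ features per frame.
        self.linear = nn.Linear(1 + hidden_size, vocab_size)

    def forward(self, features: torch.Tensor, language: int) -> torch.Tensor:
        # features: (batch, frames, 768); language: 0 = French, 1 = Italian.
        flag = torch.full(
            features.shape[:2] + (1,), float(language),
            dtype=features.dtype, device=features.device,
        )
        return self.linear(torch.cat([flag, features], dim=-1))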
To get phonemes from the classifier output, you can simply take an arg max and map the indices using vocab.json.
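For illustration, a hand-rolled greedy CTC decoder could look like the sketch below (the demo's classify_to_phonemes already does this for you; we assume vocab.json maps each phoneme to an index and that index 0 is the CTC blank):

import json

import torch

with open("vocab.json", "r") as file:
    vocab = json.load(file)  # Assumed mapping: phoneme -> index.
id_to_phoneme = {index: phoneme for phoneme, index in vocab.items()}

def greedy_ctc_decode(logits: torch.Tensor, blank_id: int = 0) -> str:
    # logits: (frames, vocab_size); take the arg max for each frame.
    ids = logits.argmax(dim=-1).tolist()
    # CTC collapse: merge consecutive repeats, then drop blanks.
    merged = [i for n, i in enumerate(ids) if n == 0 or i != ids[n - 1]]
    return "".join(id_to_phoneme[i] for i in merged if i != blank_id)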
Related works
The model was created as a successor to, and an extension of, Cnam-LMSSC/wav2vec2-french-phonemizer. The main differences are a more modern base model (WavLM Base+ vs. Wav2Vec 2.0) and a different training procedure.
But wait, the PER of Cnam-LMSSC/wav2vec2-french-phonemizer is 5%, while here it is 12%. Isn't that worse?
It is not the same kind of measurement. On the previous model, the PER is measured on the training set (with a risk of overfitting), while our PER is measured on data the model never saw. For reference, we achieved 2% PER on the training set after 100 epochs, yet the PER on the validation set was still 18%.
See also this very good multilingual version: ASR-Project/Multilingual-PR.
Todo list
- Data augmentation to finish the model training.
- Cleaner dataset with a better phonemizer.
- More powerful model using WavLM Large.
- More evaluation results.
Evaluation results
- Phoneme Error Rate (PER, %) on Mozilla Common Voice 17.0: 15.59 (self-reported)
- Phoneme Error Rate (PER, %) on Mozilla Common Voice 17.0, test set: 12.73 (self-reported)