WavLM Base+ French Italian Phonemizer
WARNING: this is an early work. The model training is not finished (better performance is to be expected), and everything was made with PyTorch, with little integration into Hugging Face. Pull requests, comments, and discussions are welcome!
This is a phonemization model that works for both French and Italian. Given an audio file, it outputs the phonemes heard, written in IPA. It does not use a language model, so it is unlikely to try to map audio onto existing words.
Model Details
- Developed by: HugoFara
- Funded by: NCCR Evolving Language
The training was conducted as part of the NCCR Evolving Language group, a Swiss research institute on language.
Uses
The model works with French and Italian audio. Currently, everything is managed through PyTorch. Let's transcribe an audio sample with the following code.
"""
Simple demonstration.
See main.py for a more complete demonstration.
"""
import json
import torch
import torchaudio
import transformers
import phoneme_recognizer
# Load the model with weights
with open("vocab.json", "r") as file:
phonemes_dict = json.load(file)
model = phoneme_recognizer.PhonemeRecognizer(phonemes_dict=phonemes_dict)
checkpoint = torch.load("model.pth")
model.load_state_dict(checkpoint)
# Prepare the input data
SAMPLING_RATE = 16_000
audio_array, frequency = torchaudio.load("audio-samples/tsenkher-fr.wav")
if frequency != SAMPLING_RATE:
raise ValueError(f"Input audio frequency should be {SAMPLING_RATE} Hz, it it {frequency} Hz.")
feature_extractor = transformers.AutoFeatureExtractor.from_pretrained(
"microsoft/wavlm-base-plus"
)
inputs = feature_extractor(
audio_array.squeeze(),
sampling_rate=SAMPLING_RATE,
padding=True,
return_tensors="pt",
)
inputs["language"] = "fr" # or "it"
# Do inference
with torch.no_grad():
logits = model(**inputs)
prediction = model.classify_to_phonemes(logits)[0]
print("Final phonemes are:", "".join(prediction))
# Should output: "sakapitalɛtsɑ̃kɛʁ"
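Note that the snippet above rejects audio that is not sampled at 16 kHz. If needed, you can resample beforehand with torchaudio instead of raising an error; a minimal sketch, reusing the sample file above:

import torchaudio
import torchaudio.functional as F

audio_array, frequency = torchaudio.load("audio-samples/tsenkher-fr.wav")
if frequency != 16_000:
    # Resample to the 16 kHz rate expected by WavLM Base+.
    audio_array = F.resample(audio_array, orig_freq=frequency, new_freq=16_000)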
Intended public
This model is mainly intended for clinicians who need audio transcriptions for a great volume of data. As the training was conducted on adult voices, it has the same speech recognition biases as for "normal" adult voices, which means it tends to correct accents as long as they are widespread.
Do not use this model for any harmful purpose.
Training Details
Training Data
The dataset was adapted from Common Voice 17.0, French and Italian versions. To get an IPA representation of the sentences, a text-to-phoneme (grapheme-to-phoneme) model was used: charsiu/g2p_multilingual_byT5_small_100. The language of each sample (either French or Italian) was also saved as a dataset feature.
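For reference, here is a minimal sketch of running that grapheme-to-phoneme model with transformers. The usage pattern follows the Charsiu documentation; the language prefixes (assumed here to be <fra> and <ita>) should be checked against the list of codes supported by the model:

from transformers import AutoTokenizer, T5ForConditionalGeneration

g2p = T5ForConditionalGeneration.from_pretrained(
    "charsiu/g2p_multilingual_byT5_small_100"
)
tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")

# One word per entry, each prefixed with its (assumed) language code.
words = ["<fra>: bonjour", "<ita>: ciao"]
encoded = tokenizer(words, padding=True, add_special_tokens=False, return_tensors="pt")
predictions = g2p.generate(**encoded, num_beams=1, max_length=50)
print(tokenizer.batch_decode(predictions, skip_special_tokens=True))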
Training Procedure
Only the training split of Common Voice 17.0 is used during training.
First, only the linear classifier is trained: we freeze the weights of both the feature encoder and the transformer. We use a tri-state linear warm-up for simplicity. The loss is a CTC loss, and the evaluation metric is the Phoneme Error Rate (PER). Once the PER decreases below 60%, this initial training stops. Due to the size of the dataset, one epoch is enough.
For the second phase of training, we unfreeze the transformer. We restart the same training procedure, a tri-state linear warm-up, from scratch. At the time of writing, the model has only completed a single epoch of training.
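As an illustration, such a tri-state schedule can be written as a LambdaLR multiplier: linear warm-up, constant plateau, then linear decay. All step counts below are made-up placeholders, not the values used in training:

import torch

def tri_state_schedule(step: int, warmup: int = 1_000, hold: int = 4_000, total: int = 10_000) -> float:
    # Linear warm-up, constant plateau, then linear decay to zero.
    if step < warmup:
        return step / warmup
    if step < warmup + hold:
        return 1.0
    return max(0.0, (total - step) / (total - warmup - hold))

# Dummy module standing in for the real model.
model = torch.nn.Linear(769, 50)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, tri_state_schedule)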
Evaluation
The results are measured in Phoneme Error Rate, PER for short. Using the validation set of Common Voice 17.0, we achieve a PER below 13%.
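The PER is the Levenshtein edit distance between predicted and reference phoneme sequences, normalized by the reference length. A minimal sketch using torchaudio (the example sequences below are made up):

import torchaudio.functional as F

def phoneme_error_rate(reference: list, hypothesis: list) -> float:
    # Edit distance over phoneme sequences, normalized by reference length.
    return F.edit_distance(reference, hypothesis) / len(reference)

# One substitution + one deletion over 8 reference phonemes = 25% PER.
reference = list("sakapita")
hypothesis = list("zakapit")
print(f"PER: {phoneme_error_rate(reference, hypothesis):.0%}")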
Technical Specifications
The model contains WavLM Base+, with a linear classifier on top.
This linear classifier has the following inputs (a hypothetical sketch is given after the list):
- The first input is the language (0 for French, 1 for Italian).
- The next 768 inputs are the raw outputs of WavLM Base+.
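A minimal sketch of what this head could look like, assuming the language flag is simply concatenated to each frame's features (the class name and details are illustrative; the real implementation lives in the phoneme_recognizer module):

import torch
from torch import nn

class PhonemeHead(nn.Module):
    """Hypothetical sketch of the classifier described above."""

    def __init__(self, vocab_size: int, hidden_size: int = 768):
        super().__init__()
        # 1 language flag + 768 WavLM Base+ features per frame.
        self.linear = nn.Linear(1 + hidden_size, vocab_size)

    def forward(self, features: torch.Tensor, language: int) -> torch.Tensor:
        # features: (batch, frames, 768); language: 0 = French, 1 = Italian.
        flag = torch.full(
            features.shape[:2] + (1,), float(language),
            dtype=features.dtype, device=features.device,
        )
        return self.linear(torch.cat([flag, features], dim=-1))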
To get phonemes from the classifier output, you can simply take an arg max and map the indices using vocab.json.
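For illustration, a hand-rolled greedy CTC decoder could look like the sketch below (the demo's classify_to_phonemes already does this for you; we assume vocab.json maps each phoneme to an index and that index 0 is the CTC blank):

import json

import torch

with open("vocab.json", "r") as file:
    vocab = json.load(file)  # Assumed mapping: phoneme -> index.
id_to_phoneme = {index: phoneme for phoneme, index in vocab.items()}

def greedy_ctc_decode(logits: torch.Tensor, blank_id: int = 0) -> str:
    # logits: (frames, vocab_size); take the arg max for each frame.
    ids = logits.argmax(dim=-1).tolist()
    # CTC collapse: merge consecutive repeats, then drop blanks.
    merged = [i for n, i in enumerate(ids) if n == 0 or i != ids[n - 1]]
    return "".join(id_to_phoneme[i] for i in merged if i != blank_id)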
Related works
The model was created as a successor to, and an extension of, Cnam-LMSSC/wav2vec2-french-phonemizer. The main differences are a more modern base model (WavLM Base+ vs. Wav2Vec 2.0) and a different training procedure.
But wait, the PER of Cnam-LMSSC/wav2vec2-french-phonemizer is 5%, while here it is 12%. Isn't that worse?
It is not the same kind of measurement. On the previous model, the PER is measured on the training set (with a risk of overfitting), while our PER is measured on data the model never saw. For reference, we achieved 2% PER on the training set after 100 epochs, yet the PER on the validation set was still 18%.
See also this very good multilingual version: ASR-Project/Multilingual-PR.
Todo list
- Data augmentation to finish the model training.
- Cleaner dataset with a better phonemizer.
- More powerful model using WavLM Large.
- More evaluation results.
Evaluation results
- Phoneme Error Rate (PER, %) on Mozilla Common Voice 17.0: 15.59 (self-reported)
- Phoneme Error Rate (PER, %) on Mozilla Common Voice 17.0, test set: 12.73 (self-reported)