Fine-tuning Parler TTS on a Specific Language

Community Article Published September 16, 2024

We fine-tuned Parler TTS mini v0.1 into a French version: parler-français-tts-mini-v0.1. This project shows how Parler TTS can be adapted to other languages.

What is Parler TTS?

Parler-TTS is a lightweight text-to-speech (TTS) model capable of generating high-quality, natural-sounding speech in the style of a given speaker (gender, tone, speaking style, etc.). It is a reproduction of the work presented in the paper Natural language guidance of high-fidelity text-to-speech with synthetic annotations by Dan Lyth and Simon King, from Stability AI and the University of Edinburgh, respectively.

The Parler TTS project is an open-source project initiated by Hugging Face. You can learn more about it in the parler-tts GitHub repository.

An Innovative Approach

Most current TTS models require two inputs: a text prompt (the script to be spoken) and a voice prompt (an audio sample that the model should emulate). This approach, while functional, has significant limitations:

  1. Lack of fine-grained control: The model must infer how to pronounce the text based on the context of the phrase and the audio sample provided. This makes it challenging, if not impossible, to specify desired emotions, tones, or speech rates.
  2. Limited customization: The model essentially decides how to speak on our behalf, based on its interpretation of the input. Users have little control over the nuances of the generated speech.
  3. Dependency on audio samples: Finding or creating appropriate audio samples for every desired voice characteristic can be time-consuming and limiting.
  4. Inconsistency: The generated speech may not consistently match the desired style across different texts or contexts.

Parler TTS takes a different approach. Instead of a voice prompt, it uses two text prompts: the text to be spoken and a description of how to speak it. This method offers several advantages:

  1. Precise control: Users can explicitly specify emotions, tones, speech rates, and other vocal characteristics.
  2. Flexibility: The same text can be easily rendered in multiple styles without needing different audio samples.
  3. Accessibility: This approach makes it easier to generate speech in styles for which audio samples might be scarce or unavailable.

The main challenge of this approach lies in preparing the training data: each audio clip needs a natural-language description of how it is spoken. Hugging Face's dataspeech project implements a solution.
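To give an intuition for how such descriptions can be produced, here is a simplified sketch (not the actual dataspeech implementation; the bin edges and phrasing are illustrative assumptions): measured audio attributes such as pitch, speaking rate, and signal-to-noise ratio are binned into keywords, then assembled into a sentence.

```python
# Simplified sketch of dataspeech-style annotation: numeric audio
# attributes are mapped to keywords, then assembled into a description.
# Bin edges and phrasing are illustrative assumptions, not dataspeech's.
def describe_speaker(pitch_hz: float, words_per_sec: float, snr_db: float) -> str:
    pitch = "high-pitched" if pitch_hz > 180 else "low-pitched"
    rate = "fast" if words_per_sec > 3.0 else "measured"
    quality = "very clear" if snr_db > 30 else "slightly noisy"
    return f"A {pitch} voice speaking at a {rate} pace in a {quality} recording."

print(describe_speaker(210.0, 3.4, 35.0))
# → "A high-pitched voice speaking at a fast pace in a very clear recording."
```

The real pipeline computes these attributes from the audio itself and uses a language model to vary the wording of the descriptions.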

Parler TTS Today

Parler TTS is a collection of models developed by Hugging Face, available in Mini and Large sizes and in two generations: v0.1 and v1. Our work focuses on fine-tuning Parler TTS mini v0.1, but the methodology can be applied to the other models, given more computational resources.

Our Approach

Choice of Base Model

We opted to fine-tune Parler TTS mini v0.1 due to its reduced size, which makes training for new languages more accessible. The methodology is replicable for all languages supported by the FLAN-T5 text encoder.

Fine-Tuning Process

  1. Dataset Selection:
    • Recommended minimum: 100 hours of audio (around 1,000 hours for optimal results)
    • Criteria: vocabulary diversity, homogeneous distribution of lengths (0.5-30 seconds), gender balance, audio quality, and accurate transcriptions
  2. Dataset Preparation:
    • Use of Hugging Face's dataspeech (note: you need to update the phonemizer for the target language).
    • Addition of "french" to the speaker descriptors (e.g., "french man" / "french woman"). Note: we actually observed better inference results without the word "french"; we will run more tests on its impact, but for now you may want to keep descriptions free of any mention of the language.
  3. Model Training:
    • Duration: less than 40 hours on a single NVIDIA H100
    • 55k steps completed
    • Notable results from 20k steps, significant improvement at 35k
  4. Challenges Encountered and Solutions:
    • Lack of punctuation in public datasets
    • Variable voice quality
    • Absence of gender annotations (we deduced gender from pitch, but gender-classification models are also available on the Hub)
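For the missing gender annotations, a simple pitch-based heuristic is enough to get started. This is a sketch: the 165 Hz threshold is an illustrative assumption, and a gender-classification model from the Hub will be more robust.

```python
import numpy as np

# Sketch: deduce speaker gender from the median fundamental frequency (f0)
# of voiced frames. The 165 Hz threshold is an illustrative assumption.
def gender_from_pitch(f0_frames, threshold_hz=165.0):
    voiced = np.asarray([f for f in f0_frames if f > 0.0])  # 0.0 = unvoiced frame
    if voiced.size == 0:
        return "unknown"
    return "female" if np.median(voiced) >= threshold_hz else "male"

print(gender_from_pitch([112.0, 118.0, 0.0, 121.0]))  # typical male f0 range → "male"
print(gender_from_pitch([205.0, 0.0, 214.0, 199.0]))  # typical female f0 range → "female"
```

The f0 frames themselves can come from any pitch tracker (e.g., the PENN estimator used by dataspeech, or librosa's pYIN).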

Results and Limitations

  • Performance: Generation of quality French speech. The quality of the generated speech is comparable to that of the audio in the training dataset.
  • Limitations:
    1. Difficulties with words underrepresented in the dataset
    2. The lack of female voices in the dataset (less than 5%) caused the model to underperform for female audio
    3. Loss of ability to speak English (model is French-only)

Important Note: We observed that the model's performance is significantly better when the speaker's nationality is not specified (no "french man"). We therefore recommend not including nationality in either the training data or the inference prompts. It would be interesting to retrain the model without mentioning nationality to evaluate the impact on the French model's performance.
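In practice, this can be done with a small cleanup pass over the descriptions before training or inference. A sketch, assuming descriptors of the form used above ("french man" / "french woman"); the regular expression is an assumption, not part of our pipeline:

```python
import re

# Sketch: drop nationality mentions ("french man" -> "man") from speaker
# descriptions, per the observation that they degrade generation quality.
def strip_nationality(description: str, nationality: str = "french") -> str:
    return re.sub(rf"\b{nationality}\s+", "", description, flags=re.IGNORECASE)

print(strip_nationality("A french woman speaks slowly with a soft voice."))
# → "A woman speaks slowly with a soft voice."
```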

Speech Synthesis Examples

Below is a comparison of audio samples generated by the original Parler TTS model and our fine-tuned French model. Both models synthesized the following texts:

  • Victor Hugo: "La voix humaine est un instrument de musique au-dessus de tous les autres." (The human voice is a musical instrument above all others.)
  • Jules Verne: "Tout ce qu'un homme est capable d'imaginer, d'autres hommes seront capables de le réaliser." (Whatever one man is capable of conceiving, other men will be able to achieve.)
  • Antoine de Saint-Exupéry: "La machine elle-même, si perfectionnée qu'on la suppose, n'est qu'un outil." (The machine itself, however perfect one might imagine it, is merely a tool.)
  • Voltaire: "Le progrès fait naître plus de besoins qu'il n'en satisfait." (Progress creates more needs than it satisfies.)

Conclusion and Future Prospects

Although we have succeeded in creating a functional French version, the main limitations for further progress are related to the quality and quantity of available open-source datasets.

Envisioned next steps:

  1. Consolidate multiple open-source datasets
  2. Train and evaluate a multilingual model on several languages from scratch (rather than fine-tuning)
  3. Encourage dataset annotation (using the dataspeech method)
  4. Create versions for other languages
  5. Fine-tune larger models

Acknowledgments

We would like to warmly thank Flexai for providing access to their training cloud on which this model was trained. We also express our gratitude to the Hugging Face community and the Parler TTS team for their fundamental work, and a special thanks to ylacombe for his advice.