LJSpeech Finetuned StyleTTS 2

This repository hosts checkpoints of a StyleTTS2 model specifically adapted for high-quality single-speaker speech synthesis using the LJSpeech dataset. StyleTTS2 is a state-of-the-art text-to-speech model known for its expressive and natural-sounding voice synthesis achieved through a style diffusion mechanism.

Our finetuning process began with the multispeaker StyleTTS2 model pretrained by the original authors on the LibriTTS dataset for 20 epochs, which provides a strong foundation in general speech characteristics. We then specialized this model by finetuning it on a roughly 1-hour subset of the LJSpeech dataset (around 1,000 audio samples) from a single speaker. Finetuning for 50 epochs on this targeted data lets the model capture the voice characteristics and nuances of the LJSpeech speaker. The approach is transferable: StyleTTS2 can be adapted to virtually any voice, provided sufficient audio from that speaker is available for finetuning.

Checkpoint Details

This repository includes checkpoints from two separate finetuning runs, located in the following subdirectories:

  • no-slm-discriminator: These checkpoints come from a finetuning run in which the Speech Language Model (WavLM) discriminator was intentionally disabled, because it caused Out-of-Memory (OOM) errors on a single NVIDIA RTX 3090. Despite this modification, finetuning completed successfully in approximately 9 hours 24 minutes on that hardware. Checkpoints are provided at 5-epoch intervals, from epoch_2nd_00004.pth to epoch_2nd_00049.pth.

  • with-slm-discriminator: These checkpoints come from a finetuning run that used the Speech Language Model (WavLM) discriminator, matching the default StyleTTS2 configuration. This leverages WavLM's representations as adversarial feedback during training, which can improve speech naturalness. This more computationally intensive run took approximately 2 days 18 hours on a single NVIDIA RTX 3090. As with the other run, checkpoints are provided every 5 epochs, from epoch_2nd_00004.pth to epoch_2nd_00049.pth. A sketch for enumerating these files is shown after this list.
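If you want to see exactly which checkpoint files each subdirectory contains before downloading anything, you can enumerate the repository with huggingface_hub. This is a minimal sketch, not part of the official StyleTTS2 codebase; the repository id is taken from the Usage example below, and the prefix is simply one of the two subdirectory names.

from huggingface_hub import list_repo_files

# List everything in this model repository, then keep one run's checkpoints
repo_id = "ibrazebra/lj-speech-finetuned-styletts2"
files = list_repo_files(repo_id)
no_slm_ckpts = sorted(
    f for f in files
    if f.startswith("no-slm-discriminator/") and f.endswith(".pth")
)
print(no_slm_ckpts)  # expected: epoch_2nd_00004.pth ... epoch_2nd_00049.pth at 5-epoch intervals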

Training Details

  • Base Model: StyleTTS2 (pretrained on LibriTTS for 20 epochs)
  • Finetuning Dataset: LJSpeech (1 hour subset, ~1k samples)
  • Number of Epochs: 50
  • Hardware (Run 1 - No SLM): 1 x NVIDIA RTX 3090
  • Hardware (Run 2 - With SLM): 1 x NVIDIA RTX 3090
  • Training Time (Run 1): ~9 hours 24 minutes
  • Training Time (Run 2): ~2 days 18 hours
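The settings above are recorded in the config_ft.yml stored alongside each run's checkpoints (the with-slm-discriminator file is referenced in the Usage example below). A minimal sketch for downloading and inspecting it, assuming PyYAML is installed:

import yaml
from huggingface_hub import hf_hub_download

# Fetch the finetuning config for the run that used the SLM discriminator
config_path = hf_hub_download("ibrazebra/lj-speech-finetuned-styletts2",
                              "with-slm-discriminator/config_ft.yml")
with open(config_path) as f:
    config = yaml.safe_load(f)
print(sorted(config.keys()))  # inspect the available architecture/training settings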

Usage

To use these finetuned StyleTTS 2 checkpoints, first set up the original StyleTTS2 codebase. Each checkpoint is a standard PyTorch .pth file; build the model from the accompanying config_ft.yml and load the checkpoint's state dictionaries into it using the loading utilities provided by the StyleTTS2 implementation. Below is a general Python example showing how you might download and load a checkpoint; adjust the repository id and paths to your local setup.

import torch
from huggingface_hub import hf_hub_download

# Example for loading a checkpoint (adjust the repository id and paths as needed)
repo_id = "ibrazebra/lj-speech-finetuned-styletts2"
checkpoint_path_with_slm = hf_hub_download(repo_id, "with-slm-discriminator/epoch_2nd_00049.pth")
config_path_with_slm = hf_hub_download(repo_id, "with-slm-discriminator/config_ft.yml")

# The checkpoint is an ordinary PyTorch file; load it on CPU first
checkpoint_with_slm = torch.load(checkpoint_path_with_slm, map_location="cpu")
# Load this state dictionary into your StyleTTS 2 model configured with the SLM discriminator