MB-MelGAN HiFi PostNets SW v4

MB-MelGAN HiFi PostNets SW v4 is a mel-to-wav model based on the MB-MelGAN architecture with HiFi-GAN discriminator. This model was trained from scratch on trained on real and synthetic audio datasets. Instead of training on ground truth waveform spectrograms, this model was trained on the generated PostNet spectrograms of LightSpeech MFA SW v4. The list of speakers include:

sw-TZ-Victoria
sw-TZ-Victoria-syllables-word
sw-TZ-Victoria-v2
sw-TZ-VictoriaNeural-upsampled-48kHz

This model was trained using the TensorFlowTTS framework. All training was done on a RTX 4090 GPU. All necessary scripts used for training could be found in this Github Fork, as well as the Training metrics logged via Tensorboard.

Model

Model	Config	SR (Hz)	Mel range (Hz)	FFT / Hop / Win (pt)	#steps
`mb-melgan-hifi-postnets-sw-v4`	Link	44.1K	20-11025	2048 / 512 / None	1M

Training Procedure

Feature Extraction Setting

sampling_rate: 44100
hop_size: 512 # Hop size.
format: "npy"

Generator Network Architecture Setting

model_type: "multiband_melgan_generator"

multiband_melgan_generator_params:
    out_channels: 4 # Number of output channels (number of subbands).
    kernel_size: 7 # Kernel size of initial and final conv layers.
    filters: 384 # Initial number of channels for conv layers.
    upsample_scales: [8, 4, 4] # List of Upsampling scales.
    stack_kernel_size: 3 # Kernel size of dilated conv layers in residual stack.
    stacks: 4 # Number of stacks in a single residual stack module.
    is_weight_norm: false # Use weight-norm or not.

Discriminator Network Architecture Setting

multiband_melgan_discriminator_params:
    out_channels: 1 # Number of output channels.
    scales: 3 # Number of multi-scales.
    downsample_pooling: "AveragePooling1D" # Pooling type for the input downsampling.
    downsample_pooling_params: # Parameters of the above pooling function.
        pool_size: 4
        strides: 2
    kernel_sizes: [5, 3] # List of kernel size.
    filters: 16 # Number of channels of the initial conv layer.
    max_downsample_filters: 512 # Maximum number of channels of downsampling layers.
    downsample_scales: [4, 4, 4] # List of downsampling scales.
    nonlinear_activation: "LeakyReLU" # Nonlinear activation function.
    nonlinear_activation_params: # Parameters of nonlinear activation function.
        alpha: 0.2
    is_weight_norm: false # Use weight-norm or not.

hifigan_discriminator_params:
    out_channels: 1 # Number of output channels (number of subbands).
    period_scales: [3, 5, 7, 11, 17, 23, 37] # List of period scales.
    n_layers: 5 # Number of layer of each period discriminator.
    kernel_size: 5 # Kernel size.
    strides: 3 # Strides
    filters: 8 # In Conv filters of each period discriminator
    filter_scales: 4 # Filter scales.
    max_filters: 512 # maximum filters of period discriminator's conv.
    is_weight_norm: false # Use weight-norm or not.

STFT Loss Setting

stft_loss_params:
    fft_lengths: [1024, 2048, 512] # List of FFT size for STFT-based loss.
    frame_steps: [120, 240, 50] # List of hop size for STFT-based loss
    frame_lengths: [600, 1200, 240] # List of window length for STFT-based loss.

subband_stft_loss_params:
    fft_lengths: [384, 683, 171] # List of FFT size for STFT-based loss.
    frame_steps: [30, 60, 10] # List of hop size for STFT-based loss
    frame_lengths: [150, 300, 60] # List of window length for STFT-based loss.

Adversarial Loss Setting

lambda_feat_match: 10.0 # Loss balancing coefficient for feature matching loss
lambda_adv: 2.5 # Loss balancing coefficient for adversarial loss.

Data Loader Setting

batch_size: 32 # Batch size for each GPU with assuming that gradient_accumulation_steps == 1.
eval_batch_size: 16
batch_max_steps: 8192 # Length of each audio in batch for training. Make sure dividable by hop_size.
batch_max_steps_valid: 8192 # Length of each audio for validation. Make sure dividable by hope_size.
remove_short_samples: true # Whether to remove samples the length of which are less than batch_max_steps.
allow_cache: true # Whether to allow cache in dataset. If true, it requires cpu memory.
is_shuffle: true # shuffle dataset after each epoch.

Optimizer & Scheduler Setting

generator_optimizer_params:
    lr_fn: "PiecewiseConstantDecay"
    lr_params:
        boundaries: [100000, 150000, 400000, 500000, 600000, 700000]
        values: [0.0005, 0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
    amsgrad: false

discriminator_optimizer_params:
    lr_fn: "PiecewiseConstantDecay"
    lr_params:
        boundaries: [100000, 200000, 300000, 400000, 500000]
        values: [0.00025, 0.000125, 0.0000625, 0.00003125, 0.000015625, 0.000001]
    amsgrad: false

gradient_accumulation_steps: 1

Interval Setting

discriminator_train_start_steps: 200000 # steps begin training discriminator
train_max_steps: 1000000 # Number of training steps.
save_interval_steps: 20000 # Interval steps to save checkpoint.
eval_interval_steps: 5000 # Interval steps to evaluate the network.
log_interval_steps: 200 # Interval steps to record the training log.

Other Setting

num_save_intermediate_results: 1 # Number of batch to be saved as intermediate results.

How to Use

import soundfile as sf
import tensorflow as tf
from tensorflow_tts.inference import TFAutoModel, AutoProcessor

lightspeech = TFAutoModel.from_pretrained("bookbot/lightspeech-mfa-sw-v4")
processor = AutoProcessor.from_pretrained("bookbot/lightspeech-mfa-sw-v4")
mb_melgan = TFAutoModel.from_pretrained("bookbot/mb-melgan-hifi-postnets-sw-v4")

text, speaker_name = "Hello World.", "sw-TZ-Victoria"
input_ids = processor.text_to_sequence(text)

mel, _, _ = lightspeech.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor(
        [processor.speakers_map[speaker_name]], dtype=tf.int32
    ),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)

audio = mb_melgan.inference(mel)[0, :, 0]
sf.write("./audio.wav", audio, 44100, "PCM_16")

Disclaimer

Do consider the biases which came from pre-training datasets that may be carried over into the results of this model.

Authors

MB-MelGAN HiFi PostNets SW v4 was trained and evaluated by David Samuel Setiawan, Wilson Wongso. All computation and development are done on local machines.

Framework versions

TensorFlowTTS 1.8
TensorFlow 2.12.0

bookbot
/

mb-melgan-hifi-postnets-sw-v4