---
language:
  - da
license: mit
base_model: microsoft/speecht5_tts
tags:
  - generated_from_trainer
datasets:
  - alexandrainst/nst-da
model-index:
  - name: speecht5_tts-finetuned-nst-da
    results: []
metrics:
  - mse
pipeline_tag: text-to-speech
---

# speecht5_tts-finetuned-nst-da

This model is a fine-tuned version of microsoft/speecht5_tts on the NST Danish ASR Database dataset. It achieves the following results on the evaluation set:

- Loss: 0.3738

## Model description

Given that Danish is a low-resource language, not many open-source implementations of a Danish text-to-speech synthesizer are available online. As of writing, the only other existing implementations available on 🤗 are facebook/seamless-streaming and audo/seamless-m4t-v2-large. This model has been developed to provide a simpler alternative that still performs reasonably well, both in terms of output quality and inference time. Additionally, unlike the aforementioned models, this model has an associated Space on 🤗 at JackismyShephard/danish-speech-synthesis, which provides an easy interface for Danish text-to-speech synthesis as well as optional speech enhancement.

## Intended uses & limitations

The model is intended for Danish text-to-speech synthesis.

The model does not recognize special symbols such as "æ", "ø" and "å", as it uses the default tokenizer of microsoft/speecht5_tts. The model performs best for short to medium length input text and expects input text to contain no more than 600 vocabulary tokens. Additionally, for best performance the model should be given a Danish speaker embedding, ideally generated from an audio clip from the training split of alexandrainst/nst-da using speechbrain/spkrec-xvect-voxceleb, as sketched below.
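As a minimal sketch of how such a speaker embedding could be obtained (the helper below and the assumption of a raw 16 kHz mono waveform are illustrative, not taken from the training script):

```python
import torch
from speechbrain.pretrained import EncoderClassifier

# x-vector speaker encoder referenced above.
speaker_model = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-xvect-voxceleb",
    savedir="/tmp/spkrec-xvect-voxceleb",
)

def create_speaker_embedding(waveform):
    """Turn a mono 16 kHz waveform (1-D float array) into a (1, 512) x-vector."""
    with torch.no_grad():
        embedding = speaker_model.encode_batch(torch.tensor(waveform))
        embedding = torch.nn.functional.normalize(embedding, dim=2)
    return embedding.squeeze().unsqueeze(0)
```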

The output of the model is a log-mel spectrogram, which should be converted to a waveform using microsoft/speecht5_hifigan. For better output quality, the resulting waveform can be enhanced using ResembleAI/resemble-enhance.

An example script showing how to use the model for inference can be found here.
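For reference, the snippet below is a minimal inference sketch using the standard `transformers` SpeechT5 API; the repository id `JackismyShephard/speecht5_tts-finetuned-nst-da` and the placeholder speaker embedding are assumptions, and the linked script remains the authoritative example.

```python
import soundfile as sf
import torch
from transformers import SpeechT5ForTextToSpeech, SpeechT5HifiGan, SpeechT5Processor

checkpoint = "JackismyShephard/speecht5_tts-finetuned-nst-da"  # assumed repo id
processor = SpeechT5Processor.from_pretrained(checkpoint)
model = SpeechT5ForTextToSpeech.from_pretrained(checkpoint)
vocoder = SpeechT5HifiGan.from_pretrained("microsoft/speecht5_hifigan")

# Keep in mind that "æ", "ø" and "å" are not in the tokenizer vocabulary (see above).
text = "Det er en dejlig dag i dag."
inputs = processor(text=text, return_tensors="pt")

# Placeholder; replace with a real Danish x-vector, e.g. computed as sketched above.
speaker_embeddings = torch.zeros((1, 512))

# With a vocoder passed in, generate_speech returns a 16 kHz waveform instead of a spectrogram.
speech = model.generate_speech(inputs["input_ids"], speaker_embeddings, vocoder=vocoder)
sf.write("output.wav", speech.numpy(), samplerate=16000)
```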

## Training and evaluation data

The model was trained and evaluated on alexandrainst/nst-da using MSE as both loss and metric.
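A minimal sketch of loading the data with 🤗 Datasets is shown below; the `audio` column name and the absence of a configuration name are assumptions, and the dataset is large, so streaming may be preferable for quick inspection.

```python
from datasets import Audio, load_dataset

# Download the dataset and decode audio at the 16 kHz sampling rate expected by SpeechT5.
dataset = load_dataset("alexandrainst/nst-da")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
```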

## Training procedure

The script used for training the model can be found here.

### Training hyperparameters

The following hyperparameters were used during training (a rough `Seq2SeqTrainingArguments` sketch follows the list):

- learning_rate: 1e-05
- train_batch_size: 32
- eval_batch_size: 32
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.1
- num_epochs: 20
- mixed_precision_training: Native AMP
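
As a rough reconstruction (the training script linked above is authoritative), these settings could be expressed with `Seq2SeqTrainingArguments` along the following lines; the output directory is a placeholder.

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="speecht5_tts-finetuned-nst-da",  # placeholder
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    seed=42,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,
    num_train_epochs=20,
    fp16=True,  # "Native AMP" mixed-precision training
    # Adam betas=(0.9, 0.999) and epsilon=1e-08 are the Trainer defaults.
)
```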

### Training results

| Training Loss | Epoch | Step  | Validation Loss |
|:-------------:|:-----:|:-----:|:---------------:|
| 0.463         | 1.0   | 4715  | 0.4169          |
| 0.4302        | 2.0   | 9430  | 0.3963          |
| 0.447         | 3.0   | 14145 | 0.3883          |
| 0.4283        | 4.0   | 18860 | 0.3847          |
| 0.394         | 5.0   | 23575 | 0.3830          |
| 0.3934        | 6.0   | 28290 | 0.3812          |
| 0.3928        | 7.0   | 33005 | 0.3795          |
| 0.4123        | 8.0   | 37720 | 0.3781          |
| 0.3851        | 9.0   | 42435 | 0.3785          |
| 0.4234        | 10.0  | 47150 | 0.3783          |
| 0.3781        | 11.0  | 51865 | 0.3759          |
| 0.3951        | 12.0  | 56580 | 0.3782          |
| 0.4073        | 13.0  | 61295 | 0.3757          |
| 0.4278        | 14.0  | 66010 | 0.3768          |
| 0.4172        | 15.0  | 70725 | 0.3747          |
| 0.3854        | 16.0  | 75440 | 0.3753          |
| 0.4876        | 17.0  | 80155 | 0.3741          |
| 0.432         | 18.0  | 84870 | 0.3738          |
| 0.4435        | 19.0  | 89585 | 0.3745          |
| 0.4255        | 20.0  | 94300 | 0.3739          |

### Framework versions

- Transformers 4.37.0.dev0
- Pytorch 2.1.2+cu118
- Datasets 2.15.0
- Tokenizers 0.15.0