I want to try it and train other voices from SLR, any advice?

by igortamara - opened

Hi there, nice to have that voice for Piper. I'm interested in testing the voice. Maybe modifying voices.json would allow making a pull request to the original Piper repo; that might be the fastest way to offer voices to the project. It seems the original author may be rather busy (the GitHub repo hasn't been accepting PRs recently).

Maybe the MODEL_CARD would be something like this (specifying the initial checkpoint):

# Model card for daniela (x_high)

* Language: es_AR (Spanish, Argentina)
* Speakers: 1
* Quality: x_high
* Sample rate: 22,050 Hz

## Dataset

* URL: https://www.openslr.org/61/
* License: Attribution-ShareAlike 4.0 International

## Training

Trained from https://huggingface.co/datasets/rhasspy/piper-checkpoints/

I would also like to know which hardware you used and how much compute time it took.

It might be a good idea to write a Python script to automate porting voices from SLR to Piper; I would like to invest some time in it.
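A minimal sketch of what I have in mind, assuming the SLR 61 layout (a `line_index.tsv` with tab-separated file id and transcript) and the LJSpeech-style `metadata.csv` that `piper_train.preprocess` accepts; both layouts should be double-checked against the actual downloads:

```python
"""Sketch: port an OpenSLR crowdsourced dataset to an LJSpeech-style
layout for piper_train.preprocess. The line_index.tsv format (file id
<TAB> transcript) is my assumption based on the SLR 61 release; adjust
the column handling for other corpora."""
import csv
import shutil
from pathlib import Path

SLR_DIR = Path("slr61")        # extracted OpenSLR download (assumed layout)
OUT_DIR = Path("dataset/wav")  # piper expects wav/ plus metadata.csv
OUT_DIR.mkdir(parents=True, exist_ok=True)

rows = []
with open(SLR_DIR / "line_index.tsv", newline="", encoding="utf-8") as fh:
    for utt_id, text in csv.reader(fh, delimiter="\t"):
        src = SLR_DIR / f"{utt_id}.wav"
        if src.exists():
            shutil.copy(src, OUT_DIR / src.name)
            rows.append(f"{utt_id}|{text.strip()}")

# LJSpeech-style metadata.csv: one "id|transcript" line per utterance
(OUT_DIR.parent / "metadata.csv").write_text("\n".join(rows), encoding="utf-8")
print(f"wrote {len(rows)} utterances")
```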

Thanks for your work; any response is greatly appreciated.

I'm new to HF, but it seems repositories have more or less the same dynamics as on GitHub. Cloning the repo shallowly, adding your files, and modifying voices.json could be an approach that avoids downloading all the assets from the original repo.
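If only voices.json is needed, `huggingface_hub` can also fetch a single file instead of the whole repo; a small sketch, where the repo id `rhasspy/piper-voices` is my assumption of where the file lives:

```python
# Sketch: grab just voices.json from the upstream repo instead of
# cloning everything. The repo id is an assumption; adjust if the
# file lives elsewhere.
from huggingface_hub import hf_hub_download

path = hf_hub_download(repo_id="rhasspy/piper-voices", filename="voices.json")
print(path)  # path into the local HF cache; copy it elsewhere before editing
```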


Daniela sends word to say: thanks, Igortamara.

I used an RTX 3090 with 24 GB of VRAM, and it took around 4-6 hours to get these results (I don't remember exactly). It can be done with less hardware: I used a batch size of 24 and ran it for about 1500 epochs.
I also tried on a laptop GPU, a 4060 with 8 GB of VRAM and a batch size of 4. That works too.

It can probably be done on a CPU, but I haven't tried it myself.
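For reference, a launch with those settings would look roughly like the sketch below. The `piper_train` flags are written from memory of the repo's TRAINING.md and the checkpoint path is a placeholder, so verify both against the current source before running:

```python
# Sketch: launching piper_train with the settings above (batch size 24,
# ~1500 epochs, fine-tuning from a lessac checkpoint). Flag names follow
# piper's TRAINING.md as I remember it; double-check before relying on it.
import subprocess
import sys

subprocess.run(
    [
        sys.executable, "-m", "piper_train",
        "--dataset-dir", "dataset/",
        "--accelerator", "gpu",
        "--devices", "1",
        "--batch-size", "24",
        "--max_epochs", "1500",
        "--resume_from_checkpoint", "checkpoints/lessac_x_high.ckpt",  # placeholder
        "--checkpoint-epochs", "1",
        "--precision", "32",
    ],
    check=True,
)
```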

Hey @larcanio, thank you for the information.

I added a Space, https://huggingface.co/spaces/igortamara/sample-tts-piper, to easily showcase the voice you trained.

If you have any suggestions, feel free to let me know.

Owner

Thanks Igor! I appreciate the contribution.

I've updated the model card to link to your demo.

Hey @larcanio, thank you. I have some Kaggle notebooks to train on their GPUs, and I have noticed during training that the "r" and "rr" sounds come out kind of weak (more noticeable on the voice I fine-tuned than on Daniela's).

I have two hypotheses

  • The lessac high voice checkpoint biases the model away from learning r and rr.
  • I need a better dataset on my side with more rr and r samples.

Do you have the same feeling about the r and rr pronunciation? Or maybe you have some other ideas to improve it?

Owner

Hey, thanks. Yes, I noticed the bias in the R. The dataset only contains around 140 samples from each voice, so it's probably too little. A good idea would probably be to create synthetic data with another voice via cloning and then train with it.

That's a good idea: make a synthetic voice read text, and use that as a starting point to train the target voices. Thanks. I have to check which voice is a good one. For now I have two notebooks on Kaggle, one for [selecting texts with pronunciations that I guess are kind of flaky](https://www.kaggle.com/code/igortamara/choose-audios-from-dataset) right now, and another to make the synthetic voice read them to check the quality.
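For the text-selection part, one rough approach is to rank candidate sentences by how much trilled-r material they contain; a minimal sketch, where the scoring heuristic (counting "rr", word-initial r, and r after n/l/s, the Spanish trill contexts) is my own rough assumption:

```python
# Sketch: rank candidate sentences by trilled-r coverage. The heuristic
# counts "rr", word-initial "r", and "r" after n/l/s, normalized by
# sentence length; tune it against real selection results.
import re

TRILL = re.compile(r"rr|\br|(?<=[nls])r", re.IGNORECASE)

def trill_score(sentence: str) -> float:
    words = sentence.split()
    return len(TRILL.findall(sentence)) / max(len(words), 1)

sentences = [
    "El perro corre rápido por el barrio.",
    "La casa es bonita.",
]
for s in sorted(sentences, key=trill_score, reverse=True):
    print(f"{trill_score(s):.2f}  {s}")
```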

On the other hand, I also ran an experiment training a voice from several different voices to see how the mix sounds. Given that the resulting voice is not that great, I went looking into clustering a 120-hour dataset to extract only the speech of a single speaker. It's proving to be a time-consuming process.
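For the clustering, one workable recipe is to embed each utterance with a speaker encoder and cluster the embeddings. The sketch below uses resemblyzer and scikit-learn as one possible choice, not necessarily what I'd settle on, and the cluster count is a guess to tune:

```python
# Sketch: group utterances by speaker via speaker embeddings.
# resemblyzer + scikit-learn are an example choice; n_clusters is a
# guess that would need tuning for a 120h dataset.
from pathlib import Path

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav
from sklearn.cluster import AgglomerativeClustering

encoder = VoiceEncoder()
wavs = sorted(Path("corpus").glob("*.wav"))
embeds = np.stack([encoder.embed_utterance(preprocess_wav(w)) for w in wavs])

labels = AgglomerativeClustering(n_clusters=10).fit_predict(embeds)
for wav, label in zip(wavs, labels):
    print(label, wav.name)
```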

If you have a suggestion for a synthetic voice to use as a starting point, I will be happy to try it.
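Meanwhile, the generation side could be as simple as piping the selected sentences through an existing piper voice. A sketch following the CLI usage in piper's README, where the model file name and paths are placeholders for whichever voice we pick:

```python
# Sketch: have an existing piper voice read the selected sentences to
# build a synthetic dataset. The CLI usage follows piper's README; the
# model file name is a placeholder.
import subprocess
from pathlib import Path

sentences = Path("selected.txt").read_text(encoding="utf-8").splitlines()
out = Path("synthetic/wav")
out.mkdir(parents=True, exist_ok=True)

for i, text in enumerate(sentences):
    subprocess.run(
        ["piper", "--model", "es_ES-voice-high.onnx",
         "--output_file", str(out / f"{i:05d}.wav")],
        input=text.encode("utf-8"),
        check=True,
    )
```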
