I want to try it and train other voices from SLR. Any advice?
Hi there, nice to have that voice for Piper. I'm interested in testing the voice, and maybe modifying voices.json
would allow a pull request to the original Piper repo; that might be the fastest way to offer voices for the project. It seems the original author can be kind of busy (the GitHub repo hasn't been accepting PRs recently).
Maybe the MODEL_CARD would be something like this (specifying the initial checkpoint):
# Model card for daniela (x_high)
* Language: es_AR (Spanish, Argentina)
* Speakers: 1
* Quality: x_high
* Samplerate: 22,050Hz
## Dataset
* URL: https://www.openslr.org/61/
* License: Attribution-ShareAlike 4.0 International
## Training
Trained from https://huggingface.co/datasets/rhasspy/piper-checkpoints/
I would also like to know which hardware you used and how much compute time it took.
It might be a good idea to make a Python script to automate porting voices from SLR to Piper; I would like to invest some time on it.
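As a first step for that script, here is a minimal sketch of the conversion from the OpenSLR layout to the LJSpeech-style layout that `piper_train.preprocess` consumes. I'm assuming a `line_index.tsv` with `utterance_id<TAB>transcript` rows next to the wav files (which is roughly how the crowdsourced es_AR set is shipped); the names and layout are assumptions to check against the actual archive and Piper's TRAINING.md:

```python
from pathlib import Path
import shutil

# Assumed layout of the extracted OpenSLR archive: a line_index.tsv with
# "utterance_id<TAB>transcript" rows and the wav files next to it.
SLR_DIR = Path("es_ar_female")      # hypothetical path, adjust to the downloaded set
OUT_DIR = Path("dataset_ljspeech")  # LJSpeech-style dir for piper_train.preprocess

(OUT_DIR / "wav").mkdir(parents=True, exist_ok=True)

with open(SLR_DIR / "line_index.tsv", encoding="utf-8") as src, \
     open(OUT_DIR / "metadata.csv", "w", encoding="utf-8") as dst:
    for line in src:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2:
            continue
        # Some SLR sets have extra columns; take the first and last fields.
        utt_id, text = parts[0].strip(), parts[-1].strip()
        wav = SLR_DIR / f"{utt_id}.wav"
        if not wav.exists():
            continue  # skip entries whose audio file is missing
        shutil.copy(wav, OUT_DIR / "wav" / wav.name)
        dst.write(f"{utt_id}|{text}\n")  # LJSpeech format: id|transcript
```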
Thanks for your work and any responses are greatly appreciated.
I'm new to HF, but it seems repositories have more or less the same dynamics as on GitHub; shallow-cloning the repo, adding your files, and modifying voices.json could be an approach that avoids downloading all the assets from the original repo.
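If the goal is just to grab `voices.json` and a few voice folders without pulling every asset, `huggingface_hub` can also download a partial snapshot; a small sketch below, where the repo id and the path pattern are assumptions on my part:

```python
from huggingface_hub import snapshot_download

# Download only voices.json plus the Spanish voice folders instead of the whole repo.
# "rhasspy/piper-voices" and the "es/es_AR/**" pattern are assumptions -- verify the
# actual upstream repo id and its folder layout before relying on this.
local_dir = snapshot_download(
    repo_id="rhasspy/piper-voices",
    allow_patterns=["voices.json", "es/es_AR/**"],
)
print("Partial checkout at:", local_dir)
```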
Daniela sends word that
Thanks Igortamara.
I used an RTX 3090 with 24 GB of VRAM, and it took around 4-6 hours to get these results (I don't remember exactly). It can be done with less hardware; I used a batch size of 24 and ran it for about 1500 epochs.
I also tried it on a laptop GPU, a 4060 with 8 GB of VRAM and a batch size of 4. That works too.
It can probably be done on a CPU, but I haven't tried it myself.
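Roughly, a run like that can be scripted as below. I'm going from memory and from Piper's TRAINING.md, so treat the language code, paths, checkpoint name, and exact flag spellings as assumptions to double-check against your piper_train version:

```python
import subprocess

DATASET_DIR = "dataset_ljspeech"   # LJSpeech-style dir (metadata.csv + wav files)
TRAIN_DIR = "train_es_ar_daniela"  # working dir created by preprocess
BASE_CKPT = "lessac-high.ckpt"     # placeholder name for the downloaded base checkpoint

# 1) Phonemize and normalize the dataset (flags per Piper's TRAINING.md).
subprocess.run(
    [
        "python", "-m", "piper_train.preprocess",
        "--language", "es-419",   # espeak-ng voice; plain "es" may be the safer choice
        "--input-dir", DATASET_DIR,
        "--output-dir", TRAIN_DIR,
        "--dataset-format", "ljspeech",
        "--single-speaker",
        "--sample-rate", "22050",
    ],
    check=True,
)

# 2) Fine-tune from the base checkpoint; batch size 24 roughly matches the
#    RTX 3090 run described above.
subprocess.run(
    [
        "python", "-m", "piper_train",
        "--dataset-dir", TRAIN_DIR,
        "--accelerator", "gpu",
        "--devices", "1",
        "--batch-size", "24",
        "--validation-split", "0.0",
        "--num-test-examples", "0",
        # Lightning resumes the epoch counter from the checkpoint, so this may
        # need to be the checkpoint's epoch plus the ~1500 extra epochs you want.
        "--max_epochs", "1500",
        "--resume_from_checkpoint", BASE_CKPT,
        "--checkpoint-epochs", "1",
        "--precision", "32",
    ],
    check=True,
)
```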
Hey @larcanio, thank you for the information.
I added a Space, https://huggingface.co/spaces/igortamara/sample-tts-piper, to easily showcase the voice you trained.
If you have any suggestions, feel free to let me know.
Thanks Igor! Appreciate the contribution.
I've updated the model card to link to your demo.
Hey @larcanio, thank you. I have some work on Kaggle notebooks to train on their GPUs, and I have noticed during training that the "r" and "rr" sounds are kind of weak (more noticeable on the voice I fine-tuned than on Daniela's).
I have two hypotheses:
- The lessac high base voice biases the model away from the "r" and "rr" sounds.
- I need a better dataset on my side, with more "r" and "rr" samples (a quick way to check this is sketched below).
Do you have the same feeling about the "r" and "rr" pronunciation, or do you have other ideas to improve it?
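Meanwhile, to test the second hypothesis, I'm thinking of simply counting how often the trilled contexts show up in the transcripts. A rough sketch of that check, assuming an LJSpeech-style `metadata.csv` as above; the phonetic heuristic is mine and only approximate:

```python
import csv
import re
from collections import Counter

counts = Counter()
with open("dataset_ljspeech/metadata.csv", newline="", encoding="utf-8") as f:
    for row in csv.reader(f, delimiter="|"):
        if len(row) < 2:
            continue
        text = row[1].lower()
        # Rough Spanish phonotactics: "rr", word-initial "r", and "r" after n/l/s
        # are trilled; most other "r" after a vowel is the single tap.
        counts["rr"] += len(re.findall(r"rr", text))
        counts["word-initial r"] += len(re.findall(r"\br(?!r)", text))
        counts["r after n/l/s"] += len(re.findall(r"(?<=[nls])r(?!r)", text))
        counts["tap r"] += len(re.findall(r"(?<=[aeiouáéíóú])r(?!r)", text))

print(counts)
```

If the trilled buckets turn out tiny compared to the tap bucket, that would point to the dataset rather than the base voice.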
Hey, thanks. Yes, I noted the bias in the "r". The dataset only contains around 140 samples from each voice, so it's probably too little. A good idea would probably be to create synthetic data with another voice (via voice cloning) and then train with it.
That's a good idea: make a synthetic voice read text and use that as a starting point to train for the target voices. Thanks. I have to check which voice is a good one. For now I have two notebooks on Kaggle, one for [selecting texts with pronunciations that I guess are kind of flaky](https://www.kaggle.com/code/igortamara/choose-audios-from-dataset) right now, and another to make the synthetic voice read those to see the quality.
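In case it helps to compare notes, this is roughly how I picture the synthetic-data step: feed the selected sentences to an existing Piper voice through its CLI and write an LJSpeech-style `metadata.csv` next to the wavs. The donor voice name is only a placeholder, I'm assuming the `.onnx` model has its `.json` config beside it, and the `piper` flags are the ones from its README:

```python
import subprocess
from pathlib import Path

VOICE = "es_MX-claude-high.onnx"  # placeholder: whichever donor voice works best
OUT = Path("synthetic_dataset")
(OUT / "wav").mkdir(parents=True, exist_ok=True)

sentences = [
    s.strip()
    for s in Path("selected_sentences.txt").read_text(encoding="utf-8").splitlines()
    if s.strip()
]

with open(OUT / "metadata.csv", "w", encoding="utf-8") as meta:
    for i, sentence in enumerate(sentences):
        utt_id = f"synt_{i:05d}"
        wav_path = OUT / "wav" / f"{utt_id}.wav"
        # `echo 'text' | piper --model voice.onnx --output_file out.wav`, per the Piper README.
        subprocess.run(
            ["piper", "--model", VOICE, "--output_file", str(wav_path)],
            input=sentence.encode("utf-8"),
            check=True,
        )
        meta.write(f"{utt_id}|{sentence}\n")  # LJSpeech format: id|transcript
```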
On the other hand, I also ran an experiment with a voice trained from several different voices, to see how the mix turns out. Since the resulting voice is not that great, I went looking into clustering on a 120 h dataset to extract only the utterances of the same voice. It's proving to be a time-consuming process.
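For that clustering step, this is the kind of sketch I'm working from: one speaker embedding per clip (Resemblyzer here, just as an example of an embedding model) and agglomerative clustering, keeping only the dominant cluster. The distance threshold is a guess to tune by listening, and on older scikit-learn versions the `metric` argument may still be called `affinity`:

```python
from pathlib import Path

import numpy as np
from resemblyzer import VoiceEncoder, preprocess_wav
from sklearn.cluster import AgglomerativeClustering

wav_paths = sorted(Path("clips").glob("*.wav"))
encoder = VoiceEncoder()

# One speaker embedding per clip.
embeddings = np.stack([encoder.embed_utterance(preprocess_wav(p)) for p in wav_paths])

# Cluster by cosine distance; the threshold is a guess to refine by listening.
labels = AgglomerativeClustering(
    n_clusters=None, distance_threshold=0.7, metric="cosine", linkage="average"
).fit_predict(embeddings)

# Keep only the clips from the largest cluster (presumably the target voice).
main = np.bincount(labels).argmax()
keep = [p for p, lab in zip(wav_paths, labels) if lab == main]
print(f"Keeping {len(keep)} of {len(wav_paths)} clips")
```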
If you have a suggestion for a synthetic voice to use as a starting point, I will be happy to try it.