Fine-tuning for Single Speaker
Hi, I'm new to IndicParler TTS. I'm trying to fine-tune it for a single speaker, but I'm encountering this error: 'TypeError: 'NoneType' object is not subscriptable'.
I suspect the issue might be related to using --feature_extractor_name "parler-tts/dac_44khZ_8kbps" because I couldn't find a feature extractor specifically for IndicParler. I'm a beginner and would appreciate some guidance.
Hi,
We do not train or fine-tune DAC on Indic Parler TTS data; we use the pretrained one from ylacombe/dac_44khz, so you should be able to use that. That said, AutoProcessor.from_pretrained("ai4bharat/indic-parler-tts", trust_remote_code=True) should also work. I can look into it if you share a code snippet.
Thank you for showing interest in Indic Parler TTS.
First of all, thank you so much for your time. I'm using the following script:
!accelerate launch ./training/run_parler_tts_training.py \
    --model_name_or_path "ai4bharat/indic-parler-tts-pretrained" \
    --feature_extractor_name "ylacombe/dac_44khz" \
    --description_tokenizer_name "ai4bharat/indic-parler-tts-pretrained" \
    --prompt_tokenizer_name "ai4bharat/indic-parler-tts-pretrained" \
    --report_to "wandb" \
    --overwrite_output_dir true \
    --train_dataset_name "mavihsrr/Hindi_TTS_M-2k" \
    --train_metadata_dataset_name "skjdhuhsnjd/h-t-tagged" \
    --train_dataset_config_name "default" \
    --train_split_name "train" \
    --eval_dataset_name "mavihsrr/Hindi_TTS_M-2k" \
    --eval_metadata_dataset_name "skjdhuhsnjd/h-t-tagged" \
    --eval_dataset_config_name "default" \
    --eval_split_name "train" \
    --max_eval_samples 8 \
    --per_device_eval_batch_size 8 \
    --target_audio_column_name "audio" \
    --description_column_name "text_description" \
    --prompt_column_name "text" \
    --max_duration_in_seconds 20 \
    --min_duration_in_seconds 2.0 \
    --max_text_length 400 \
    --preprocessing_num_workers 2 \
    --do_train true \
    --num_train_epochs 2 \
    --gradient_accumulation_steps 18 \
    --gradient_checkpointing true \
    --per_device_train_batch_size 2 \
    --learning_rate 0.00008 \
    --adam_beta1 0.9 \
    --adam_beta2 0.99 \
    --weight_decay 0.01 \
    --lr_scheduler_type "constant_with_warmup" \
    --warmup_steps 50 \
    --logging_steps 2 \
    --freeze_text_encoder true \
    --audio_encoder_per_device_batch_size 4 \
    --dtype "float16" \
    --seed 456 \
    --output_dir "./output_dir_training/" \
    --temporary_save_to_disk "./audio_code_tmp/" \
    --save_to_disk "./tmp_dataset_audio/" \
    --dataloader_num_workers 2 \
    --do_eval \
    --predict_with_generate \
    --include_inputs_for_metrics \
    --group_by_length true
However, I keep getting this error:
subprocess.CalledProcessError: Command '['/usr/bin/python3', './training/run_parler_tts_training.py', ...]' returned non-zero exit status 1
When I use the tokenizer (ylacombe/parler-tts-mini-v1-Jenny-colab) for both description and prompt, the process completes without errors, but the output audio quality is terrible. You can check the audio samples here: (https://wandb.ai/sjahk-/parler-speech/reports/Speech-samples-24-12-20-19-31-39---VmlldzoxMDY3NzI5Mw?accessToken=lmtsm2zj12qoc0nl8os0dgpdgyorvbufbgrqjnzfb1bqmfxmnak35cnxspoo6pgc)
Could you please guide me on the appropriate description and prompt tokenizer to use for fine-tuning in Hindi? Thanks in advance!
Any help would mean a lot! I believe the issue might be with the prompt or description tokenizer.
Hi @skjdhuhsnjd,
Please use the flan-t5-large tokenizer, since that is our description encoder as well. This model works well for our use case because the descriptions are still in English, and FlanT5 is instruction-tuned, which means it gives better representations even without further training.
For any clarification on which models were used, please look at the config: https://huggingface.co/ai4bharat/indic-parler-tts/blob/main/config.json
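One way to read the component models out of such a config is sketched below. It assumes the config nests text_encoder, audio_encoder, and decoder sections that each may carry a _name_or_path field; the sample dictionary is a hand-written stand-in built from names mentioned in this thread, not a copy of the real file.

```python
import json

# Hand-written stand-in for a downloaded config.json (an assumption for
# illustration; the real file has many more fields).
SAMPLE_CONFIG = json.loads("""
{
  "text_encoder": {"_name_or_path": "google/flan-t5-large"},
  "audio_encoder": {"_name_or_path": "ylacombe/dac_44khz"},
  "decoder": {}
}
""")


def component_names(config):
    """Map each sub-config section to its `_name_or_path`, or None if absent."""
    sections = ("text_encoder", "audio_encoder", "decoder")
    return {
        name: (config.get(name) or {}).get("_name_or_path")
        for name in sections
    }


if __name__ == "__main__":
    for section, source in component_names(SAMPLE_CONFIG).items():
        print(f"{section}: {source or 'not recorded'}")
```

To use it on the real file, load the downloaded config.json with json.load and pass the resulting dictionary to component_names.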
First of all, thank you so much for your time. I’m really sorry to bother you, but as a beginner, your help means a lot to me. I was using this notebook:
to fine-tune the Indic Parler pretrained model.
I replaced the model path with "ai4bharat/indic-parler-tts-pretrained", the prompt and description tokenizer with "google/flan-t5-large", and the feature extractor with "ylacombe/dac_44khz".
However, I’m still encountering this error: TypeError: dacmodel.encode() got an unexpected keyword argument 'bandwidth'
I’d be incredibly grateful if you could take some time from your busy schedule to guide me through this issue. Thank you so much in advance!
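For reference, the substitutions described in this post can be written down as a small sketch that assembles the launch command. It only records the values named above and the script path from the earlier command; it makes no claim that these are the right settings, and every other training flag is omitted.

```python
import shlex

# Overrides described in the post above; all other flags are left out here.
OVERRIDES = {
    "model_name_or_path": "ai4bharat/indic-parler-tts-pretrained",
    "feature_extractor_name": "ylacombe/dac_44khz",
    "description_tokenizer_name": "google/flan-t5-large",
    "prompt_tokenizer_name": "google/flan-t5-large",
}


def build_command(overrides, script="./training/run_parler_tts_training.py"):
    """Return the accelerate launch command as a list of arguments."""
    command = ["accelerate", "launch", script]
    for flag, value in overrides.items():
        command += [f"--{flag}", value]
    return command


if __name__ == "__main__":
    # Print a shell-quoted version that can be pasted into a terminal.
    print(shlex.join(build_command(OVERRIDES)))
```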
Which version of transformers are you using?
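A quick way to answer this is to print the installed versions from the same environment that runs the training script. The sketch below uses only the standard library; the distribution names in the loop are assumptions about how the packages were installed and may need adjusting.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Distribution names below are assumptions about how the packages
    # were installed in the environment.
    for package in ("transformers", "parler_tts", "accelerate", "torch"):
        print(f"{package}: {installed_version(package) or 'not installed'}")
```

Posting this output alongside the full traceback makes it easier to tell whether the unexpected keyword argument comes from a version mismatch between libraries.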