Fine-Tuning for a Single Speaker

by skjdhuhsnjd - opened 5 days ago

5 days ago

Hi, I'm new to IndicParler TTS. I'm trying to fine-tune it for a single speaker, but I'm encountering this error: 'TypeError: 'NoneType' object is not subscriptable'.

I suspect the issue might be related to using --feature_extractor_name "parler-tts/dac_44khZ_8kbps" because I couldn't find a feature extractor specifically for IndicParler. I'm a beginner and would appreciate some guidance.

AshwinSankar

AI4Bharat org 5 days ago

Hi,

We do not train or fine-tune DAC on Indic Parler TTS data; we use the pretrained one from ylacombe/dac_44khz, so you should be able to use that. That said, AutoProcessor.from_pretrained("ai4bharat/indic-parler-tts", trust_remote_code=True) should also work. I can look into it if you share a code snippet.

Thank you for showing interest in Indic Parler TTS.

skjdhuhsnjd

5 days ago (edited 3 days ago)

First of all, thank you so much for your time. I'm using the following script:

!accelerate launch ./training/run_parler_tts_training.py \
    --model_name_or_path "ai4bharat/indic-parler-tts-pretrained" \
    --feature_extractor_name "ylacombe/dac_44khz" \
    --description_tokenizer_name "ai4bharat/indic-parler-tts-pretrained" \
    --prompt_tokenizer_name "ai4bharat/indic-parler-tts-pretrained" \
    --report_to "wandb" \
    --overwrite_output_dir true \
    --train_dataset_name "mavihsrr/Hindi_TTS_M-2k" \
    --train_metadata_dataset_name "skjdhuhsnjd/h-t-tagged" \
    --train_dataset_config_name "default" \
    --train_split_name "train" \
    --eval_dataset_name "mavihsrr/Hindi_TTS_M-2k" \
    --eval_metadata_dataset_name "skjdhuhsnjd/h-t-tagged" \
    --eval_dataset_config_name "default" \
    --eval_split_name "train" \
    --max_eval_samples 8 \
    --per_device_eval_batch_size 8 \
    --target_audio_column_name "audio" \
    --description_column_name "text_description" \
    --prompt_column_name "text" \
    --max_duration_in_seconds 20 \
    --min_duration_in_seconds 2.0 \
    --max_text_length 400 \
    --preprocessing_num_workers 2 \
    --do_train true \
    --num_train_epochs 2 \
    --gradient_accumulation_steps 18 \
    --gradient_checkpointing true \
    --per_device_train_batch_size 2 \
    --learning_rate 0.00008 \
    --adam_beta1 0.9 \
    --adam_beta2 0.99 \
    --weight_decay 0.01 \
    --lr_scheduler_type "constant_with_warmup" \
    --warmup_steps 50 \
    --logging_steps 2 \
    --freeze_text_encoder true \
    --audio_encoder_per_device_batch_size 4 \
    --dtype "float16" \
    --seed 456 \
    --output_dir "./output_dir_training/" \
    --temporary_save_to_disk "./audio_code_tmp/" \
    --save_to_disk "./tmp_dataset_audio/" \
    --dataloader_num_workers 2 \
    --do_eval \
    --predict_with_generate \
    --include_inputs_for_metrics \
    --group_by_length true

However, I keep getting this error:
subprocess.CalledProcessError: Command '['/usr/bin/python3', './training/run_parler_tts_training.py', ...]' returned non-zero exit status 1
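
A note on reading this error: a `CalledProcessError` only reports that the launched script exited with a non-zero status, so the underlying Python traceback normally appears earlier in the output. The sketch below shows one way to surface it, assuming the script is started through `subprocess`. The inline child command is a hypothetical stand-in for the training script, chosen only because it fails quickly.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd and return (exit_code, last non-empty line of stderr)."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    stderr_lines = [line for line in result.stderr.splitlines() if line.strip()]
    last_line = stderr_lines[-1] if stderr_lines else ""
    return result.returncode, last_line


# Hypothetical stand-in for the training script: a child process that
# raises an exception and therefore exits with a non-zero status.
child = [sys.executable, "-c", "x = None; x[0]"]
code, last_line = run_and_capture(child)
print("exit status:", code)
print("last stderr line:", last_line)
```

The last line of stderr is usually the exception type and message, which is the part worth sharing when asking for help.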

When I use the tokenizer (ylacombe/parler-tts-mini-v1-Jenny-colab) for both description and prompt, the process completes without errors, but the output audio quality is terrible. You can check the audio samples here: (https://wandb.ai/sjahk-/parler-speech/reports/Speech-samples-24-12-20-19-31-39---VmlldzoxMDY3NzI5Mw?accessToken=lmtsm2zj12qoc0nl8os0dgpdgyorvbufbgrqjnzfb1bqmfxmnak35cnxspoo6pgc)

Could you please guide me on the appropriate description and prompt tokenizer to use for fine-tuning in Hindi? Thanks in advance!

skjdhuhsnjd

4 days ago

Any help would mean a lot! I believe the issue might be with the prompt or description tokenizer.

AshwinSankar

AI4Bharat org 4 days ago

Hi @skjdhuhsnjd ,

Please use the flan-t5-large tokenizer, as that is our description encoder as well. It works well for our use case because the descriptions are still in English, and FlanT5 is instruction-tuned, which means better representations even without training it.

AshwinSankar

AI4Bharat org 3 days ago

For any clarification on which models were used, please look at the config: https://huggingface.co/ai4bharat/indic-parler-tts/blob/main/config.json
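
As a rough sketch of how to read that file: once config.json is loaded as a dictionary, the checkpoint behind each component can be pulled out of its sub-config. The excerpt below is illustrative only. It assumes each sub-config carries a `_name_or_path` field, and its values are taken from this thread rather than copied from the real file. `component_checkpoints` is a hypothetical helper, not part of any library.

```python
import json

# Illustrative excerpt only: the real config.json has many more fields, and
# both the "_name_or_path" key and these values are assumptions drawn from
# this thread.
SAMPLE_CONFIG = json.loads("""
{
  "text_encoder": {"_name_or_path": "google/flan-t5-large"},
  "audio_encoder": {"_name_or_path": "ylacombe/dac_44khz"},
  "decoder": {}
}
""")


def component_checkpoints(config):
    """Map each sub-config name to its `_name_or_path`, when present."""
    found = {}
    for key, value in config.items():
        if isinstance(value, dict) and "_name_or_path" in value:
            found[key] = value["_name_or_path"]
    return found


components = component_checkpoints(SAMPLE_CONFIG)
for name, checkpoint in components.items():
    print(name, "->", checkpoint)
```

To run it against the real file, download config.json and replace `SAMPLE_CONFIG` with the result of `json.load` on it.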

skjdhuhsnjd

3 days ago (edited 3 days ago)

Hi @AshwinSankar

First of all, thank you so much for your time. I’m really sorry to bother you, but as a beginner, your help means a lot to me. I was using this notebook:

https://colab.research.google.com/github/ylacombe/scripts_and_notebooks/blob/main/Finetuning_Parler_TTS_on_a_single_speaker_dataset.ipynb

to fine-tune the Indic Parler pretrained model.

I replaced the model path with "ai4bharat/indic-parler-tts-pretrained", the prompt and description tokenizer with "google/flan-t5-large", and the feature extractor with "ylacombe/dac_44khz".

However, I’m still encountering this error:
TypeError: dacmodel.encode() got an unexpected keyword argument 'bandwidth'

I’d be incredibly grateful if you could take some time from your busy schedule to guide me through this issue. Thank you so much in advance!
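
An "unexpected keyword argument" error of this kind usually means the calling code and the installed function disagree about the function's signature, for example because of mismatched library versions. A minimal diagnostic sketch with the standard `inspect` module follows. The two `encode_*` functions are hypothetical stand-ins and do not reproduce the real DAC API. `accepts_keyword` is likewise a helper made up for illustration.

```python
import inspect


def accepts_keyword(func, name):
    """Return True if func can be called with keyword argument `name`."""
    for param in inspect.signature(func).parameters.values():
        # A **kwargs parameter accepts any keyword.
        if param.kind is inspect.Parameter.VAR_KEYWORD:
            return True
        if param.name == name and param.kind in (
            inspect.Parameter.POSITIONAL_OR_KEYWORD,
            inspect.Parameter.KEYWORD_ONLY,
        ):
            return True
    return False


# Hypothetical stand-ins for two versions of an encode() method.
def encode_with_bandwidth(input_values, bandwidth=None):
    return input_values


def encode_without_bandwidth(input_values):
    return input_values


print(accepts_keyword(encode_with_bandwidth, "bandwidth"))
print(accepts_keyword(encode_without_bandwidth, "bandwidth"))
```

Pointing `accepts_keyword` at the `encode` method of the loaded audio encoder would show whether the installed version accepts the argument the training script passes.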

AshwinSankar

AI4Bharat org about 21 hours ago

Which version of transformers are you using?
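
One way to check, as a minimal sketch using only the standard library. `installed_version` is a hypothetical helper name. `accelerate` is included only because the launch command earlier in the thread uses it.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of package, or None if absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the versions most relevant to this thread.
for name in ("transformers", "accelerate"):
    print(name, installed_version(name))
```

In a notebook, `!pip show transformers` gives the same information.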
