Finetuning on a single speaker

#2
by mishra999 - opened

Hey, I am getting the error DacModel.encode() got an unexpected keyword argument 'bandwidth' while trying to finetune the model on a single speaker.
I use the following config:
!accelerate launch ./training/run_parler_tts_training.py \
--model_name_or_path "HelpingAI/HelpingAI-TTS-v1" \
--feature_extractor_name "ylacombe/dac_44khz" \
--description_tokenizer_name "google/flan-t5-large" \
--prompt_tokenizer_name "google/flan-t5-large" \
--report_to "tensorboard" \
--overwrite_output_dir true \
--train_dataset_name "man-ml/my_audio_syn" \
--train_metadata_dataset_name "man-ml/my_audio_syn-Emma" \
--train_dataset_config_name "default" \
--train_split_name "train" \
--eval_dataset_name "man-ml/my_audio_syn" \
--eval_metadata_dataset_name "man-ml/my_audio_syn-Emma" \
--eval_dataset_config_name "default" \
--eval_split_name "train" \
--max_eval_samples 8 \
--per_device_eval_batch_size 8 \
--target_audio_column_name "audio" \
--description_column_name "text_description" \
--prompt_column_name "text" \
--max_duration_in_seconds 20 \
--min_duration_in_seconds 2.0 \
--max_text_length 400 \
--preprocessing_num_workers 2 \
--do_train true \
--num_train_epochs 2 \
--gradient_accumulation_steps 18 \
--gradient_checkpointing true \
--per_device_train_batch_size 2 \
--learning_rate 0.0001 \
--adam_beta1 0.9 \
--adam_beta2 0.99 \
--weight_decay 0.01 \
--lr_scheduler_type "constant_with_warmup" \
--warmup_steps 50 \
--logging_steps 2 \
--freeze_text_encoder true \
--audio_encoder_per_device_batch_size 5 \
--dtype "float16" \
--seed 456 \
--output_dir "./output_dir_training/" \
--temporary_save_to_disk "./audio_code_tmp/" \
--save_to_disk "./tmp_dataset_audio/" \
--dataloader_num_workers 2 \
--do_eval \
--predict_with_generate \
--include_inputs_for_metrics \
--group_by_length true

Any help would be appreciated.
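
For reference, one quick way to see which encode() signature the script is actually hitting (a rough sketch, assuming the parler-tts install exposes its DAC wrapper as parler_tts.DACModel):

```python
# Rough sketch: print the two encode() signatures to see whether the script is
# hitting transformers' DacModel (no `bandwidth` parameter, hence the error)
# or parler-tts's own DAC wrapper. The DACModel import path is assumed from
# the parler-tts package layout.
import inspect

import transformers
from transformers import DacModel
from parler_tts import DACModel  # parler-tts's DAC wrapper

print("transformers:", transformers.__version__)
print("DacModel.encode :", inspect.signature(DacModel.encode))
print("DACModel.encode :", inspect.signature(DACModel.encode))
# If the checkpoint's audio encoder resolves to transformers' DacModel at
# training time, the `bandwidth` kwarg passed by run_parler_tts_training.py
# is rejected with exactly this error.
```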

mishra999 changed discussion status to closed

Hi,
Could you let me know which dac package (and version) worked for you to run this code?

In my case, the following error occurred when I tried to use this model:
generation = model.generate(input_ids=input_ids, prompt_input_ids=prompt_input_ids, attention_mask=attention_mask,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "D:\_Develop\_dev\px3\.pixi\envs\default\Lib\site-packages\torch\utils\_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "D:\_Develop\_dev\px3\.pixi\envs\default\Lib\site-packages\parler_tts\modeling_parler_tts.py", line 3637, in generate
    sample = self.audio_encoder.decode(audio_codes=sample[None, ...], **single_audio_decode_kwargs).audio_values
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: DacModel.decode() missing 1 required positional argument: 'quantized_representation'

(I added attention_mask for the description and prompt due to earlier errors, but this DAC error still appears.)

Is there any specific transformers version that worked for you?
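
For context, the minimal call I am testing with looks roughly like this (a sketch following the standard Parler-TTS usage, not a fix; the description/prompt strings and pip distribution names are placeholders/assumptions):

```python
# Print the relevant package versions (distribution names assumed), then run a
# minimal generate() call of the kind that triggers the decode() error above.
from importlib.metadata import version, PackageNotFoundError

for pkg in ("transformers", "parler-tts", "torch"):
    try:
        print(pkg, version(pkg))
    except PackageNotFoundError:
        print(pkg, "not installed under this distribution name")

import torch
import soundfile as sf
from transformers import AutoTokenizer
from parler_tts import ParlerTTSForConditionalGeneration

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = ParlerTTSForConditionalGeneration.from_pretrained("HelpingAI/HelpingAI-TTS-v1").to(device)
tokenizer = AutoTokenizer.from_pretrained("HelpingAI/HelpingAI-TTS-v1")

description = "A clear female voice, neutral tone, close recording."
prompt = "Hello, this is a short test sentence."

desc = tokenizer(description, return_tensors="pt").to(device)
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# decode() failing inside generate() asking for `quantized_representation`
# suggests the audio_encoder resolved to transformers' DacModel rather than
# the parler-tts DAC wrapper (which is called with audio_codes=...), i.e.
# likely a transformers / parler-tts version mismatch rather than a problem
# with this call itself.
generation = model.generate(
    input_ids=desc.input_ids,
    prompt_input_ids=prompt_ids,
    attention_mask=desc.attention_mask,
)
audio = generation.cpu().numpy().squeeze()
sf.write("out.wav", audio, model.config.sampling_rate)
```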
Thanks in advance
