Weight conversion script

#2
by simeneide - opened

Hi, great work on reimplementing CSM in transformers! In parallel, I have trained a Norwegian/Swedish TTS model on top of the old code, but this will probably be the better base going forward. I looked briefly for the porting code but could not find it. Is it easily available or shareable?

Hey, sorry for the late answer. You can find the conversion script here! 🤗

Great, will give it a go.
Also, and I'll hijack this thread for now: I saw that the processor gives me label=-100 for the audio EOS token. Is this correct? Early training shows that the audio doesn't seem to end; instead the model starts hallucinating more audio tokens.

I am trying a hotfix for it now:

            batch = self.processor.apply_chat_template(
                examples, return_dict=return_dict, tokenize=tokenize, output_labels=output_labels
            )
            # Hotfix: the audio does not seem to end when it should. Add the audio EOS token to the trainable labels so the model can learn to stop.
            # The audio EOS label should be 0, but I am not sure this works as intended. Using 128003 gave an out-of-vocab error (as it only checks the 2051 audio tokens).
            batch["labels"][batch["input_ids"] == 128003] = 0
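
For anyone following along, here is a minimal, dependency-free sketch of what that last line does, written on plain Python lists instead of tensors. The function name `unmask_audio_eos` and the toy token ids are made up for illustration; only 128003 (the EOS id in `input_ids`) and 0 (the label I assign) come from my snippet above. It assumes that a label of -100 is skipped by the loss, which is the default `ignore_index` of PyTorch's cross-entropy.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def unmask_audio_eos(input_ids, labels, eos_input_id=128003, eos_label=0):
    """Return a copy of `labels` where every position whose input id is the
    audio EOS token gets `eos_label` instead of whatever it had before.

    Hypothetical helper: mirrors `labels[input_ids == 128003] = 0` on nested lists.
    """
    return [
        [eos_label if tok == eos_input_id else lab for tok, lab in zip(ids_row, label_row)]
        for ids_row, label_row in zip(input_ids, labels)
    ]


# Toy batch: the EOS positions start out masked with IGNORE_INDEX.
input_ids = [[5, 7, 128003], [9, 128003, 128003]]
labels = [[IGNORE_INDEX, 7, IGNORE_INDEX], [9, IGNORE_INDEX, IGNORE_INDEX]]

fixed = unmask_audio_eos(input_ids, labels)
print(fixed)
```

After this, the EOS positions carry a real target, so they contribute to the loss and the model gets a training signal for when to stop.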

Indeed! I forgot to include it when cleaning up the code. Thanks a lot for noticing! 🤗
I've integrated a fix in https://github.com/huggingface/transformers/pull/38215, so just install from source and you'll be good to go.

Also, you can now directly use weights from https://huggingface.co/sesame/csm-1b 😉
