## Work In Progress

- This is a fine-tuned checkpoint of [HKUSTAudio/Llasa-1B-Multilingual](https://huggingface.co/HKUSTAudio/Llasa-1B-Multilingual), trained on Cantonese audio data.
- Two additional tokens are added: `<|YUE_START|>` and `<|YUE_END|>`.

The chat template is:

```python
# input_text: the Cantonese text to synthesize.
# speech_ids_prefix: list of speech-token strings taken from a prompt audio clip,
# as in the upstream Llasa examples (assumed here to be empty when no prompt audio is used).
formatted_text = f"<|TEXT_UNDERSTANDING_START|><|YUE_START|>{input_text}<|YUE_END|><|TEXT_UNDERSTANDING_END|>"

chat = [
    {"role": "user", "content": "Convert the text to speech:" + formatted_text},
    {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + ''.join(speech_ids_prefix)}
]
```

## Roadmap

- [ ] Train on more data
- [ ] Train with emotions and speaker characteristics (gender, age)
- [ ] Benchmark with CER
- [ ] Gradio space
- [ ] Train with [LayerSkip](https://arxiv.org/abs/2404.16710)
- [ ] Train on better-filtered data
- [ ] Release training code
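
## Usage sketch

The chat template above can be wrapped in a small helper so the Cantonese tokens are always placed consistently. This is a minimal sketch, not the author's code: the function name `build_yue_chat` is hypothetical, and the `<|s_N|>` speech-token format in the usage comment is an assumption carried over from the upstream Llasa examples.

```python
from typing import Dict, List, Sequence


def build_yue_chat(
    input_text: str, speech_ids_prefix: Sequence[str] = ()
) -> List[Dict[str, str]]:
    """Build the two-turn chat for Cantonese TTS (hypothetical helper).

    input_text: the Cantonese text to synthesize.
    speech_ids_prefix: speech-token strings from a prompt audio clip;
        assumed to be empty when no prompt audio is used.
    """
    # Wrap the text in the Cantonese markers, inside the text-understanding span.
    formatted_text = (
        "<|TEXT_UNDERSTANDING_START|><|YUE_START|>"
        f"{input_text}"
        "<|YUE_END|><|TEXT_UNDERSTANDING_END|>"
    )
    return [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        # The assistant turn is left open so the model continues with speech tokens.
        {
            "role": "assistant",
            "content": "<|SPEECH_GENERATION_START|>" + "".join(speech_ids_prefix),
        },
    ]


if __name__ == "__main__":
    # Without prompt audio:
    for turn in build_yue_chat("你好"):
        print(turn["role"], "->", turn["content"])
    # With prompt audio, pass its speech tokens, e.g. ["<|s_1|>", "<|s_2|>"] (assumed format).
```

The returned list would presumably be passed to the tokenizer's `apply_chat_template` with the final assistant message left open for continuation, as in the upstream Llasa examples; check the base model card for the exact generation and decoding steps.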