## Work In Progress

- This is a fine-tuned checkpoint of [HKUSTAudio/Llasa-1B-Multilingual](https://huggingface.co/HKUSTAudio/Llasa-1B-Multilingual), trained on Cantonese audio data.

- Two additional tokens are added, `<|YUE_START|>` and `<|YUE_END|>`, which wrap the input text. The chat template is:
|
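```python
# `input_text` is the Cantonese text to synthesize.
# `speech_ids_prefix` is a list of speech-token strings from an optional audio prompt
# (presumably an empty list when no prompt is used, as in the upstream Llasa examples).
formatted_text = f"<|TEXT_UNDERSTANDING_START|><|YUE_START|>{input_text}<|YUE_END|><|TEXT_UNDERSTANDING_END|>"

chat = [
    {"role": "user", "content": "Convert the text to speech:" + formatted_text},
    {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + ''.join(speech_ids_prefix)}
]
```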
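
The template above can be wrapped in a small helper so the two inputs are explicit. This is a minimal sketch, not the author's code: `build_yue_chat` is a hypothetical name, and the `<|s_N|>` speech-token format in the usage lines is an assumption carried over from the upstream Llasa examples.

```python
from typing import Dict, List, Sequence


def build_yue_chat(input_text: str, speech_ids_prefix: Sequence[str] = ()) -> List[Dict[str, str]]:
    """Build the two-turn chat for Cantonese TTS (hypothetical helper)."""
    # Wrap the text in the Cantonese markers, inside the text-understanding markers.
    formatted_text = (
        "<|TEXT_UNDERSTANDING_START|><|YUE_START|>"
        f"{input_text}"
        "<|YUE_END|><|TEXT_UNDERSTANDING_END|>"
    )
    return [
        {"role": "user", "content": "Convert the text to speech:" + formatted_text},
        # The assistant turn is left open so the model continues with speech tokens.
        {"role": "assistant", "content": "<|SPEECH_GENERATION_START|>" + "".join(speech_ids_prefix)},
    ]


# No audio prompt: the assistant turn contains only the start marker.
chat = build_yue_chat("你好")

# With an audio prompt (assumed token format): its speech tokens are appended as a prefix.
chat_with_prompt = build_yue_chat("你好", ["<|s_1|>", "<|s_2|>"])
```

If this checkpoint follows the upstream Llasa usage, the resulting `chat` would then be passed to the tokenizer's `apply_chat_template` with `continue_final_message=True`, so that generation continues the assistant turn instead of starting a new one.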
|
|
|
## Roadmap |
|
- [ ] Train on more data |
|
- [ ] Train with emotions and speaker characteristics (gender, age)
|
- [ ] Benchmark with character error rate (CER)
|
- [ ] Gradio Space
|
- [ ] Train with [LayerSkip](https://arxiv.org/abs/2404.16710) |
|
- [ ] Train on better-filtered data
|
- [ ] Release training code |