Qwen/Qwen2.5-Omni-7B · How to do multimodal multi-task fine-tuning?

I am looking into ways to push the multimodal limits of the model.

I plan to leverage the multimodal capabilities of Qwen2.5-Omni to train a single model capable of:

OCR
ASR, AST, timestamp decoding
translation

I have a large-scale training corpus for each of these tasks to fine-tune the model on.

I wanted to know if there is a way to use the Hugging Face codebase to fine-tune the model on all of these tasks at once, perhaps by mixing the tasks and training on them together.
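To make "mixing" concrete, here is a rough sketch of what I have in mind: pick a task at each step with a fixed probability, then draw one example from that task's corpus. This is plain Python, not anything from the Qwen or Hugging Face code. The function name `mix_tasks`, the toy corpus, the file names, and the weights are all placeholders I made up for illustration. I believe `datasets.interleave_datasets` with its `probabilities` argument does something similar for real datasets, but I have not tried it with this model.

```python
import random
from typing import Dict, Iterator, List


def mix_tasks(
    task_examples: Dict[str, List[dict]],
    weights: Dict[str, float],
    num_samples: int,
    seed: int = 0,
) -> Iterator[dict]:
    """Yield examples, choosing the task for each one in proportion to its weight."""
    rng = random.Random(seed)  # local RNG so the mixture is reproducible
    names = list(task_examples)
    probs = [weights[name] for name in names]
    for _ in range(num_samples):
        # Sample a task first, then a random example from that task.
        task = rng.choices(names, weights=probs, k=1)[0]
        example = rng.choice(task_examples[task])
        # Tag each example with its task so a prompt template can be chosen later.
        yield {"task": task, **example}


# Toy placeholder data; real examples would hold image/audio inputs and targets.
toy_corpus = {
    "ocr": [{"input": "page_001.png", "target": "recognized text"}],
    "asr": [{"input": "clip_001.wav", "target": "transcript"}],
    "translation": [{"input": "source sentence", "target": "translated sentence"}],
}
toy_weights = {"ocr": 0.2, "asr": 0.5, "translation": 0.3}

mixed = list(mix_tasks(toy_corpus, toy_weights, num_samples=1000))
```

The part I am unsure about is how to feed a mixed stream like this into the model's processor and a training loop, since each task has a different input modality.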

I would appreciate any help on how to train this, in case anyone has references to code that would help me do it.