Have you tried finetuning Large Multimodal Models?

by RoadToNowhere

It seems that multimodal models which integrate OpenAI Whisper as their audio encoder perform significantly better than Whisper alone on ASR tasks.
For example, Qwen/Qwen2-Audio-7B-Instruct, Qwen/Qwen2.5-Omni-7B, microsoft/Phi-4-multimodal-instruct, moonshotai/Kimi-Audio-7B...
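As a reference point, here is a minimal sketch of how one of these models can be run for transcription via transformers, following the Qwen2-Audio usage pattern; the file path and prompt wording are placeholders, and it assumes a transformers version with Qwen2-Audio support plus librosa installed:

```python
# Sketch: transcribing a local clip with Qwen2-Audio-7B-Instruct.
# "sample.wav" and the prompt text are placeholders, not from this thread.
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Build a chat-style prompt with one audio turn asking for a transcription.
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "sample.wav"},
        {"type": "text", "text": "Transcribe this Japanese audio."},
    ]},
]
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

# Load the waveform at the rate the feature extractor expects (16 kHz).
audio, _ = librosa.load("sample.wav", sr=processor.feature_extractor.sampling_rate)

inputs = processor(text=text, audios=[audio], return_tensors="pt", padding=True).to(model.device)
out = model.generate(**inputs, max_new_tokens=256)
out = out[:, inputs["input_ids"].shape[1]:]  # drop the prompt tokens
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```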

Owner

I think the performance is mostly from the data; my Common Voice improvement is similar to theirs. I want to try Gemma 3n if I have the motivation afterwards.

Gemma 3n uses Google's closed-source audio encoder, and there is no detailed benchmark report, so it's hard to say how it performs on Japanese ASR tasks.

Owner

They open-sourced enough of it last week. I don't have enough data to train a good encoder, so pretraining with access to YouTube data is interesting. Lots of models overfit on standard-dataset audio; for example, Phi-4 was a lot worse than Whisper when I tested it on my own data.
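For anyone who wants to do the same kind of spot check, a sketch of running plain Whisper on your own clips with the transformers ASR pipeline (the file name is a placeholder):

```python
# Sketch: a Whisper baseline on your own audio for comparison.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    chunk_length_s=30,  # process long files in 30-second windows
    device_map="auto",
)

# Force Japanese transcription so Whisper doesn't translate or mis-detect.
result = asr("my_clip.wav", generate_kwargs={"language": "japanese", "task": "transcribe"})
print(result["text"])
```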

I tried transcribing some Japanese ASMR audio; Gemma 3n performed even worse than Phi-4 in both CER and WER. Its ability to output punctuation is quite bad.
But these are models without finetuning on Japanese datasets, so I'll wait for your good news.
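For reference, a sketch of how a comparison like this can be scored: character error rate with jiwer, computed once with punctuation kept and once with it stripped, to separate raw recognition errors from punctuation behavior. The example strings are made up. Note that WER on Japanese additionally needs word segmentation (e.g. with MeCab), since the text has no spaces.

```python
# Sketch: CER with and without punctuation, using jiwer.
import re
import jiwer

def strip_punct(s: str) -> str:
    # Remove common Japanese and ASCII punctuation (and whitespace) before scoring.
    return re.sub(r"[、。！？!?,.\s]", "", s)

reference  = "こんにちは、今日はいい天気ですね。"   # made-up example
hypothesis = "こんにちは今日はいい天気ですね"       # same text, punctuation dropped

print("CER (punctuation kept):   ", jiwer.cer(reference, hypothesis))
print("CER (punctuation removed):", jiwer.cer(strip_punct(reference), strip_punct(hypothesis)))
```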
