Have you tried fine-tuning Large Multimodal Models?
It seems that multimodal models which integrate OpenAI Whisper as their audio encoder perform significantly better on ASR tasks than Whisper alone.
For example, Qwen/Qwen2-Audio-7B-Instruct, Qwen/Qwen2.5-Omni-7B, microsoft/Phi-4-multimodal-instruct, moonshotai/Kimi-Audio-7B, and so on.
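If you want to sanity-check one of these on your own audio before committing to fine-tuning, a minimal transcription sketch with transformers looks roughly like this. This is just an assumption-heavy example, not an official recipe: the audio path `sample.wav` and the prompt wording are placeholders, and the exact chat/processor format should be checked against the model card for the version of transformers you're on.

```python
# pip install transformers librosa
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Load audio at the encoder's expected sampling rate (16 kHz for Whisper-based encoders).
# "sample.wav" is a placeholder file name.
audio, _ = librosa.load("sample.wav", sr=processor.feature_extractor.sampling_rate)

# Build a chat-style prompt asking for a transcription (prompt wording is an assumption).
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "sample.wav"},
        {"type": "text", "text": "Transcribe this Japanese audio."},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, audios=[audio], return_tensors="pt", padding=True).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
output_ids = output_ids[:, inputs["input_ids"].shape[1]:]  # keep only the newly generated tokens
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

The other models in the list follow a similar chat-plus-audio pattern, though each has its own processor quirks.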
I think the performance mostly comes from the data; my Common Voice improvement is similar to theirs. I want to try Gemma 3n if I find the motivation later.
Gemma 3n uses Google's closed-source audio encoder and has no detailed benchmark report, so it's hard to say how it performs on Japanese ASR tasks.
They open-sourced it last week, though. I don't have enough data to train a good encoder myself, so the pretraining with YouTube access is interesting. Lots of models overfit on standard-dataset audio; for example, Phi-4 was a lot worse than Whisper when I tested it on my own data.
I tried transcribing some Japanese ASMR audio, and Gemma 3n performed even worse than Phi-4 in both CER and WER. Its ability to output punctuation is very poor.
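In case it helps for comparing runs, this is roughly how I score that kind of test with jiwer. The reference/hypothesis strings below are placeholders, and the punctuation stripping is my own assumption so that punctuation style doesn't dominate the character error rate:

```python
# pip install jiwer
import re
import jiwer

# Placeholder reference/hypothesis pairs -- replace with your own transcripts.
refs = ["今日はいい天気ですね。", "よろしくお願いします。"]
hyps = ["今日は良い天気ですね", "よろしく お願いします。"]

def normalize(s: str) -> str:
    # Strip common Japanese/ASCII punctuation and whitespace before scoring.
    return re.sub(r"[\s、。，．・「」『』！？!?,.]", "", s)

cer = jiwer.cer([normalize(r) for r in refs], [normalize(h) for h in hyps])
print(f"CER: {cer:.3f}")

# Note: jiwer.wer() splits on whitespace, so a meaningful WER for Japanese
# needs a word segmenter (e.g. MeCab/fugashi) applied to both sides first.
```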
But these are models without fine-tuning on Japanese datasets, so I'll wait for your good news.