Have you tried fine-tuning Large Multimodal Models?
It seems that multimodal models which integrate OpenAI Whisper as their audio encoder perform significantly better on ASR tasks than Whisper alone.
For example, Qwen/Qwen2-Audio-7B-Instruct, Qwen/Qwen2.5-Omni-7B, microsoft/Phi-4-multimodal-instruct, moonshotai/Kimi-Audio-7B, and so on.
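If you want to sanity-check one of these on your own audio before committing to fine-tuning, a minimal transcription sketch with transformers looks roughly like this. This is just an assumption-heavy example, not an official recipe: the audio path `sample.wav` and the prompt wording are placeholders, and the exact chat/processor format should be checked against the model card for the version of transformers you're on.

```python
# pip install transformers librosa
import librosa
from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

model_id = "Qwen/Qwen2-Audio-7B-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2AudioForConditionalGeneration.from_pretrained(model_id, device_map="auto")

# Load audio at the encoder's expected sampling rate (16 kHz for Whisper-based encoders).
# "sample.wav" is a placeholder file name.
audio, _ = librosa.load("sample.wav", sr=processor.feature_extractor.sampling_rate)

# Build a chat-style prompt asking for a transcription (prompt wording is an assumption).
conversation = [
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "sample.wav"},
        {"type": "text", "text": "Transcribe this Japanese audio."},
    ]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
inputs = processor(text=prompt, audios=[audio], return_tensors="pt", padding=True).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
output_ids = output_ids[:, inputs["input_ids"].shape[1]:]  # keep only the newly generated tokens
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```

The other models in the list follow a similar chat-plus-audio pattern, though each has its own processor quirks.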
I think the performance mostly comes from the data; my Common Voice improvement is similar to theirs. I want to try Gemma 3n if I find the motivation later.
Gemma 3n uses Google's closed-source audio encoder and has no detailed benchmark report, so it's hard to say how it performs on Japanese ASR tasks.
They open-sourced it last week, though. I don't have enough data to train a good encoder myself, so the pretraining with YouTube access is interesting. Lots of models overfit on standard-dataset audio; for example, Phi-4 was a lot worse than Whisper when I tested it on my own data.
I tried transcribing some Japanese ASMR audio, and Gemma 3n performed even worse than Phi-4 in both CER and WER. Its ability to output punctuation is very poor.
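In case it helps for comparing runs, this is roughly how I score that kind of test with jiwer. The reference/hypothesis strings below are placeholders, and the punctuation stripping is my own assumption so that punctuation style doesn't dominate the character error rate:

```python
# pip install jiwer
import re
import jiwer

# Placeholder reference/hypothesis pairs -- replace with your own transcripts.
refs = ["今日はいい天気ですね。", "よろしくお願いします。"]
hyps = ["今日は良い天気ですね", "よろしく お願いします。"]

def normalize(s: str) -> str:
    # Strip common Japanese/ASCII punctuation and whitespace before scoring.
    return re.sub(r"[\s、。，．・「」『』！？!?,.]", "", s)

cer = jiwer.cer([normalize(r) for r in refs], [normalize(h) for h in hyps])
print(f"CER: {cer:.3f}")

# Note: jiwer.wer() splits on whitespace, so a meaningful WER for Japanese
# needs a word segmenter (e.g. MeCab/fugashi) applied to both sides first.
```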
But these are models without fine-tuning on Japanese datasets, so I'll wait for your good news.