This model is fine-tuned from microsoft/Phi-4-multimodal-instruct on Bingsu/zeroth-korean and google/fleurs for 5 epochs.

The model was trained for 960 steps on an H100 GPU, using these datasets for Korean automatic speech recognition (ASR).

After that, we continued training on the CoVoST2 Dataset / CoVoST2-Ko for automatic speech translation (AST).

The AST fine-tuned model is available here: Phi-4-multimodal-instruct-ko-speech

Evaluation

Evaluation was done on the following datasets:

  • ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) on the zeroth-test set (457 samples).
  • AST (Automatic Speech Translation): evaluated with BLEU score on fleurs ko <-> en speech translation results (270 samples).
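As a rough illustration of the ASR metric above, the following is a minimal sketch of corpus-level CER and WER built on Levenshtein edit distance. It is not the evaluation script used for this model: the function names, the corpus-level aggregation, and the choice to strip spaces before computing CER are all assumptions made for the example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion all cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses, ignore_spaces=True):
    """Corpus-level character error rate in percent.

    ignore_spaces=True is an assumption of this sketch, not a documented
    setting of the original evaluation script.
    """
    edits, total = 0, 0
    for ref, hyp in zip(references, hypotheses):
        if ignore_spaces:
            ref, hyp = ref.replace(" ", ""), hyp.replace(" ", "")
        edits += edit_distance(list(ref), list(hyp))
        total += len(ref)
    return 100.0 * edits / total


def wer(references, hypotheses):
    """Corpus-level word error rate in percent, splitting on whitespace."""
    edits, total = 0, 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        edits += edit_distance(ref_words, hyp_words)
        total += len(ref_words)
    return 100.0 * edits / total
```

Because insertions count as errors while the denominator is the reference length, either rate can exceed 100 when a hypothesis is much longer than its reference.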

The evaluation script was retrieved from here.

Compared to Phi-4-mm-inst-zeroth-kor and Phi-4-multimodal-finetune-ko-speech, ASR is improved: zeroth CER drops to 1.31, from 7.02 and 1.61 respectively.

| Model | zeroth-CER | zeroth-WER | fleurs-ko_en-BLEU | fleurs-ko_en-cot-BLEU | fleurs-en_ko-BLEU | fleurs-en_ko-cot-BLEU |
|---|---|---|---|---|---|---|
| original | 198.32 | - | 5.63 | 2.42 | 6.86 | 4.17 |
| daekeun-ml/Phi-4-multimodal-finetune-ko-speech | 1.61 | 3.54 | 7.67 | 8.38 | 12.31 | 9.69 |
| seastar105/Phi-4-mm-inst-zeroth-kor | 7.02 | - | 7.07 | 9.19 | 13.08 | 9.35 |
| ASR finetune (this model) | 1.31 | 2.95 | 7.46 | 6.24 | 12.15 | 8.91 |
| + 1 epoch finetune with Covost-Ko | 3.88 | - | 8.07 | 10.09 | 18.82 | 15.41 |
| AST finetuned model | 1.77 | 2.99 | 8.01 | 9.09 | 17.09 | 11.82 |
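To put the CER figures in the table on a common footing, the relative reduction against each comparison model can be worked out directly. This is only a small arithmetic sketch: the scores are copied from the table, while the helper name `relative_reduction` and the dictionary layout are hypothetical.

```python
def relative_reduction(baseline, new):
    """Relative error reduction in percent when moving from `baseline` to `new`."""
    return 100.0 * (baseline - new) / baseline


# zeroth-CER values copied from the results table.
zeroth_cer = {
    "daekeun-ml/Phi-4-multimodal-finetune-ko-speech": 1.61,
    "seastar105/Phi-4-mm-inst-zeroth-kor": 7.02,
}
this_model_cer = 1.31

for name, baseline in zeroth_cer.items():
    change = relative_reduction(baseline, this_model_cer)
    print(f"{name}: {change:.1f}% relative CER reduction")
```

With the table's values this comes to roughly 18.6% against daekeun-ml/Phi-4-multimodal-finetune-ko-speech and 81.3% against seastar105/Phi-4-mm-inst-zeroth-kor.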