Update README.md
README.md CHANGED
@@ -27,16 +27,16 @@ model-index:
       type: seastar105/fleurs_ko_en_test
     metrics:
     - type: bleu
-      value: 7.
+      value: 7.67
       name: ko2en
     - type: bleu
-      value:
+      value: 8.38
       name: ko2en-cot
     - type: bleu
-      value: 12.
+      value: 12.31
       name: en2ko (ko-mecab)
     - type: bleu
-      value: 9.
+      value: 9.69
       name: en2ko-cot (ko-mecab)
   - task:
       type: automatic-speech-recognition
@@ -45,8 +45,11 @@ model-index:
       type: kresnik/zeroth_korean
     metrics:
     - type: cer
-      value:
+      value: 1.61
       name: test CER
+    - type: wer
+      value: 3.54
+      name: test WER
 ---
 
 # Phi-4-multimodal-finetune-ko-speech
@@ -62,6 +65,9 @@ Total 35K samples. Each sample is a pair of Korean speech and its transcription.
 
 The model was trained on a single A100 80GB GPU for 4 epochs with a batch size of 16 using the `sample_finetune_speech.py` script from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).
 
+The latest uploaded version was fine-tuned with the **audio encoder unfrozen**, which significantly improved ASR performance over the baseline LoRA adapter-based fine-tuning.
+Comparing full fine-tuning with LoRA fine-tuning, the CER on the zeroth test set is 1.61% vs. 2.72%, and the WER is 3.54% vs. 7.19%.
+
 Note that this model is for PoC/experimental purposes only and is not intended for production use. More high-quality data, tuning, ablation studies, and experiments are needed.
 
 The Phi-4-multimodal model is strong in multimodal tasks, especially speech-to-text, and shows high potential for Korean language tasks. If you are interested in Korean speech-to-text, this model can be a good starting point.
@@ -69,19 +75,20 @@ The Phi-4-multimodal model is strong in multimodal tasks, especially speech-to-text
 ## Evaluation
 
 Evaluation was done on the following datasets:
-- ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) on the zeroth test set (457 samples).
-- AST (Automatic Speech Translation): evaluated with BLEU on fleurs ko <-> en speech translation
+- ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) and WER (Word Error Rate) on the zeroth test set (457 samples).
+- AST (Automatic Speech Translation): evaluated with BLEU on the fleurs ko <-> en speech translation test set (270 samples).
 
 The evaluation script was retrieved from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py).
 
 Compared to [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor), ASR is significantly improved thanks to more high-quality voice data, including my own voice. However, AST quality deteriorates on fleurs-ko2en-cot, so appropriate data should be mixed into training to mitigate catastrophic forgetting.
 
-| Model
-|
-| original
-| finetune (4 epochs) |
-| finetune (
-|
+| Model | zeroth CER (%) | zeroth WER (%) | fleurs-ko2en | fleurs-ko2en-cot | fleurs-en2ko | fleurs-en2ko-cot |
+|---|---|---|---|---|---|---|
+| original | 99.16 | 99.63 | 5.63 | 2.42 | 6.86 | 4.17 |
+| Ours - speech full finetune (4 epochs) | 1.61 | 3.54 | 7.67 | 8.38 | 12.31 | 9.69 |
+| LoRA finetune (4 epochs) | 2.72 | 7.19 | 7.11 | 9.95 | 13.22 | 10.45 |
+| LoRA finetune (1 epoch) | 3.80 | 11.52 | 7.03 | 7.04 | 12.50 | 9.54 |
+| Phi-4-mm-inst-zeroth-kor | 7.02 | 17.31 | 7.07 | 9.19 | 13.08 | 9.35 |
 
 ## Usage
 
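The key change documented in this commit is training with the audio encoder unfrozen rather than with LoRA adapters alone. Below is a minimal sketch of what that selective unfreezing can look like; the `"audio"` substring match on parameter names is an assumption about how Phi-4-multimodal's remote code names its audio-encoder weights, so inspect `model.named_parameters()` before relying on it.

```python
# Minimal sketch: full fine-tuning with the audio encoder unfrozen (vs. LoRA-only).
# Assumption: audio-encoder parameters carry "audio" in their names; verify with
# model.named_parameters() on the actual checkout before training.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    _attn_implementation="eager",  # upstream card's option for GPUs without flash-attn
)

# Freeze everything, then unfreeze only audio-related parameters.
for name, param in model.named_parameters():
    param.requires_grad = "audio" in name.lower()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable / 1e6:.1f}M of {total / 1e6:.1f}M params")
```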
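For reference, the reported numbers pair CER/WER for ASR with BLEU for AST, and the "(ko-mecab)" suffix in the metric names indicates Korean references tokenized with mecab. The following is a self-contained sketch of those computations, not the author's linked gist, assuming `jiwer` and `sacrebleu` (with the Korean extra for the `ko-mecab` tokenizer); the sample strings are placeholders.

```python
# Sketch of the metric computations: CER/WER via jiwer, BLEU via sacreBLEU.
# Install: pip install jiwer "sacrebleu[ko]"   (the ko extra pulls in mecab-ko)
import jiwer
import sacrebleu

refs = ["안녕하세요 만나서 반갑습니다"]  # ground-truth transcripts / translations
hyps = ["안녕하세요 만나서 반갑습니다"]  # model outputs

cer = jiwer.cer(refs, hyps) * 100  # character error rate, %
wer = jiwer.wer(refs, hyps) * 100  # word error rate, %

# BLEU for en2ko, tokenizing Korean output with mecab-ko as in "en2ko (ko-mecab)".
bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize="ko-mecab")
print(f"CER {cer:.2f}%  WER {wer:.2f}%  BLEU {bleu.score:.2f}")
```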
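The Usage section itself is not touched by this commit. For orientation, here is a transcription sketch following the prompt format documented on the upstream microsoft/Phi-4-multimodal-instruct card; the repo id is a placeholder, and `sample.wav` and the generation settings are assumptions rather than the card's verbatim example.

```python
# Sketch of ASR inference with the fine-tuned checkpoint. The prompt format
# follows the upstream Phi-4-multimodal-instruct card; replace model_id with
# the actual repo id of this model (the value below is a placeholder).
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "Phi-4-multimodal-finetune-ko-speech"  # placeholder repo id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

audio, sr = sf.read("sample.wav")  # Korean speech clip, mono
prompt = "<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>"
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
new_tokens = out[:, inputs["input_ids"].shape[1]:]  # strip the prompt tokens
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```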