daekeun-ml committed · verified
Commit 0d91403 · 1 Parent(s): a4e17fa

Update README.md

Files changed (1): README.md (+20 −13)
README.md CHANGED
```diff
@@ -27,16 +27,16 @@ model-index:
         type: seastar105/fleurs_ko_en_test
       metrics:
       - type: bleu
-        value: 7.03
+        value: 7.67
         name: ko2en
       - type: bleu
-        value: 7.04
+        value: 8.38
         name: ko2en-cot
       - type: bleu
-        value: 12.5
+        value: 12.31
         name: en2ko (ko-mecab)
       - type: bleu
-        value: 9.54
+        value: 9.69
         name: en2ko-cot (ko-mecab)
     - task:
         type: automatic-speech-recognition
@@ -45,8 +45,11 @@ model-index:
         type: kresnik/zeroth_korean
       metrics:
       - type: cer
-        value: 7.02
+        value: 1.61
         name: test CER
+      - type: wer
+        value: 3.54
+        name: test WER
 ---
 
 # Phi-4-multimodal-finetune-ko-speech
@@ -62,6 +65,9 @@ Total 35K samples. Each sample is a pair of Korean speech and its transcription.
 
 The model was trained on a single A100 80GB GPU for 4 epochs with a batch size of 16, using the `sample_finetune_speech.py` script from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).
 
+The latest uploaded version of the model was fine-tuned with the **audio encoder unfrozen**, which significantly improves ASR performance over the baseline LoRA adapter-based fine-tuning.
+Comparing full fine-tuning with LoRA fine-tuning on the zeroth test set, the CER is 1.61% vs. 2.72% and the WER is 3.54% vs. 7.19%, respectively.
+
 Note that this model is for PoC/experimental purposes only and is not intended for production use. More high-quality data, tuning, ablation studies, and experiments are needed.
 
 The Phi-4-multimodal model is strong in multimodal tasks, especially speech-to-text, and shows high potential for Korean language tasks. If you are interested in Korean speech-to-text, this model can be a good starting point.
@@ -69,19 +75,20 @@ Phi-4-multimodal model is strong in multimodal tasks, especially in speech-to-te
 ## Evaluation
 
 Evaluation was done on the following datasets:
-- ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) on the zeroth test set (457 samples).
-- AST (Automatic Speech Translation): evaluated with the BLEU score on the fleurs ko <-> en speech translation results (270 samples).
+- ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) and WER (Word Error Rate) on the zeroth test set (457 samples).
+- AST (Automatic Speech Translation): evaluated with the BLEU score on the fleurs ko <-> en speech translation test set (270 samples).
 
 The evaluation script is retrieved from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py).
 
 Compared to [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor), ASR is significantly improved thanks to more high-quality voice data, including my own voice. However, AST quality deteriorates on fleurs-ko2en-cot, so appropriate data should be mixed in during training to mitigate catastrophic forgetting.
 
-| Model                    | zeroth-test | fleurs-ko2en | fleurs-ko2en-cot | fleurs-en2ko | fleurs-en2ko-cot |
-|--------------------------|-------------|--------------|------------------|--------------|------------------|
-| original                 | 198.32      | 5.63         | 2.42             | 6.86         | 4.17             |
-| finetune (4 epochs)      | 2.72        | 7.11         | 9.95             | 13.22        | 10.45            |
-| finetune (1 epoch)       | 3.80        | 7.03         | 7.04             | 12.50        | 9.54             |
-| Phi-4-mm-inst-zeroth-kor | 7.02        | 7.07         | 9.19             | 13.08        | 9.35             |
+| Model                                  | zeroth (CER) | zeroth (WER) | fleurs-ko2en | fleurs-ko2en-cot | fleurs-en2ko | fleurs-en2ko-cot |
+|----------------------------------------|--------------|--------------|--------------|------------------|--------------|------------------|
+| original                               | 99.16        | 99.63        | 5.63         | 2.42             | 6.86         | 4.17             |
+| Ours - speech full finetune (4 epochs) | 1.61         | 3.54         | 7.67         | 8.38             | 12.31        | 9.69             |
+| LoRA finetune (4 epochs)               | 2.72         | 7.19         | 7.11         | 9.95             | 13.22        | 10.45            |
+| LoRA finetune (1 epoch)                | 3.80         | 11.52       | 7.03         | 7.04             | 12.50        | 9.54             |
+| Phi-4-mm-inst-zeroth-kor               | 7.02         | 17.31        | 7.07         | 9.19             | 13.08        | 9.35             |
 
 ## Usage
```
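The CER and WER figures reported in the diff above are standard edit-distance metrics. As a self-contained illustration of how they are defined (a minimal sketch in pure Python, not the linked gist's actual evaluation code; `edit_distance`, `cer`, and `wer` are names chosen here for clarity):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences via dynamic programming."""
    prev = list(range(len(hyp) + 1))  # distances for the empty-reference prefix
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,              # deletion
                cur[j - 1] + 1,           # insertion
                prev[j - 1] + (r != h),   # substitution (free if tokens match)
            ))
        prev = cur
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: edit distance over whitespace-split tokens."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

In practice, corpus-level CER/WER is usually computed by summing edit distances and reference lengths over all utterances before dividing, rather than averaging per-utterance rates; libraries such as `jiwer` handle this along with text normalization.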