Update README.md
README.md CHANGED
@@ -27,16 +27,16 @@ model-index:
       type: seastar105/fleurs_ko_en_test
     metrics:
     - type: bleu
-      value: 7.
+      value: 7.67
       name: ko2en
     - type: bleu
-      value:
+      value: 8.38
       name: ko2en-cot
     - type: bleu
-      value: 12.
+      value: 12.31
       name: en2ko (ko-mecab)
     - type: bleu
-      value: 9.
+      value: 9.69
       name: en2ko-cot (ko-mecab)
   - task:
       type: automatic-speech-recognition
@@ -45,8 +45,11 @@ model-index:
       type: kresnik/zeroth_korean
     metrics:
     - type: cer
-      value:
+      value: 1.61
       name: test CER
+    - type: wer
+      value: 3.54
+      name: test WER
 ---
 
 # Phi-4-multimodal-finetune-ko-speech
@@ -62,6 +65,9 @@ Total 35K samples. Each sample is a pair of Korean speech and its transcription.
 
 The model was trained on a single A100 80GB GPU for 4 epochs with a batch size of 16 using the `sample_finetune_speech.py` script from [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct).
 
+The latest uploaded version was fine-tuned with the **audio encoder unfrozen**, which significantly improved ASR performance over the baseline LoRA adapter-based fine-tuning.
+Comparing full fine-tuning with LoRA fine-tuning, the CER on the zeroth test set is 1.61% vs. 2.72%, and the WER is 3.54% vs. 7.19%.
+
 Note that this model is for PoC/experimental purposes only and is not intended for production use. More high-quality data, tuning, ablation studies, and experiments are needed.
 
 The Phi-4-multimodal model is strong in multimodal tasks, especially speech-to-text, and shows high potential for Korean language tasks. If you are interested in Korean speech-to-text, this model can be a good starting point.
@@ -69,19 +75,20 @@ The Phi-4-multimodal model is strong in multimodal tasks, especially speech-to-text
 ## Evaluation
 
 Evaluation was done on the following datasets:
-- ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) on the zeroth test set (457 samples).
-- AST (Automatic Speech Translation): evaluated with BLEU on fleurs ko <-> en speech translation
+- ASR (Automatic Speech Recognition): evaluated with CER (Character Error Rate) and WER (Word Error Rate) on the zeroth test set (457 samples).
+- AST (Automatic Speech Translation): evaluated with BLEU on the fleurs ko <-> en speech translation test set (270 samples).
 
 The evaluation script was retrieved from [here](https://gist.github.com/seastar105/d1d8983b27611370528e3b194dcc5577#file-evaluate-py).
 
 Compared to [Phi-4-mm-inst-zeroth-kor](https://huggingface.co/seastar105/Phi-4-mm-inst-zeroth-kor), ASR is significantly improved thanks to more high-quality voice data, including my own voice. However, AST quality deteriorates on fleurs-ko2en-cot, so appropriate data should be mixed into training to mitigate catastrophic forgetting.
 
-| Model
-|
-| original
-| finetune (4 epochs) |
-| finetune (
-|
+| Model | zeroth CER (%) | zeroth WER (%) | fleurs-ko2en | fleurs-ko2en-cot | fleurs-en2ko | fleurs-en2ko-cot |
+|---|---|---|---|---|---|---|
+| original | 99.16 | 99.63 | 5.63 | 2.42 | 6.86 | 4.17 |
+| Ours - speech full finetune (4 epochs) | 1.61 | 3.54 | 7.67 | 8.38 | 12.31 | 9.69 |
+| LoRA finetune (4 epochs) | 2.72 | 7.19 | 7.11 | 9.95 | 13.22 | 10.45 |
+| LoRA finetune (1 epoch) | 3.80 | 11.52 | 7.03 | 7.04 | 12.50 | 9.54 |
+| Phi-4-mm-inst-zeroth-kor | 7.02 | 17.31 | 7.07 | 9.19 | 13.08 | 9.35 |
 
 ## Usage
 
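The key change documented in this commit is training with the audio encoder unfrozen rather than with LoRA adapters alone. Below is a minimal sketch of what that selective unfreezing can look like; the `"audio"` substring match on parameter names is an assumption about how Phi-4-multimodal's remote code names its audio-encoder weights, so inspect `model.named_parameters()` before relying on it.

```python
# Minimal sketch: full fine-tuning with the audio encoder unfrozen (vs. LoRA-only).
# Assumption: audio-encoder parameters carry "audio" in their names; verify with
# model.named_parameters() on the actual checkout before training.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    _attn_implementation="eager",  # upstream card's option for GPUs without flash-attn
)

# Freeze everything, then unfreeze only audio-related parameters.
for name, param in model.named_parameters():
    param.requires_grad = "audio" in name.lower()

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable / 1e6:.1f}M of {total / 1e6:.1f}M params")
```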
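For reference, the reported numbers pair CER/WER for ASR with BLEU for AST, and the "(ko-mecab)" suffix in the metric names indicates Korean references tokenized with mecab. The following is a self-contained sketch of those computations, not the author's linked gist, assuming `jiwer` and `sacrebleu` (with the Korean extra for the `ko-mecab` tokenizer); the sample strings are placeholders.

```python
# Sketch of the metric computations: CER/WER via jiwer, BLEU via sacreBLEU.
# Install: pip install jiwer "sacrebleu[ko]"   (the ko extra pulls in mecab-ko)
import jiwer
import sacrebleu

refs = ["안녕하세요 만나서 반갑습니다"]  # ground-truth transcripts / translations
hyps = ["안녕하세요 만나서 반갑습니다"]  # model outputs

cer = jiwer.cer(refs, hyps) * 100  # character error rate, %
wer = jiwer.wer(refs, hyps) * 100  # word error rate, %

# BLEU for en2ko, tokenizing Korean output with mecab-ko as in "en2ko (ko-mecab)".
bleu = sacrebleu.corpus_bleu(hyps, [refs], tokenize="ko-mecab")
print(f"CER {cer:.2f}%  WER {wer:.2f}%  BLEU {bleu.score:.2f}")
```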
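The Usage section itself is not touched by this commit. For orientation, here is a transcription sketch following the prompt format documented on the upstream microsoft/Phi-4-multimodal-instruct card; the repo id is a placeholder, and `sample.wav` and the generation settings are assumptions rather than the card's verbatim example.

```python
# Sketch of ASR inference with the fine-tuned checkpoint. The prompt format
# follows the upstream Phi-4-multimodal-instruct card; replace model_id with
# the actual repo id of this model (the value below is a placeholder).
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "Phi-4-multimodal-finetune-ko-speech"  # placeholder repo id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

audio, sr = sf.read("sample.wav")  # Korean speech clip, mono
prompt = "<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>"
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt").to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
new_tokens = out[:, inputs["input_ids"].shape[1]:]  # strip the prompt tokens
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```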