Adding Phi-4-Multimodal

#32
by Steveeeeeeen HF staff - opened
Hugging Face for Audio org
No description provided.

Here are the results of Phi-4-Multimodal on the Open ASR leaderboard benchmarks:

Filtering models by id: microsoft/Phi-4-multimodal-instruct


Results per dataset:


microsoft/Phi-4-multimodal-instruct | hf-audio-esb-datasets-test-only-sorted_ami_test: WER = 11.45 %, RTFx = 33.35
microsoft/Phi-4-multimodal-instruct | hf-audio-esb-datasets-test-only-sorted_earnings22_test: WER = 10.50 %, RTFx = 33.66
microsoft/Phi-4-multimodal-instruct | hf-audio-esb-datasets-test-only-sorted_gigaspeech_test: WER = 9.77 %, RTFx = 41.77
microsoft/Phi-4-multimodal-instruct | hf-audio-esb-datasets-test-only-sorted_librispeech_test.clea: WER = 1.67 %, RTFx = 47.28
microsoft/Phi-4-multimodal-instruct | hf-audio-esb-datasets-test-only-sorted_librispeech_test.other: WER = 3.82 %, RTFx = 45.86
microsoft/Phi-4-multimodal-instruct | hf-audio-esb-datasets-test-only-sorted_spgispeech_test: WER = 3.11 %, RTFx = 49.44
microsoft/Phi-4-multimodal-instruct | hf-audio-esb-datasets-test-only-sorted_tedlium_test: WER = 2.89 %, RTFx = 43.44
microsoft/Phi-4-multimodal-instruct | hf-audio-esb-datasets-test-only-sorted_voxpopuli_test: WER = 5.93 %, RTFx = 47.18


Composite Results:


microsoft/Phi-4-multimodal-instruct: WER = 6.14 %
microsoft/Phi-4-multimodal-instruct: RTFx = 45.52
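
As a quick sanity check on the composite numbers, here is a minimal sketch that recomputes them from the per-dataset values above. The composite WER is consistent with an unweighted mean of the eight per-dataset WERs. The composite RTFx is not the plain mean of the per-dataset RTFx values, so it is presumably aggregated differently (for example, total audio duration over total inference time); that is an assumption, not something stated in this thread. The dictionary names are illustrative only, and the keys are shortened versions of the dataset names in the log.

```python
from statistics import mean

# Per-dataset values copied from the evaluation log above.
wer_by_dataset = {
    "ami": 11.45,
    "earnings22": 10.50,
    "gigaspeech": 9.77,
    "librispeech_clean": 1.67,
    "librispeech_other": 3.82,
    "spgispeech": 3.11,
    "tedlium": 2.89,
    "voxpopuli": 5.93,
}
rtfx_by_dataset = {
    "ami": 33.35,
    "earnings22": 33.66,
    "gigaspeech": 41.77,
    "librispeech_clean": 47.28,
    "librispeech_other": 45.86,
    "spgispeech": 49.44,
    "tedlium": 43.44,
    "voxpopuli": 47.18,
}

# Unweighted mean of per-dataset WER: matches the reported composite WER.
composite_wer = round(mean(wer_by_dataset.values()), 2)

# Unweighted mean of per-dataset RTFx: does NOT match the reported 45.52,
# which suggests the composite RTFx is aggregated another way (assumption).
mean_rtfx = round(mean(rtfx_by_dataset.values()), 2)

print(f"Mean WER  = {composite_wer:.2f} %")   # 6.14 %
print(f"Mean RTFx = {mean_rtfx:.2f}")         # 42.75
```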


For comparison, here are the results reported in the technical report:

[Image: image.png]

Steveeeeeeen changed pull request status to open
Steveeeeeeen changed pull request status to merged
