About Performance Score (WER)

#1
by yas0005 - opened

Hello.

I'm trying to implement French ASR. We ran tests on the MLS dataset with the same model you uploaded ("bofenghuang/whisper-large-v3-french-distil-dec16"). According to the model card, a WER of 3.57 was achieved in this setting. However, when we ran the test, we measured 4.64, which is slightly worse than the reported score. In the training details, you mentioned there were quality issues with the dataset that you fixed.

  1. I'm wondering if there were also quality issue corrections or preprocessing for the test dataset?
  2. If so, would it be possible to share the actual dataset you used for testing?
  3. Additionally, I would like to ask if there were any preprocessing steps you applied to the results before measuring WER, and if so, could you share those details as well?

Thank you,

Hello,
Thank you for your interest!
We used the test set as is, without any corrections, but we ran normalization (lowercasing, punctuation removal, etc.) on both the predictions and the ground truths before computing WER. I can't remember whether the normalization was the one from Whisper or whether I added further normalization specific to French, but the same module was used for all evaluated models.
All normalized predictions/ground truths and per-utterance WER results can be found here.
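For anyone trying to reproduce the numbers, the normalize-then-score procedure described above can be sketched roughly as follows. This is only an illustration under assumptions: the `normalize` function below (lowercasing, punctuation stripping, whitespace collapsing) is a stand-in for whatever normalizer was actually used, and `wer` is a plain Levenshtein-based implementation rather than the author's evaluation module.

```python
import re
import string

def normalize(text: str) -> str:
    # Hypothetical normalizer: lowercase, strip punctuation, collapse
    # whitespace. The exact normalizer used (Whisper's or a French-specific
    # one) is not confirmed in the thread.
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def wer(reference: str, hypothesis: str) -> float:
    # Word error rate = word-level Levenshtein distance / reference length.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Without normalization these differ; after normalization the WER is 0.
ref = normalize("Bonjour, le monde !")
hyp = normalize("bonjour le monde")
print(wer(ref, hyp))  # 0.0
```

Whether normalization is applied (and which normalizer) can easily account for a gap like 3.57 vs. 4.64, since raw Whisper output keeps casing and punctuation that the references may not share.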
