Evaluation Dataset Size and Details
Hi,
Thanks for creating this space. I'd like to know more about the subsets used for evaluation, in particular whether the Common Voice evaluation uses the full test set.
Also, why does vhdm/whisper-large-fa-v1 have such a high WER? Is it because of hallucinations, or does it generate empty transcripts?
In addition, I'd like to recommend adding a new benchmark, PartAI/PSRB. The linked dataset only contains 1 hour of the full 10-hour benchmark, but the authors might be willing to share the full dataset for this benchmark space.
Hi,
As shown in the benchmark chart and table we published, we evaluate open-source models on several standard, open datasets, and Common Voice is one of them. When official splits exist, we use them (including the official test set) for reporting.
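For reference, here is a minimal sketch of how the official Common Voice Persian test split can be loaded with the 🤗 `datasets` library. The dataset repo and version (`mozilla-foundation/common_voice_17_0`) are assumptions for illustration only and are not necessarily the exact release pinned on the leaderboard:

```python
# Minimal sketch: load the official Common Voice Persian test split.
# NOTE: the repo/version below is assumed for illustration; the leaderboard
# may pin a different Common Voice release. Accessing the dataset requires
# accepting its terms on the Hub and authenticating with a token.
from datasets import load_dataset

cv_test = load_dataset(
    "mozilla-foundation/common_voice_17_0",  # assumed release
    "fa",                                    # Persian config
    split="test",                            # official test split, used in full
)
print(len(cv_test), "utterances in the official test split")
```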
Regarding vhdm/whisper-large-fa-v1: it has been widely promoted on social media, which made us curious to test it. In our evaluation, the model appears to have fundamental training issues; its high WER is largely due to empty outputs and hallucinations/out-of-domain text (sometimes in English). We normally wouldn’t include such models on the leaderboard, but given the extensive promotion, we added it to provide transparency about its actual accuracy.
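To make those failure modes concrete, here is a small sketch using `jiwer` with made-up toy sentences (not samples from our evaluation set): an empty transcript turns every reference word into a deletion, and hallucinated out-of-domain text adds substitutions plus insertions, which can push WER above 100%. This assumes a recent `jiwer` version that accepts an empty hypothesis string:

```python
# Toy illustration (not actual benchmark data): how empty outputs and
# hallucinations inflate WER. Requires `pip install jiwer`.
import jiwer

reference = "سلام دنیا چطوری"  # 3 reference words

# Case 1: empty transcript -> every reference word is a deletion -> WER = 1.0
print(jiwer.wer(reference, ""))

# Case 2: hallucinated English text -> 3 substitutions + 5 insertions on a
# 3-word reference -> WER well above 1.0
hallucination = "thank you for watching my video please subscribe"
print(jiwer.wer(reference, hallucination))
```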
Thanks for suggesting PartAI/PSRB; we’ll review and add it. If we can obtain the full 10-hour benchmark, we’ll use that; otherwise, we’ll start with the 1-hour subset.