WizardMath scores on GSM8k are much lower on the new leaderboard than on their paper
Hello,
I was looking at the new GSM8k results for some of the math-oriented models, and when checking the WizardMath models I see that they scored between 2 and 12 on the leaderboard, whereas their paper (https://huggingface.co/WizardLM/WizardMath-13B-V1.0) reports that the 70B model scored 81 and the 13B model scored 64.
Upon checking the GSM8k results dataset (https://huggingface.co/datasets/open-llm-leaderboard/details_WizardLM__WizardMath-13B-V1.0/blob/main/2023-10-12T22-45-52.861079/details_harness%7Cgsm8k%7C5_2023-10-12T22-45-52.861079.parquet), I see that on some failed tests the model did not finish its answer. The answer is truncated, which causes the test to fail.
Example: on the first line, we have
Question: "Jared is trying to increase his typing speed. He starts with 47 words per minute (WPM). After some lessons the next time he tests his typing speed it has increased to 52 WPM. If he continues to increase his typing speed once more by 5 words, what will be the average of the three measurements? "
Answer: "First, we need to find the total increase in typing speed"
Could there be a configuration error in the testing process that caused the answers to be truncated, resulting in these very low scores?
Are their reported results 5-shot?
It could be that the context length of these models is too small to accommodate a long enough prompt.
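A quick way to sanity-check that hypothesis is to count the tokens in a 5-shot GSM8k prompt; the few-shot format and the generation budget below are simplified assumptions, not the exact harness configuration:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("WizardLM/WizardMath-13B-V1.0")
gsm8k = load_dataset("gsm8k", "main", split="train")

# Build a rough 5-shot prompt from the first five training examples.
shots = "\n\n".join(
    f"Question: {ex['question']}\nAnswer: {ex['answer']}"
    for ex in gsm8k.select(range(5))
)
prompt = shots + "\n\nQuestion: <the evaluated question>\nAnswer:"

n_prompt_tokens = len(tokenizer(prompt)["input_ids"])
print(f"~{n_prompt_tokens} prompt tokens")
# If n_prompt_tokens plus the generation budget (say 256 tokens) exceeds the
# context window the evaluation runs with, the answer gets cut off.
```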
Tangentially, WizardMath also requires a very specific system prompt, and we don't allow system prompts on the leaderboard (h/t @osanseviero).
Regarding WizardMath and GSM8k, does it even make sense to disallow custom prompts? The previous benchmarks were all multiple-choice, and the model's answer was selected from the logprobs of each choice, so a fixed prompt was acceptable. GSM8k, however, is evaluated differently: the model has to generate an actual output from which the answer is extracted, so the prompt template becomes crucial, since we're scoring model-generated text rather than probabilities of given choices.
I understand that allowing custom prompts opens a whole other can of worms in terms of fairness, but I think we can agree that the current approach isn't fair either.
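To make that concrete, here is a minimal sketch of the kind of answer extraction GSM8k scoring relies on; the regex is a simplification, not the leaderboard's exact code:

```python
import re

def extract_answer(generation):
    """Return the last number in the generation, or None if there is none."""
    numbers = re.findall(r"-?\d[\d,]*(?:\.\d+)?", generation)
    return numbers[-1].replace(",", "") if numbers else None

truncated = "First, we need to find the total increase in typing speed"
complete = "The measurements are 47, 52 and 57 WPM, so the average is 52. The answer is 52."

print(extract_answer(truncated))  # None -> automatically scored as wrong
print(extract_answer(complete))   # "52" -> compared against the gold answer
```

A truncated generation never reaches the final number, so it is scored as wrong no matter how good the partial reasoning is.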
Hi! We have decided to allow system prompts soon, following interesting discussions such as this one. Thank you for your input :)
We'll make sure to communicate on it once it's done.
I'm closing the issue in the meantime!