How to calculate GPQA score?
Hello, I've been trying to reproduce leaderboard results for meta-llama/Meta-Llama-3-8B.
I noticed that the GPQA score listed on the leaderboard is 7.38:
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
However, when I checked the model details, GPQA scores range from 0.25 to 0.34:
https://huggingface.co/datasets/open-llm-leaderboard/meta-llama__Meta-Llama-3-8B-details/blob/main/meta-llama__Meta-Llama-3-8B/results_2024-06-16T19-10-04.926831.json#L152-L171
Could you clarify how the 7.38 score is derived from these individual scores?
Thank you!
Hi @JJaeuk,
Thank you for the question!
That's right, the normalised GPQA score for meta-llama/Meta-Llama-3-8B is 7.38, while the GPQA Raw column shows 0.31; the leaderboard displays the normalised value. You can find more info on normalisation in our documentation here:
https://huggingface.co/docs/leaderboards/open_llm_leaderboard/normalization
Or in our V2 blogpost:
https://huggingface.co/spaces/open-llm-leaderboard/blog
If you have any questions, I'm happy to help you understand the normalisation!
The issue with your answer is that, with a lower bound of 0.25, the normalized result R should be R = 100.0 * (0.31 - 0.25) / 0.75 = 8.00%, which is different from 7.38.
So the question is not really answered unless you provide a better explanation of how you normalized.
Regarding the performance, I notice that different versions of lm_eval give different performance ranges with the same model, and I guess that only depends on the lm-harness implementation.
Here is the exact normalisation function for the GPQA score:
import numpy as np

# Normalization function
def normalize_within_range(value, lower_bound=0, higher_bound=1):
    return np.clip(value - lower_bound, 0, None) / (higher_bound - lower_bound) * 100

# Normalize the GPQA score ('data' is the parsed results JSON linked below)
gpqa_raw_score = data['results']['leaderboard_gpqa']['acc_norm,none']
gpqa_score = normalize_within_range(gpqa_raw_score, 0.25, 1.0)
gpqa_raw_score, gpqa_score
Output: (0.3053691275167785, 7.38255033557047)
You can try to apply it to the recent results file:
https://huggingface.co/datasets/open-llm-leaderboard/meta-llama__Meta-Llama-3-8B-details/blob/main/meta-llama__Meta-Llama-3-8B/results_2024-06-16T19-10-04.926831.json
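If it helps, here is a minimal sketch of how that file could be fetched and fed into the function above (it assumes huggingface_hub is installed; the repo and file names are taken from the link above):

import json
import numpy as np
from huggingface_hub import hf_hub_download

# Download the results file from the details dataset (repo and file names from the link above)
path = hf_hub_download(
    repo_id='open-llm-leaderboard/meta-llama__Meta-Llama-3-8B-details',
    filename='meta-llama__Meta-Llama-3-8B/results_2024-06-16T19-10-04.926831.json',
    repo_type='dataset',
)

# Load the JSON and apply the same normalisation as above
with open(path) as f:
    data = json.load(f)

def normalize_within_range(value, lower_bound=0, higher_bound=1):
    return np.clip(value - lower_bound, 0, None) / (higher_bound - lower_bound) * 100

gpqa_raw_score = data['results']['leaderboard_gpqa']['acc_norm,none']
print(gpqa_raw_score, normalize_within_range(gpqa_raw_score, 0.25, 1.0))
# Should print roughly 0.3053691275167785 7.38255033557047, matching the output above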
In your calculation you used the raw score of 0.31, but that's a rounded value; the actual raw score is 0.3053691275167785. You can also find it in the Contents dataset:
https://huggingface.co/datasets/open-llm-leaderboard/contents
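To make the rounding point concrete, here is the same formula evaluated with both values (plain Python, using only the numbers already quoted in this thread):

# Unrounded raw score: reproduces the leaderboard value
print((0.3053691275167785 - 0.25) / (1.0 - 0.25) * 100)  # 7.38255033557047
# Rounded raw score: gives the ~8.00 figure computed above
print((0.31 - 0.25) / (1.0 - 0.25) * 100)  # about 8.00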
Regarding lm_eval, for V2 we use our fork:
https://github.com/huggingface/lm-evaluation-harness/tree/adding_all_changess
You can find more info in the reproducibility section:
https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about#reproducibility
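In case it helps with reproduction, here is a rough sketch using the harness Python API; the branch name comes from the fork linked above and the leaderboard_gpqa task name is the key from the results JSON, but please treat this as an illustration and follow the reproducibility docs for the exact command:

# pip install git+https://github.com/huggingface/lm-evaluation-harness.git@adding_all_changess
import lm_eval

# Assumes the fork exposes the standard lm_eval Python API and the leaderboard_gpqa task group
results = lm_eval.simple_evaluate(
    model='hf',
    model_args='pretrained=meta-llama/Meta-Llama-3-8B',
    tasks=['leaderboard_gpqa'],
)
print(results['results']['leaderboard_gpqa']['acc_norm,none'])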
I think it should be clear now, so let me close this discussion. Please feel free to open a new one if you have any further questions!