How to calculate GPQA score?
Hello, I've been trying to reproduce leaderboard results for meta-llama/Meta-Llama-3-8B.
I noticed that the GPQA score listed on the leaderboard is 7.38:
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard
However, when I checked the model details, GPQA scores range from 0.25 to 0.34:
https://huggingface.co/datasets/open-llm-leaderboard/meta-llama__Meta-Llama-3-8B-details/blob/main/meta-llama__Meta-Llama-3-8B/results_2024-06-16T19-10-04.926831.json#L152-L171
Could you clarify how the 7.38 score is derived from these individual scores?
Thank you!
Hi @JJaeuk,
Thank you for the question!
That's right, the normalised GPQA score for meta-llama/Meta-Llama-3-8B is 7.38, while the GPQA Raw column shows 0.31; the leaderboard displays the normalised value. You can find more info on normalisation in our documentation here:
https://huggingface.co/docs/leaderboards/open_llm_leaderboard/normalization
Or in our V2 blogpost:
https://huggingface.co/spaces/open-llm-leaderboard/blog
If you have any questions, I'm happy to help you understand the normalisation!
The issue with your answer is that, with a lower bound of 0.25, the normalized result R should be R = 100.0 * (0.31 - 0.25) / 0.75 = 8.00%, which is different from 7.38.
So the question is not really answered unless you provide a better explanation of how you normalized.
Regarding the performance, I notice that different versions of lm_eval give different performance ranges with the same model, and I guess that only depends on the lm-harness implementation.
Here is the exact normalisation function for the GPQA score:
import numpy as np

# Normalization function
def normalize_within_range(value, lower_bound=0, higher_bound=1):
    return np.clip(value - lower_bound, 0, None) / (higher_bound - lower_bound) * 100

# Normalize the GPQA score ('data' is the parsed results JSON linked below)
gpqa_raw_score = data['results']['leaderboard_gpqa']['acc_norm,none']
gpqa_score = normalize_within_range(gpqa_raw_score, 0.25, 1.0)
gpqa_raw_score, gpqa_score
Output: (0.3053691275167785, 7.38255033557047)
You can try to apply it to the recent results file:
https://huggingface.co/datasets/open-llm-leaderboard/meta-llama__Meta-Llama-3-8B-details/blob/main/meta-llama__Meta-Llama-3-8B/results_2024-06-16T19-10-04.926831.json
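If it helps, here is a minimal sketch of how that file could be fetched and fed into the function above (it assumes huggingface_hub is installed; the repo and file names are taken from the link above):

import json
import numpy as np
from huggingface_hub import hf_hub_download

# Download the results file from the details dataset (repo and file names from the link above)
path = hf_hub_download(
    repo_id='open-llm-leaderboard/meta-llama__Meta-Llama-3-8B-details',
    filename='meta-llama__Meta-Llama-3-8B/results_2024-06-16T19-10-04.926831.json',
    repo_type='dataset',
)

# Load the JSON and apply the same normalisation as above
with open(path) as f:
    data = json.load(f)

def normalize_within_range(value, lower_bound=0, higher_bound=1):
    return np.clip(value - lower_bound, 0, None) / (higher_bound - lower_bound) * 100

gpqa_raw_score = data['results']['leaderboard_gpqa']['acc_norm,none']
print(gpqa_raw_score, normalize_within_range(gpqa_raw_score, 0.25, 1.0))
# Should print roughly 0.3053691275167785 7.38255033557047, matching the output above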
In your calculation you used the raw score of 0.31, but that's a rounded value; the actual raw score is 0.3053691275167785. You can also find it in the Contents dataset:
https://huggingface.co/datasets/open-llm-leaderboard/contents
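To make the rounding point concrete, here is the same formula evaluated with both values (plain Python, using only the numbers already quoted in this thread):

# Unrounded raw score: reproduces the leaderboard value
print((0.3053691275167785 - 0.25) / (1.0 - 0.25) * 100)  # 7.38255033557047
# Rounded raw score: gives the ~8.00 figure computed above
print((0.31 - 0.25) / (1.0 - 0.25) * 100)  # about 8.00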
Regarding lm_eval, for V2 we use our fork:
https://github.com/huggingface/lm-evaluation-harness/tree/adding_all_changess
You can find more info in the reproducibility section:
https://huggingface.co/docs/leaderboards/open_llm_leaderboard/about#reproducibility
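In case it helps with reproduction, here is a rough sketch using the harness Python API; the branch name comes from the fork linked above and the leaderboard_gpqa task name is the key from the results JSON, but please treat this as an illustration and follow the reproducibility docs for the exact command:

# pip install git+https://github.com/huggingface/lm-evaluation-harness.git@adding_all_changess
import lm_eval

# Assumes the fork exposes the standard lm_eval Python API and the leaderboard_gpqa task group
results = lm_eval.simple_evaluate(
    model='hf',
    model_args='pretrained=meta-llama/Meta-Llama-3-8B',
    tasks=['leaderboard_gpqa'],
)
print(results['results']['leaderboard_gpqa']['acc_norm,none'])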
I think it should be clear now, so let me close this discussion. Please feel free to open a new one if you have any further questions!