Score gap in ARC challenge
Hello! First of all, big thanks for running the Open LLM Leaderboard.
I saw our model's (upstage/llama-30b-instruct-2048) score on the leaderboard and noticed a gap, so I'm reaching out to you. The arc_challenge score on the leaderboard is 58.3, but my local reproduction gives 65.19.
Here are the scripts I used for the local evaluation:
# download model weights
git clone https://huggingface.co/upstage/llama-30b-instruct-2048
# load evaluation code (pinned to the commit used for this run;
# the checkout must happen inside the cloned repository)
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout b281b0921b636bc36ad05c0b0b0763bd6dd43463
# run evaluation scripts
python main.py --model=hf-causal --model_args="pretrained=../llama-30b-instruct-2048" --tasks=arc_challenge --num_fewshot=25 --batch_size=2 --no_cache
I also saw other teams reporting score gaps, and I understand there was an error in the evaluation code. Could you please rerun our models as well? The model names are upstage/llama-30b-instruct-2048 and upstage/llama-30b-instruct.
I am also experiencing this with ariellee/SuperPlatty-30B and lilloukas/GPlatty-30B: the leaderboard lists 59.2 and 60.1, respectively, instead of 66.1 and 66. Could you rerun those two as well? Thank you!
Hi! @SaylorTwift is re-running all llama-based models at the moment: llama models handle whitespace tokens differently, which means they were handicapped by the previous version of the Harness. We'll update the leaderboard as soon as possible :)
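For anyone curious, here is a minimal sketch of the issue (not the Harness's actual code; the model path is just the local clone from the script above, and any llama tokenizer shows the same behavior). SentencePiece-based llama tokenizers fold a leading space into the following token, so encoding a context and its continuation separately can produce different token ids than encoding them together:

# Minimal sketch of the llama whitespace issue, not the Harness's code.
from transformers import AutoTokenizer

# path reused from the reproduction script above
tok = AutoTokenizer.from_pretrained("../llama-30b-instruct-2048")

context = "Question: What is the capital of France?\nAnswer:"
continuation = " Paris"

# encode context and continuation separately, then concatenate
separate = (tok.encode(context, add_special_tokens=False)
            + tok.encode(continuation, add_special_tokens=False))
# encode the full string in one pass
together = tok.encode(context + continuation, add_special_tokens=False)

# If these differ, log-likelihoods scored on the separate encoding
# penalize llama models relative to the joint encoding.
print(separate == together)

My understanding is that the fix scores the pair by encoding it together and splitting the token ids afterwards, so llama models are no longer penalized for the re-tokenized leading space.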
Hey @wonhosong! Thanks for your feedback :) When you say 65.19, is it the acc or the acc_norm score? I just reran your model and got acc=0.620 and acc_norm=0.649 on ARC challenge.
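For reference, here is a minimal sketch of the difference between the two metrics, using hypothetical scores and answers: acc takes the argmax of the raw summed log-likelihoods, while acc_norm (which, as far as I know, is what the Harness reports for ARC) first normalizes each score by the completion's byte length.

# Sketch of acc vs acc_norm; all values below are hypothetical.
import numpy as np

lls = np.array([-12.3, -9.8, -11.1, -10.4])        # summed log-likelihoods
choices = ["Paris", "London", "Berlin", "Madrid"]  # answer strings
gold = 1                                           # index of the gold answer

# acc: highest raw log-likelihood wins
acc = float(np.argmax(lls) == gold)

# acc_norm: divide each score by the completion's byte length first
byte_lens = np.array([len(c.encode("utf-8")) for c in choices])
acc_norm = float(np.argmax(lls / byte_lens) == gold)

print(acc, acc_norm)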
@SaylorTwift It's acc_norm! I've confirmed that our model was evaluated correctly: there is no difference between the local and public scores on the updated leaderboard. Thank you :)