Score gap in ARC challenge
Hello! First of all, big thanks for running the Open LLM Leaderboard.
I saw our model's (upstage/llama-30b-instruct-2048) score on the leaderboard and noticed a gap, so I'm reaching out to you. The arc_challenge score on the leaderboard is 58.3, but my local reproduction gives 65.19.
Here are the scripts I used for the local evaluation:
# download model weights
git clone https://huggingface.co/upstage/llama-30b-instruct-2048
# load evaluation code (pinned to the commit used for this run;
# the checkout must happen inside the cloned repository)
git clone https://github.com/EleutherAI/lm-evaluation-harness.git
cd lm-evaluation-harness
git checkout b281b0921b636bc36ad05c0b0b0763bd6dd43463
# run evaluation scripts
python main.py --model=hf-causal --model_args="pretrained=../llama-30b-instruct-2048" --tasks=arc_challenge --num_fewshot=25 --batch_size=2 --no_cache
I also saw other teams reporting score gaps, and I understand there was an error in the evaluation code. Could you please rerun our models as well? The model names are upstage/llama-30b-instruct-2048 and upstage/llama-30b-instruct.
I am also experiencing this with ariellee/SuperPlatty-30B and lilloukas/GPlatty-30B: the leaderboard lists 59.2 and 60.1, respectively, instead of 66.1 and 66. Could you rerun those two as well? Thank you!
Hi! @SaylorTwift is re-running all llama-based models at the moment: llama models handle whitespace tokens differently, which means they were handicapped by the previous version of the Harness. We'll update the leaderboard as soon as possible :)
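For anyone curious, here is a minimal sketch of the issue (not the Harness's actual code; the model path is just the local clone from the script above, and any llama tokenizer shows the same behavior). SentencePiece-based llama tokenizers fold a leading space into the following token, so encoding a context and its continuation separately can produce different token ids than encoding them together:

# Minimal sketch of the llama whitespace issue, not the Harness's code.
from transformers import AutoTokenizer

# path reused from the reproduction script above
tok = AutoTokenizer.from_pretrained("../llama-30b-instruct-2048")

context = "Question: What is the capital of France?\nAnswer:"
continuation = " Paris"

# encode context and continuation separately, then concatenate
separate = (tok.encode(context, add_special_tokens=False)
            + tok.encode(continuation, add_special_tokens=False))
# encode the full string in one pass
together = tok.encode(context + continuation, add_special_tokens=False)

# If these differ, log-likelihoods scored on the separate encoding
# penalize llama models relative to the joint encoding.
print(separate == together)

My understanding is that the fix scores the pair by encoding it together and splitting the token ids afterwards, so llama models are no longer penalized for the re-tokenized leading space.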
Hey @wonhosong! Thanks for your feedback :) When you say 65.19, is it the acc or the acc_norm score? I just reran your model and got acc=0.620 and acc_norm=0.649 on ARC challenge.
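For reference, here is a minimal sketch of the difference between the two metrics, using hypothetical scores and answers: acc takes the argmax of the raw summed log-likelihoods, while acc_norm (which, as far as I know, is what the Harness reports for ARC) first normalizes each score by the completion's byte length.

# Sketch of acc vs acc_norm; all values below are hypothetical.
import numpy as np

lls = np.array([-12.3, -9.8, -11.1, -10.4])        # summed log-likelihoods
choices = ["Paris", "London", "Berlin", "Madrid"]  # answer strings
gold = 1                                           # index of the gold answer

# acc: highest raw log-likelihood wins
acc = float(np.argmax(lls) == gold)

# acc_norm: divide each score by the completion's byte length first
byte_lens = np.array([len(c.encode("utf-8")) for c in choices])
acc_norm = float(np.argmax(lls / byte_lens) == gold)

print(acc, acc_norm)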
@SaylorTwift It's acc_norm! I've confirmed that our model was evaluated correctly: there is no difference between the local and public scores on the updated leaderboard. Thank you :)