cannot reproduce gsm8k score with vllm

#596
by HanNayeoniee - opened

Dear Maintainers,
thanks for making and sharing this leaderboard!

This is what I've done so far.

I submitted my model to this leaderboard and was able to reproduce its score with the harness version specified in the About tab (b281b09).
However, evaluating gsm8k alone took more than 15 hours, which is too long.

So I tried evaluating it with vllm from the main branch, which took about 1.5 hours. (I think that is worth the wait.)
But even though I used the same few-shot examples and batch_size as the leaderboard, I couldn't reproduce the score.

I got a gsm8k score of 72.48 with vllm, while the leaderboard reports 68.54: https://huggingface.co/datasets/open-llm-leaderboard/details_HanNayeoniee__LHK_DPO_v1
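To put the gap in concrete terms, here is a rough sketch that converts both accuracies into counts of correctly answered questions. It assumes both runs scored the full GSM8K test split of 1,319 problems; the constant and the helper name are illustrative and not taken from the harness.

```python
# Assumption: both scores were computed over the full GSM8K test split.
GSM8K_TEST_SIZE = 1319


def correct_count(accuracy_percent: float, n: int = GSM8K_TEST_SIZE) -> int:
    """Convert an accuracy in percent into the nearest whole number of correct answers."""
    return round(accuracy_percent / 100 * n)


vllm_correct = correct_count(72.48)  # score obtained with vllm
hf_correct = correct_count(68.54)    # score reported by the leaderboard
gap = vllm_correct - hf_correct

print(f"vllm: {vllm_correct}, hf-causal: {hf_correct}, difference: {gap} questions")
```

Under that assumption, the two backends disagree on the final answer for roughly 50 questions, which seems too many to attribute to rounding alone.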

Is there any way to reproduce the score using vllm instead of hf-causal?

Open LLM Leaderboard org
edited Feb 19

Hi, thanks for your issue!

I don't know how the vllm and hf-causal inference implementations differ in the harness, but we will keep using the latter for now to ensure full reproducibility across model evaluations.
If you want to reproduce your model's results in an acceptable time, you could run hf-causal with max samples = 20 and check whether you get the same logprobs/generations for that selection. What do you think?

If you reproduce our results, then the discrepancy between hf-causal and vllm should be raised as an issue on the harness. If you don't, we might have a bug somewhere, and we'll investigate as soon as possible.
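As a rough illustration of that check, the sketch below compares per-sample generations from the two backends and lists the samples that differ. The dictionaries keyed by sample id, the function names, and the toy strings are all hypothetical placeholders; the harness does not necessarily write its outputs in this shape, so loading the real generations is left to you.

```python
from typing import Dict, List


def normalize(text: str) -> str:
    """Collapse runs of whitespace so trivial formatting differences are ignored."""
    return " ".join(text.split())


def compare_generations(
    hf_outputs: Dict[str, str], vllm_outputs: Dict[str, str]
) -> List[str]:
    """Return the ids of shared samples whose generations differ between backends."""
    shared_ids = sorted(set(hf_outputs) & set(vllm_outputs))
    return [
        sample_id
        for sample_id in shared_ids
        if normalize(hf_outputs[sample_id]) != normalize(vllm_outputs[sample_id])
    ]


# Toy placeholder data standing in for generations from a small sample of questions.
hf = {"0": "The answer is 18", "1": "The answer is 3", "2": "The answer is  70000"}
vllm = {"0": "The answer is 18", "1": "The answer is 4", "2": "The answer is 70000"}

mismatched = compare_generations(hf, vllm)
print(f"{len(mismatched)} of {len(hf)} samples differ: {mismatched}")
```

Inspecting the mismatched samples by hand should show whether the difference comes from the generated text itself (for example stop sequences or truncation) or from how the final answer is extracted.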

Open LLM Leaderboard org

Closing for inactivity. Feel free to reopen if needed.

clefourrier changed discussion status to closed
