About DROP results within the `lm-eval-harness`

#13 opened by alvarobartt

Hi there! I'm curious about the huge gap w.r.t. Mistral on the DROP benchmark of the lm-eval-harness. Did you use the same revision of EleutherAI/lm-eval-harness? Also, did you run any other evaluations to understand why it excels so much at DROP compared to other SFT + DPO fine-tunes, e.g. Zephyr? Could there be any data contamination from the datasets used for training?

A bunch of questions πŸ˜… Feel free to answer if you've already looked into the DROP results; the gap compared to other models seems huge and would be nice to investigate. Maybe the data just has better quality!
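
For reference (and so we can compare apples to apples), here's a minimal sketch of what a leaderboard-style 3-shot DROP run looks like, assuming the v0.4+ Python API of the harness; older revisions expose a roughly similar call under `lm_eval.evaluator.simple_evaluate` with `model="hf-causal-experimental"`:

```python
# Minimal sketch, not the exact leaderboard command: task configs, flags and
# metric names differ between lm-eval-harness revisions, hence the question
# about which commit was used.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face causal-LM backend
    model_args="pretrained=Intel/neural-chat-7b-v3-1,dtype=bfloat16",
    tasks=["drop"],
    num_fewshot=3,                                 # the Open LLM Leaderboard ran DROP 3-shot
    batch_size=1,
)
print(results["results"]["drop"])                  # exact-match / F1 scores for DROP
```

Pinning the same harness commit (and generation settings) when comparing against Zephyr or Mistral would at least rule out harness differences as the cause of the gap.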

I'm only chiming in so I get notified if someone answers this question, because I'm also interested in why this LLM has a much higher DROP score than other SFT + DPO LLMs like Zephyr.

Hi, the datasets used are listed in the model card. We found that the DROP metric decreases during training, so early stopping is needed.
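
For anyone following along, here is a rough sketch of what that kind of metric-based early stopping could look like; the `evaluate_drop` helper and the checkpoint paths below are hypothetical, not the actual training code:

```python
# Hypothetical sketch of early stopping on the DROP score while fine-tuning.
# Checkpoint names and the evaluate_drop helper are illustrative only.
import lm_eval

def evaluate_drop(checkpoint_dir: str) -> float:
    """Run 3-shot DROP on a saved checkpoint and return its F1 score."""
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={checkpoint_dir}",
        tasks=["drop"],
        num_fewshot=3,
        batch_size=1,
    )
    drop_metrics = results["results"]["drop"]
    # the metric key is "f1" in older harness versions and "f1,none" in v0.4+
    return drop_metrics.get("f1", drop_metrics.get("f1,none"))

best_f1, patience, bad_evals = 0.0, 2, 0
for ckpt in ["ckpt-500", "ckpt-1000", "ckpt-1500"]:   # illustrative checkpoint dirs
    f1 = evaluate_drop(ckpt)
    if f1 > best_f1:
        best_f1, bad_evals = f1, 0                    # DROP still improving, keep going
    else:
        bad_evals += 1
        if bad_evals >= patience:                     # DROP degraded for `patience` evals in a row
            print(f"Stopping early at {ckpt}; best DROP F1 = {best_f1:.3f}")
            break
```

The patience value and evaluation cadence here are arbitrary; the point is simply that the checkpoint you keep is the one selected before DROP starts to degrade.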

True, I've submitted a PR at https://huggingface.co/Intel/neural-chat-7b-v3-1/discussions/15 to enrich the metadata of the model card in the README.md πŸ€—

Hmm, did you evaluate the model during training using the lm-eval-harness? Was DROP part of your evaluation set? Could you please elaborate on that? I think it's a really interesting topic!
