Those benchmark scores look insane ...
look
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
I'm curious how the benchmarks look for (programming) MT-Bench, CoT, HumanEval+, LM-Eval ... how can I run those myself, or where can I find the results?
Benchmark scores can be misleading, so take them all with a grain of salt.
I haven't tested against those benchmarks; it takes a lot of time and resources to run some of them. I may try a few, but on some (AlpacaEval, for example) this model performs worse than others because it is uncensored and answers "bad" questions.
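If anyone wants to try it themselves, EleutherAI's lm-evaluation-harness is probably the easiest starting point. Here's a rough sketch; the model repo and task names are placeholders, and the exact Python API and task names can differ between harness versions, so treat it as a starting point rather than a recipe:

```python
# Rough sketch using EleutherAI's lm-evaluation-harness (pip install lm-eval).
# The model repo below is a placeholder -- swap in whichever model you want to test,
# and pick tasks supported by your harness version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    # extra args (dtype, etc.) are passed through to the HF model loader
    model_args="pretrained=your-org/your-70b-model,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "truthfulqa_mc2"],  # Open LLM Leaderboard-style tasks
    num_fewshot=5,
    batch_size=4,
)

# Per-task metrics end up under results["results"]
for task, metrics in results["results"].items():
    print(task, metrics)
```

Fair warning: on a 70B you'll want multiple GPUs (or aggressive quantization), and each task can take hours, which is exactly why I haven't run the full suite.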
Any chance you could just run MT-Bench? Since it's a writing/ERP-focused model rather than a coding-focused one, it'd be one of the better benchmarks to run (if you have the time and resources, of course). Thanks again for your work.
I'll take a look! This model also has quite a few coding instructions, so it may actually do fairly well. The focus is actually much heavier on coding and reasoning than on creative tasks/RP.
Man, this is the second time I'm writing to you. You and @TheBloke are my heroes (and everyone's); thanks a lot for all the effort you put in.
Congratulations on the top spot. You are a one-man army and I wish you all the best.
Respect!
These models are great, but how well do they compare to llama 2 chat models in multi-turn conversations? Is there a benchmark for this?
Not sure what happened, but the scores dropped on the leaderboard :(
It had some contamination so I purged and rebuilt.
These models are great, but how well do they compare to llama 2 chat models in multi-turn conversations? Is there a benchmark for this?
I'm not aware of a benchmark for this purpose on 70b models. I know there are some benchmarks others have done for RP but they tend to stop at 34b.