Those benchmark scores look insane ...
look
https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard
I'm curious how the benchmarks look for (programming) MT-Bench, CoT, HumanEval+, LM-Eval ... how can I run those myself, or where can I find the results?
Benchmark scores can be misleading, so take them all with a grain of salt.
I haven't tested against those benchmarks; it takes a lot of time and resources to run some of them. I may try a few, but on some (AlpacaEval, for example) this model performs worse than others because it is uncensored and answers "bad" questions.
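If anyone wants to try it themselves, EleutherAI's lm-evaluation-harness is probably the easiest starting point. Here's a rough sketch; the model repo and task names are placeholders, and the exact Python API and task names can differ between harness versions, so treat it as a starting point rather than a recipe:

```python
# Rough sketch using EleutherAI's lm-evaluation-harness (pip install lm-eval).
# The model repo below is a placeholder -- swap in whichever model you want to test,
# and pick tasks supported by your harness version.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    # extra args (dtype, etc.) are passed through to the HF model loader
    model_args="pretrained=your-org/your-70b-model,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "truthfulqa_mc2"],  # Open LLM Leaderboard-style tasks
    num_fewshot=5,
    batch_size=4,
)

# Per-task metrics end up under results["results"]
for task, metrics in results["results"].items():
    print(task, metrics)
```

Fair warning: on a 70B you'll want multiple GPUs (or aggressive quantization), and each task can take hours, which is exactly why I haven't run the full suite.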
Any chance you could just run MT-Bench? Since it's a writing/ERP-focused model rather than a coding-focused one, it'd be one of the better benchmarks to run (if you have the time and resources, of course). Thanks again for your work.
I'll take a look! This model also has quite a few coding instructions, so it may actually do fairly well. The focus is actually much heavier on coding and reasoning than on creative tasks/RP.
Man, this is the second time I'm writing to you. You and @TheBloke are my heroes (and everyone's); thanks a lot for all the effort you put in.
Congratulations on the top spot. You are a one-man army and I wish you all the best.
Respect!
These models are great, but how well do they compare to llama 2 chat models in multi-turn conversations? Is there a benchmark for this?
Not sure what happened, but the scores dropped on the leaderboard :(
It had some contamination so I purged and rebuilt.
These models are great, but how well do they compare to llama 2 chat models in multi-turn conversations? Is there a benchmark for this?
I'm not aware of a benchmark for this purpose on 70b models. I know there are some benchmarks others have done for RP but they tend to stop at 34b.