Please submit this model to the Open LLM Leaderboard

by grimjim - opened Jan 7

Discussion

grimjim

Jan 7

edited Jan 7

The leaderboard is located here.
https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/

I merged two o1 models, including yours, and the result scored an unusually high 33.99% on the MATH benchmark.
I suspect your model may be highly capable at mathematical reasoning despite its focus on medical reasoning.

T145

Jan 7

There's an issue with submitting it to the leaderboard, see here: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard/discussions/1055

Personal merge tests with this model showed very high BBH and MMLU-PRO benchmarks, so I'd expect Skywork has hidden math performance.

T145

Jan 10

@grimjim The issue should be fixed on their end! Now we just need to upvote the model: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/vote

If some votes could be thrown at my models too, I'd appreciate it! 🌴

grimjim

Jan 27

Although the result was diluted, the o1 merge above was able to uplift most benchmarks for another L3.1 8B when merged in. Every benchmark went up except IFEval.
https://huggingface.co/grimjim/SauerHuatuoSkywork-o1-Llama-3.1-8B

I have to wonder how much strength is hidden by a lack of compliance with benchmark formatting requirements. The unremarkable IFEval score may be a sign of untapped benchmark potential in HuatuoGPT-o1 8B.
