With the arrival of Twinkle April, Twinkle AI's annual open-source celebration, our community is excited to unveil its very first project:
Twinkle Eval (https://github.com/ai-twinkle/Eval), a next-generation evaluation tool led by our contributor @tedslin.
Unlike traditional evaluation tools such as iKala's ievals (https://github.com/ikala-ai/ievals), which can only evaluate language models (LMs) one sample at a time, Twinkle Eval is designed with Large Reasoning Models (LRMs) in mind. As reasoning time increases with more complex models, traditional tools become increasingly inefficient; for example, evaluating an LRM on the ikala/tmmluplus benchmark could take half a day without finishing.
One question we were especially curious about:
Does shuffling the order of multiple-choice answers affect model accuracy?
See: "Changing Answer Order Can Decrease MMLU Accuracy" (arXiv:2406.19470).
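For readers who have not seen this trick before, here is a minimal sketch (not Twinkle Eval's actual code) of what shuffling a multiple-choice item means: the options are permuted and the gold answer's new position is tracked so scoring stays correct.

```python
import random

def shuffle_choices(choices: list[str], answer_index: int, rng: random.Random):
    """Permute the answer options and return them with the new index of the gold answer."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    return shuffled, order.index(answer_index)

rng = random.Random(42)  # fixed seed so the shuffle is reproducible
options, gold = shuffle_choices(["Venus", "Mercury", "Earth", "Mars"], answer_index=1, rng=rng)
print(options, "gold:", chr(ord("A") + gold))
```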
To address these challenges, Twinkle Eval brings three key innovations to the table:
1️⃣ Parallelized evaluation of samples
2️⃣ Multi-round testing for stability
3️⃣ Randomized answer order to test robustness (a rough sketch of all three ideas follows below)
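Here is a rough, self-contained sketch of how these three ideas fit together. It is illustrative only, not the project's implementation: `ask_model` is a hypothetical stand-in for a real (slow) model API call, and the dataset is a toy example.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

# Hypothetical stand-in for a real model API call: it just guesses an option at
# random, but in practice this is where the slow LRM request would happen.
def ask_model(question: str, choices: list[str]) -> int:
    return random.randrange(len(choices))

def evaluate_once(item: dict, seed: int) -> bool:
    """Score a single item with its answer order shuffled (idea 3)."""
    rng = random.Random(seed)
    order = list(range(len(item["choices"])))
    rng.shuffle(order)
    shuffled = [item["choices"][i] for i in order]
    prediction = ask_model(item["question"], shuffled)
    return prediction == order.index(item["answer"])

def evaluate(dataset: list[dict], rounds: int = 3, workers: int = 8) -> list[float]:
    """Run several independent rounds (idea 2), scoring every item in parallel (idea 1)."""
    scores = []
    for r in range(rounds):
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(
                lambda pair: evaluate_once(pair[1], seed=1000 * r + pair[0]),
                enumerate(dataset),
            ))
        scores.append(mean(results))  # booleans average to an accuracy between 0 and 1
    return scores

dataset = [
    {"question": "2 + 2 = ?", "choices": ["3", "4", "5", "22"], "answer": 1},
    {"question": "Capital of Japan?", "choices": ["Kyoto", "Osaka", "Tokyo", "Nara"], "answer": 2},
]
per_round = evaluate(dataset)
print("accuracy per round:", per_round, "mean:", mean(per_round))
```

Because each request spends most of its time waiting on the model's reasoning, sending many requests concurrently is where most of the wall-clock savings come from.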
After running experiments, we observed that Twinkle Eval can speed up evaluation by up to 15×. Interestingly, most models scored slightly lower under the 2️⃣ and 3️⃣ test settings than their claimed performance, suggesting further benchmarking is needed.
This framework also comes with additional tunable parameters and detailed logging of LM behavior on every question, perfect for those who want to dive deeper.
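To give a flavour of the kind of knobs and per-question records involved, here is a hypothetical configuration and log entry. All field names below are made up for illustration and are not Twinkle Eval's actual schema.

```python
import json

# Hypothetical run configuration; parameter names are illustrative only.
config = {
    "model": "my-model",
    "dataset": "ikala/tmmluplus",
    "rounds": 3,              # repeat the whole benchmark for more stable scores
    "shuffle_choices": True,  # randomize answer order on every round
    "parallel_requests": 8,   # how many samples to evaluate concurrently
    "temperature": 0.0,
}

# Hypothetical per-question log record, appended as one JSON line per answer.
record = {
    "question_id": "tmmluplus-0001",
    "round": 1,
    "choice_order": [2, 0, 3, 1],
    "model_answer": "B",
    "gold_answer": "B",
    "correct": True,
    "latency_s": 4.7,
}
with open("eval_log.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(record, ensure_ascii=False) + "\n")
```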
If you find Twinkle Eval useful, please ⭐ the project and help spread the word!