
Febriyanto

arpenxd

AI & ML interests

None yet

Recent Activity

replied to lianghsun's post 2 days ago
With the arrival of Twinkle April, Twinkle AI's annual open-source celebration held every April, our community is excited to unveil its very first project: 📊 Twinkle Eval (https://github.com/ai-twinkle/Eval), a next-generation evaluation tool led by our contributor @tedslin.

Unlike traditional evaluation tools such as iKala's ievals (https://github.com/ikala-ai/ievals), which can only evaluate language models (LMs) one sample at a time, Twinkle Eval is designed with Large Reasoning Models (LRMs) in mind. As reasoning time grows with more complex models, traditional tools become increasingly inefficient 😲; for example, evaluating LRMs on the https://huggingface.co/datasets/ikala/tmmluplus benchmark could take half a day without finishing.

One question we were especially curious about: does shuffling the order of multiple-choice answers affect model accuracy? 🤔 See "Changing Answer Order Can Decrease MMLU Accuracy" (arXiv:2406.19470v1).

To address these challenges, Twinkle Eval brings three key innovations to the table:
1️⃣ Parallelized evaluation of samples
2️⃣ Multi-round testing for stability
3️⃣ Randomized answer order to test robustness

In our experiments, Twinkle Eval sped up evaluation by up to 15× 🚀🚀. Interestingly, most models scored slightly lower under the 2️⃣ and 3️⃣ test settings than their claimed performance, suggesting further benchmarking is needed.

The framework also comes with additional tunable parameters and detailed per-question logging of LM behavior, perfect for those who want to dive deeper. 😆 If you find Twinkle Eval useful, please ⭐ the project and help spread the word 🤗
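The three ideas above can be combined in a few lines of Python. The sketch below is not Twinkle Eval's actual implementation; `query_model` is a toy stand-in (it always picks the first option, a deliberate position bias) and `run_eval` is a hypothetical helper that shuffles answer order per sample, fans the calls out over a thread pool, and averages accuracy across rounds:

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def query_model(question: str, choices: list[str]) -> str:
    # Toy stand-in for a real LM API call: always answers with the
    # first listed option, so shuffling exposes its position bias.
    return choices[0]

def evaluate_sample(sample: dict, shuffle: bool, rng: random.Random) -> bool:
    choices = list(sample["choices"])
    if shuffle:
        rng.shuffle(choices)  # innovation 3: randomized answer order
    prediction = query_model(sample["question"], choices)
    return prediction == sample["answer"]

def run_eval(dataset, rounds=3, shuffle=True, workers=8, seed=0):
    """Score every sample in parallel, repeat for several rounds,
    and report the mean accuracy across rounds."""
    round_scores = []
    for r in range(rounds):  # innovation 2: multi-round testing
        # One deterministic RNG per sample keeps shuffles reproducible.
        jobs = [(s, shuffle, random.Random(seed + r * 10_000 + i))
                for i, s in enumerate(dataset)]
        # Innovation 1: parallelized evaluation of samples.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(lambda args: evaluate_sample(*args), jobs))
        round_scores.append(sum(results) / len(results))
    return mean(round_scores)
```

With `shuffle=False` the biased toy model scores a perfect 1.0 whenever the gold answer is listed first; turning shuffling on drops its accuracy toward chance, which is exactly the robustness gap this kind of evaluation is meant to surface.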
View all activity

Organizations

None yet

models

None public yet

datasets

None public yet