Stick To Your Role! Leaderboard 🎭: Benchmarking LLMs on the stability of simulated populations
MMLU-Pro Leaderboard 🥇: More advanced and challenging multi-task evaluation