MMLU Pro
More advanced and challenging multi-task evaluation
Compare LLMs on role consistency across contexts
Embed and use ZeroEval for evaluation tasks
Display model leaderboard evaluations
Browse and submit LLM evaluations
Compact LLM Battle Arena: Frugal AI Face-Off!
VLMEvalKit evaluation results on video understanding benchmarks
Track, rank and evaluate open LLMs and chatbots
Blind vote on HF TTS models!