OpenEvals
community
AI & ML interests
LLM evaluation
A small overview of our research collabs through the years
- GAIA: a benchmark for General AI Assistants
  Paper • 2311.12983
- Zephyr: Direct Distillation of LM Alignment
  Paper • 2310.16944
- SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model
  Paper • 2502.02737
- Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation
  Paper • 2412.03304
This leaderboard evaluated 7K LLMs from Apr 2023 to Jun 2024 on ARC-c, HellaSwag, MMLU, TruthfulQA, Winogrande and GSM8K.
This leaderboard has been evaluating LLMs since Jun 2024 on IFEval, MuSR, GPQA, MATH, BBH and MMLU-Pro.
- Open-LLM performances are plateauing, let’s make the leaderboard steep again
  🏔 Explore and compare advanced language models on a new leaderboard
- Open LLM Leaderboard
  🏆 Track, rank and evaluate open LLMs and chatbots
- open-llm-leaderboard/contents
- open-llm-leaderboard/results