Tracks performance of LLMs, VLMs, and agents on web navigation tasks
Customize interface to judge inequalities
Evaluate AI models on gameplay
BOOM: Benchmark Of Observability Metrics
Leaderboard of LLMs based on detailed human feedback
First benchmark testing LLM guards on safety and accuracy.
LLM Robustness leaderboard
This is the FACTS Grounding Leaderboard, but for open LLMs!
Display and filter LLM benchmark results
A byte-level map of the Hugging Face Hub
Track the history of follows for organizations and users on Hugging Face
Trace Reasoning and Agentic Issue Localization Leaderboard
Interact with an agent to perform web-based tasks
Show detailed model outputs for specific benchmarks