ARGUS: Hallucination and Omission Evaluation in Video-LLMs Paper • 2506.07371 • Published 8 days ago • 8
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning Paper • 2506.05523 • Published 11 days ago • 32
Can Language Models Falsify? Evaluating Algorithmic Reasoning with Counterexample Creation Paper • 2502.19414 • Published Feb 26 • 20
Has My System Prompt Been Used? Large Language Model Prompt Membership Inference Paper • 2502.09974 • Published Feb 14 • 9
Has My System Prompt Been Used? Large Language Model Prompt Membership Inference Paper • 2502.09974 • Published Feb 14 • 9
Gemstones: A Model Suite for Multi-Faceted Scaling Laws Paper • 2502.06857 • Published Feb 7 • 25
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach Paper • 2502.05171 • Published Feb 7 • 142
Cut Your Losses in Large-Vocabulary Language Models Paper • 2411.09009 • Published Nov 13, 2024 • 50
Running 2 2 CoTaEval Leaderboard 🚀 View and filter a leaderboard of language model evaluation results