On Many-Shot In-Context Learning for Long-Context Evaluation Paper • 2411.07130 • Published Nov 11, 2024 • 7
ExpertLongBench: Benchmarking Language Models on Expert-Level Long-Form Generation Tasks with Structured Checklists Paper • 2506.01241 • Published Jun 2 • 9
CLASH: Evaluating Language Models on Judging High-Stakes Dilemmas from Multiple Perspectives Paper • 2504.10823 • Published Apr 15 • 15
MLRC-Bench: Can Language Agents Solve Machine Learning Research Challenges? Paper • 2504.09702 • Published Apr 13 • 18