Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities Paper • 2502.05209 • Published Feb 3 • 1
MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning Paper • 2506.05523 • Published 12 days ago • 32