MORSE-500: A Programmatically Controllable Video Benchmark to Stress-Test Multimodal Reasoning • Paper • 2506.05523 • Published Jun 5, 2025
MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos • Paper • 2506.04141 • Published Jun 4, 2025
Reinforced Internal-External Knowledge Synergistic Reasoning for Efficient Adaptive Search Agent • Paper • 2505.07596 • Published May 12, 2025
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment • Paper • 2412.13746 • Published Dec 18, 2024
RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models • Paper • 2406.10890 • Published Jun 16, 2024