Establishing Trustworthy LLM Evaluation via Shortcut Neuron Analysis • Paper • 2506.04142 • Published 25 days ago • 27
MMR-V: What's Left Unsaid? A Benchmark for Multimodal Deep Reasoning in Videos • Paper • 2506.04141 • Published 25 days ago • 29
LongWriter-V: Enabling Ultra-Long and High-Fidelity Generation in Vision-Language Models • Paper • 2502.14834 • Published Feb 20 • 24
WaterBench: Towards Holistic Evaluation of Watermarks for Large Language Models • Paper • 2311.07138 • Published Nov 13, 2023 • 2
KoLA: Carefully Benchmarking World Knowledge of Large Language Models • Paper • 2306.09296 • Published Jun 15, 2023 • 19
A Solution-based LLM API-using Methodology for Academic Information Seeking • Paper • 2405.15165 • Published May 24, 2024
Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack • Paper • 2406.11682 • Published Jun 17, 2024
LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks • Paper • 2412.15204 • Published Dec 19, 2024 • 38
From MOOC to MAIC: Reshaping Online Teaching and Learning through LLM-driven Agents • Paper • 2409.03512 • Published Sep 5, 2024 • 29