MMHU: A Massive-Scale Multimodal Benchmark for Human Behavior Understanding Paper • 2507.12463 • Published Jul 16, 2025 • 26
A Survey on Long-Video Storytelling Generation: Architectures, Consistency, and Cinematic Quality Paper • 2507.07202 • Published Jul 9, 2025 • 22
Demystifying the Visual Quality Paradox in Multimodal Large Language Models Paper • 2506.15645 • Published Jun 18, 2025 • 4
SAFEFLOW: A Principled Protocol for Trustworthy and Transactional Autonomous Agent Systems Paper • 2506.07564 • Published Jun 9, 2025 • 6
Edit Away and My Face Will not Stay: Personal Biometric Defense against Malicious Generative Editing Paper • 2411.16832 • Published Nov 25, 2024 • 2
DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models Paper • 2505.24025 • Published May 29, 2025 • 27
MoDoMoDo: Multi-Domain Data Mixtures for Multimodal LLM Reinforcement Learning Paper • 2505.24871 • Published May 30, 2025 • 22
On the Trustworthiness of Generative Foundation Models: Guideline, Assessment, and Perspective Paper • 2502.14296 • Published Feb 20, 2025 • 46
LLaVA-3D: A Simple yet Effective Pathway to Empowering LMMs with 3D-awareness Paper • 2409.18125 • Published Sep 26, 2024 • 35
TIP: Text-Driven Image Processing with Semantic and Restoration Instructions Paper • 2312.11595 • Published Dec 18, 2023 • 6