Paper2Poster: Towards Multimodal Poster Automation from Scientific Papers Paper • 2505.21497 • Published May 27 • 105
Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models Paper • 2505.16854 • Published May 22 • 11
VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary Paper • 2503.09402 • Published Mar 12 • 8
ROICtrl: Boosting Instance Control for Visual Generation Paper • 2411.17949 • Published Nov 27, 2024 • 88
ShowUI: One Vision-Language-Action Model for GUI Visual Agent Paper • 2411.17465 • Published Nov 26, 2024 • 88
VideoLLM-online: Online Video Large Language Model for Streaming Video Paper • 2406.11816 • Published Jun 17, 2024 • 25
VideoGUI: A Benchmark for GUI Automation from Instructional Videos Paper • 2406.10227 • Published Jun 14, 2024 • 9
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training Paper • 2401.00849 • Published Jan 1, 2024 • 17
Too Large; Data Reduction for Vision-Language Pre-Training Paper • 2305.20087 • Published May 31, 2023
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone Paper • 2307.05463 • Published Jul 11, 2023 • 11
VisorGPT: Learning Visual Prior via Generative Pre-Training Paper • 2305.13777 • Published May 23, 2023
UniVTG: Towards Unified Video-Language Temporal Grounding Paper • 2307.16715 • Published Jul 31, 2023 • 11
AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn Paper • 2306.08640 • Published Jun 14, 2023 • 26