LLaVA-Onevision • Collection • LLaVa_Onevision models for single-image, multi-image, and video scenarios • 9 items • Updated 7 days ago • 9
Prithvi WxC: Foundation Model for Weather and Climate • Paper • 2409.13598 • Published 5 days ago • 26
AuroraCap • Collection • Efficient, Performant Video Detailed Captioning and a New Benchmark • 8 items • Updated about 16 hours ago • 1
CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets • Paper • 2406.13897 • Published May 30 • 12
InfiMM-WebMath-40B: Advancing Multimodal Pre-Training for Enhanced Mathematical Reasoning • Paper • 2409.12568 • Published 6 days ago • 44
See and Think: Embodied Agent in Virtual Environment • Paper • 2311.15209 • Published Nov 26, 2023 • 2
InternVideo2: Scaling Video Foundation Models for Multimodal Video Understanding • Paper • 2403.15377 • Published Mar 22 • 21
Meta Llama 3 • Collection • This collection hosts the transformers and original repos of the Meta Llama 3 and Llama Guard 2 releases • 5 items • Updated Aug 2 • 674
OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments • Paper • 2404.07972 • Published Apr 11 • 43
Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models • Paper • 2402.07865 • Published Feb 12 • 12
Qwen1.5 • Collection • Qwen1.5 is the improved version of Qwen, the large language model series developed by Alibaba Cloud. • 55 items • Updated 7 days ago • 205
Controllable Human-Object Interaction Synthesis • Paper • 2312.03913 • Published Dec 6, 2023 • 22
Dolphins: Multimodal Language Model for Driving • Paper • 2312.00438 • Published Dec 1, 2023 • 12
CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation • Paper • 2311.18775 • Published Nov 30, 2023 • 6
Doppelgangers: Learning to Disambiguate Images of Similar Structures • Paper • 2309.02420 • Published Sep 5, 2023 • 9
Emergence of Segmentation with Minimalistic White-Box Transformers • Paper • 2308.16271 • Published Aug 30, 2023 • 13
Global Adaptation meets Local Generalization: Unsupervised Domain Adaptation for 3D Human Pose Estimation • Paper • 2303.16456 • Published Mar 29, 2023 • 1
StableVideo: Text-driven Consistency-aware Diffusion Video Editing • Paper • 2308.09592 • Published Aug 18, 2023 • 2
MovieChat: From Dense Token to Sparse Memory for Long Video Understanding • Paper • 2307.16449 • Published Jul 31, 2023 • 15
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World • Paper • 2308.01907 • Published Aug 3, 2023 • 10
To Adapt or Not to Adapt? Real-Time Adaptation for Semantic Segmentation • Paper • 2307.15063 • Published Jul 27, 2023 • 17
DreamTeacher: Pretraining Image Backbones with Deep Generative Models • Paper • 2307.07487 • Published Jul 14, 2023 • 19