AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling • arXiv:2402.12226 • Published Feb 19, 2024
M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition • arXiv:2401.11649 • Published Jan 22, 2024
Gen4Gen: Generative Data Pipeline for Generative Multi-Concept Composition • arXiv:2402.15504 • Published Feb 23, 2024
EMO: Emote Portrait Alive - Generating Expressive Portrait Videos with Audio2Video Diffusion Model under Weak Conditions • arXiv:2402.17485 • Published Feb 27, 2024
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web • arXiv:2402.17553 • Published Feb 27, 2024
Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning • arXiv:2309.02591 • Published Sep 5, 2023
ChatMusician: Understanding and Generating Music Intrinsically with LLM • arXiv:2402.16153 • Published Feb 25, 2024
Valley: Video Assistant with Large Language model Enhanced abilitY • arXiv:2306.07207 • Published Jun 12, 2023
LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model • arXiv:2304.15010 • Published Apr 28, 2023
Otter: A Multi-Modal Model with In-Context Instruction Tuning • arXiv:2305.03726 • Published May 5, 2023
DeepSeek-VL: Towards Real-World Vision-Language Understanding • arXiv:2403.05525 • Published Mar 8, 2024
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context • arXiv:2403.05530 • Published Mar 8, 2024
EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE • arXiv:2308.11971 • Published Aug 23, 2023