- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
  Paper • 2501.00958 • Published • 92
- ProgCo: Program Helps Self-Correction of Large Language Models
  Paper • 2501.01264 • Published • 24
- VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
  Paper • 2501.01957 • Published • 34
- BoostStep: Boosting mathematical capability of Large Language Models via improved single-step reasoning
  Paper • 2501.03226 • Published • 33
Collections including paper arxiv:2501.05453
- Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
  Paper • 2501.04001 • Published • 36
- LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
  Paper • 2501.03895 • Published • 43
- An Empirical Study of Autoregressive Pre-training from Videos
  Paper • 2501.05453 • Published • 29
- Rethinking Data Selection at Scale: Random Selection is Almost All You Need
  Paper • 2410.09335 • Published • 16
- From Generalist to Specialist: Adapting Vision Language Models via Task-Specific Visual Instruction Tuning
  Paper • 2410.06456 • Published • 36
- Emergent properties with repeated examples
  Paper • 2410.07041 • Published • 8
- Personalized Visual Instruction Tuning
  Paper • 2410.07113 • Published • 70
- LLM Pruning and Distillation in Practice: The Minitron Approach
  Paper • 2408.11796 • Published • 57
- TableBench: A Comprehensive and Complex Benchmark for Table Question Answering
  Paper • 2408.09174 • Published • 51
- To Code, or Not To Code? Exploring Impact of Code in Pre-training
  Paper • 2408.10914 • Published • 41
- Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications
  Paper • 2408.11878 • Published • 53
- DocLLM: A layout-aware generative language model for multimodal document understanding
  Paper • 2401.00908 • Published • 181
- COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
  Paper • 2401.00849 • Published • 15
- LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
  Paper • 2311.05437 • Published • 48
- LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing
  Paper • 2311.00571 • Published • 41