Collections

DocLLM: A layout-aware generative language model for multimodal document understanding

Paper • 2401.00908 • Published Dec 31, 2023 • 184

COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training

Paper • 2401.00849 • Published Jan 1, 2024 • 17

LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Paper • 2311.05437 • Published Nov 9, 2023 • 50

LLaVA-Interactive: An All-in-One Demo for Image Chat, Segmentation, Generation and Editing

Paper • 2311.00571 • Published Nov 1, 2023 • 41

FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

chat-ui

starriver030515/FUSION-LLaMA3.1-8B

starriver030515/FUSION-X-LLaMA3.1-8B

starriver030515/FUSION-X-Phi3.5-3B

starriver030515/FUSION-Phi3.5-3B

starriver030515/FUSION-Finetune-12M

starriver030515/FUSION-Pretrain-10M

starriver030515/FUSION-Synth-4M

starriver030515/FUSION-Eval

Gemma 3 Technical Report

Kimi-VL Technical Report

InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

MLLM-as-a-Judge for Image Safety without Human Labeling

2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining

Xmodel-2 Technical Report

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

LLM Pruning and Distillation in Practice: The Minitron Approach

TableBench: A Comprehensive and Complex Benchmark for Table Question Answering

To Code, or Not To Code? Exploring Impact of Code in Pre-training

Open-FinLLMs: Open Multimodal Large Language Models for Financial Applications

SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers

MAVIS: Mathematical Visual Instruction Tuning

Kvasir-VQA: A Text-Image Pair GI Tract Dataset

MMEvol: Empowering Multimodal Large Language Models with Evol-Instruct

iVideoGPT: Interactive VideoGPTs are Scalable World Models

Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models

An Introduction to Vision-Language Modeling

Matryoshka Multimodal Models

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

ScreenAI: A Vision-Language Model for UI and Infographics Understanding

EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss