dipta007's Collections: VLM
Papers

- MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer (arXiv:2401.10208)
- ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities (arXiv:2305.11172)
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video (arXiv:2302.00402)
- Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities (arXiv:2308.12966)
- Unified Model for Image, Video, Audio and Language Tasks (arXiv:2307.16184)
- Foundational Models Defining a New Era in Vision: A Survey and Outlook (arXiv:2307.13721)
- InstructDiffusion: A Generalist Modeling Interface for Vision Tasks (arXiv:2309.03895)
- InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks (arXiv:2312.14238)
- MMBench: Is Your Multi-modal Model an All-around Player? (arXiv:2307.06281)
- GPT4All: An Ecosystem of Open Source Compressed Language Models (arXiv:2311.04931)
- DeepSeek-VL: Towards Real-World Vision-Language Understanding (arXiv:2403.05525)
Models

- nvidia/NVLM-D-72B (Image-Text-to-Text)
- Qwen/Qwen2-VL-72B-Instruct-AWQ (Image-Text-to-Text)
- Qwen/Qwen2-VL-7B-Instruct (Image-Text-to-Text)
- Qwen/Qwen2-VL-72B-Instruct (Image-Text-to-Text)
- HuggingFaceM4/Idefics3-8B-Llama3 (Image-Text-to-Text)
- mistralai/Pixtral-12B-2409
- OpenGVLab/InternVL2-8B (Image-Text-to-Text)
- OpenGVLab/InternVL2-4B (Image-Text-to-Text)
- OpenGVLab/InternVL2-Llama3-76B (Image-Text-to-Text)