LLaVE is a series of large language and vision embedding models trained on a variety of multimodal embedding datasets.
- zhibinlan/LLaVE-0.5B (Image-Text-to-Text)
- zhibinlan/LLaVE-2B (Image-Text-to-Text)
- zhibinlan/LLaVE-7B (Image-Text-to-Text)
- LLaVE: Large Language and Vision Embedding Models with Hardness-Weighted Contrastive Learning (paper, arXiv:2503.04812)
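The paper title refers to hardness-weighted contrastive learning, i.e. up-weighting negatives that are more similar to the query. The following is a generic single-query sketch of that idea on top of an InfoNCE-style loss; the weighting scheme (`alpha`, mean-one normalization) is an illustrative assumption, not necessarily the paper's exact formulation.

```python
import numpy as np

def hardness_weighted_infonce(q, pos, negs, tau=0.05, alpha=1.0):
    """InfoNCE loss for one query with hardness-weighted negatives.

    Negatives more similar to the query (harder) get larger weights.
    Illustrative sketch only; alpha and the normalization are assumptions.
    """
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    s_pos = cos(q, pos) / tau                       # positive logit
    s_neg = np.array([cos(q, n) for n in negs]) / tau  # negative logits

    # Hardness weights: exponential in negative similarity,
    # rescaled so the average weight over negatives is 1.
    w = np.exp(alpha * s_neg)
    w = w * len(negs) / w.sum()

    # Weighted softmax denominator over positive + negatives.
    denom = np.exp(s_pos) + np.sum(w * np.exp(s_neg))
    return float(-(s_pos - np.log(denom)))
```

With `alpha = 0` every negative gets weight 1 and the loss reduces to plain InfoNCE, which makes the hardness term easy to ablate.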