Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion
Abstract
We present Florence-VL, a new family of multimodal large language models (MLLMs) with enriched visual representations produced by Florence-2, a generative vision foundation model. Unlike the widely used CLIP-style vision transformers trained by contrastive learning, Florence-2 can capture different levels and aspects of visual features, making them more versatile and easier to adapt to diverse downstream tasks. We propose a novel feature-fusion architecture and an innovative training recipe that effectively integrates Florence-2's visual features into pretrained LLMs, such as Phi 3.5 and Llama 3. In particular, we propose "depth-breadth fusion (DBFusion)" to fuse the visual features extracted from different depths and under multiple prompts. Our model training consists of end-to-end pretraining of the whole model followed by finetuning of the projection layer and the LLM, on a carefully designed recipe of diverse open-source datasets that include high-quality image captions and instruction-tuning pairs. Our quantitative analysis and visualization of Florence-VL's visual features show its advantages over popular vision encoders in vision-language alignment, where the enriched depth and breadth play important roles. Florence-VL achieves significant improvements over existing state-of-the-art MLLMs across various multimodal and vision-centric benchmarks covering general VQA, perception, hallucination, OCR, chart understanding, knowledge-intensive understanding, etc. To facilitate future research, our models and the complete training recipe are open-sourced. https://github.com/JiuhaiChen/Florence-VL
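The DBFusion idea described above can be sketched in a few lines: features from several encoder depths ("depth") and several task prompts ("breadth") are concatenated along the channel dimension and then projected into the LLM's embedding space. This is a minimal NumPy sketch, not the paper's implementation; the shapes, the random projection matrix standing in for a trained projector, and the function name `db_fusion` are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, dim = 16, 32

# Hypothetical feature maps: two encoder depths and three task prompts,
# each producing (num_tokens, dim) token features.
depth_feats = [rng.normal(size=(num_tokens, dim)) for _ in range(2)]
breadth_feats = [rng.normal(size=(num_tokens, dim)) for _ in range(3)]

def db_fusion(depth_feats, breadth_feats):
    # Fuse depth and breadth features by channel-wise concatenation,
    # keeping the token dimension fixed.
    return np.concatenate(depth_feats + breadth_feats, axis=-1)

fused = db_fusion(depth_feats, breadth_feats)

# A random matrix stands in for the trained projection layer that maps
# fused visual features into the LLM's hidden size (64 here, for illustration).
llm_dim = 64
projector = rng.normal(size=(fused.shape[-1], llm_dim))
visual_tokens = fused @ projector

print(fused.shape)          # (16, 160): 5 feature maps x 32 channels
print(visual_tokens.shape)  # (16, 64)
```

Channel-wise concatenation keeps the visual token count constant regardless of how many depths or prompts are fused, so the LLM's sequence length does not grow with the number of feature sources.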
Community
Microsoft Research releases Florence-VL, a new family of MLLMs powered by the generative vision foundation model Florence-2. Demo: https://huggingface.co/spaces/jiuhai/Florence-VL-8B
Hi @jiuhai, congrats on this paper! Would you be interested in setting up a Slack channel to discuss this?
Some suggestions:
- the model and demo could be transferred to the microsoft org: https://huggingface.co/microsoft
- feel free to claim the paper as yours (by clicking "claim as author" on your name)
- it would be great to add a model card, adding `pipeline_tag: image-text-to-text` as metadata so that the model can be found at https://huggingface.co/models?pipeline_tag=image-text-to-text&sort=trending
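For reference, the suggested tag goes in the YAML front matter at the top of the model repo's README.md; this is a minimal sketch, and the `license` and `library_name` values here are placeholder assumptions, not taken from the actual release.

```yaml
---
# Model card metadata (top of README.md)
pipeline_tag: image-text-to-text   # the tag suggested above
license: mit                       # placeholder; use the actual license
library_name: transformers         # placeholder; use the actual library
---
```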
Hi @jiuhai ,
First of all, congratulations on this incredible work!
I’m particularly interested in Section 4, "Analysis on Different Vision Encoders", especially regarding the cross-modal alignment between vision encoders and language models. Specifically, I’m curious about the correlation between alignment loss and the performance of the vision encoder when integrated into an MLLM.
Have you conducted any experiments to assess this correlation? Alternatively, do you think it’s simply a reasonable assumption that these two factors are correlated? If possible, could you provide any references or further insights on this topic?
Thank you!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding (2024)
- Panther: Illuminate the Sight of Multimodal LLMs with Instruction-Guided Visual Prompts (2024)
- Improving Multi-modal Large Language Model through Boosting Vision Capabilities (2024)
- ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning (2024)
- LLM2CLIP: Powerful Language Model Unlocks Richer Visual Representation (2024)
- PUMA: Empowering Unified MLLM with Multi-granular Visual Generation (2024)
- Efficient Multi-modal Large Language Models via Visual Token Grouping (2024)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: `@librarian-bot recommend`