RC-Qwen2VL-2B
Arxiv | Checkpoint | Dataset | Github | Personalized Conversation
Introduction
RC-Qwen2-VL is a model focused on Region-level Context-aware Multimodal Understanding (RCMU).
- Objective: To address the key challenge of integrating the visual content of specific objects in an image with their associated text for comprehensive understanding.
- Method: Trained with the RCVIT (Region-level Context-aware Visual Instruction Tuning) method on the RCMU dataset, the model uses bounding boxes to precisely link visual content with its associated text.
- Performance & Applications: It achieves strong performance on RCMU tasks and has been successfully applied in advanced scenarios such as multimodal RAG and personalized conversation.
Requirements:
Refer to Qwen2-VL for the requirements. The code of Qwen2-VL is included in the latest Hugging Face transformers, and we advise you to build from source with the following command, or you might encounter the error KeyError: 'qwen2_vl':
pip install git+https://github.com/huggingface/transformers
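If you are unsure whether your installed transformers build is recent enough, a quick import check (a sketch, not part of the upstream instructions) will surface the problem immediately:

```python
# Sanity check: older transformers releases do not register the "qwen2_vl"
# model type and fail with KeyError: 'qwen2_vl' when loading the config.
from transformers import Qwen2VLForConditionalGeneration  # noqa: F401

print("transformers includes Qwen2-VL support.")
```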
Quickstart:
We offer a toolkit to help you handle various types of visual input more conveniently. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:
pip install qwen-vl-utils
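Below is a minimal inference sketch, assuming RC-Qwen2VL-2B can be loaded with the standard Qwen2-VL classes from transformers together with qwen-vl-utils; the image URL and the question are placeholders, and the exact prompt format for region-level (bounding-box) context is defined by the RCMU dataset rather than by this example:

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the RC-Qwen2VL-2B checkpoint with the standard Qwen2-VL classes.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "weihongliang/RC-Qwen2VL-2b", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("weihongliang/RC-Qwen2VL-2b")

# A placeholder image and question; region-level prompts would additionally
# encode bounding boxes in the format used by the RCMU dataset.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpg"},
            {"type": "text", "text": "Describe the person in the image."},
        ],
    }
]

# Build model inputs from the chat template and the extracted vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt"
).to(model.device)

# Generate and decode only the newly produced tokens.
generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```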
Model: weihongliang/RC-Qwen2VL-2b
Base model: Qwen/Qwen2-VL-2B