RC-Qwen2VL-2B

πŸ“‘ arXiv | πŸ€— Checkpoint | πŸ“ Dataset | GitHub | πŸš€ Personalized Conversation

Introduction

RC-Qwen2-VL is a model focused on Region-level Context-aware Multimodal Understanding (RCMU).

  • Objective: To solve the key challenge of integrating the visual content of specific objects in an image with their associated text for comprehensive understanding.
  • Method: Trained with the RCVIT (Region-level Context-aware Visual Instruction Tuning) method on the RCMU dataset, it uses bounding boxes to precisely link the visual content of objects with their associated text (see the sketch after this list).
  • Performance & Applications: It achieves outstanding performance on RCMU tasks and is successfully applied in advanced scenarios like multimodal RAG and personalized conversation.
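
The model's actual region-referencing format is defined in the paper and GitHub repository. As a rough illustration only, the sketch below shows one plausible way to serialize per-object bounding boxes and their associated text into a single prompt; the tag format, the normalized-coordinate convention, and the helper function are illustrative assumptions, not RC-Qwen2VL's real interface.

```python
# Hypothetical sketch: serializing region-level context into a prompt.
# The <bbox> tag format and normalized [0, 1] coordinates are
# illustrative assumptions, NOT RC-Qwen2VL's actual prompt format.

def build_region_context_prompt(question, objects):
    """objects: list of dicts with 'bbox' ([x1, y1, x2, y2], normalized
    to [0, 1]) and 'info' (free-form text tied to that region)."""
    lines = []
    for i, obj in enumerate(objects):
        x1, y1, x2, y2 = obj["bbox"]
        lines.append(
            f"Object {i}: <bbox>({x1:.2f},{y1:.2f}),({x2:.2f},{y2:.2f})</bbox> {obj['info']}"
        )
    return "\n".join(lines) + "\n" + question

prompt = build_region_context_prompt(
    "Who is shaking hands with the CEO?",
    [
        {"bbox": [0.10, 0.20, 0.45, 0.90], "info": "Alice Chen, CEO of Acme Corp."},
        {"bbox": [0.50, 0.25, 0.85, 0.95], "info": "Bob Park, head of research."},
    ],
)
print(prompt)
```

Refer to the paper and repository for the exact format used during RCVIT training.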

Requirements

Refer to Qwen2-VL for the full requirements. The code for Qwen2-VL is included in the latest Hugging Face Transformers, and we advise you to build from source with pip install git+https://github.com/huggingface/transformers; otherwise you may encounter the following error:

KeyError: 'qwen2_vl'

Quickstart:

We offer a toolkit to help you handle various types of visual input more conveniently, including base64-encoded data, URLs, and interleaved images and videos. You can install it with the following command:

pip install qwen-vl-utils
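
The snippet below is a minimal inference sketch following the standard Qwen2-VL quickstart, pointed at this checkpoint. It assumes the checkpoint ships its own processor configuration (if not, load the processor from Qwen/Qwen2-VL-2B-Instruct instead); the image URL and question are placeholders.

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load this checkpoint; assumes a processor config is bundled with it.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "weihongliang/RC-Qwen2VL-2b", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("weihongliang/RC-Qwen2VL-2b")

# Placeholder image and question; qwen-vl-utils also accepts local paths
# and base64-encoded images.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate and strip the prompt tokens from the output.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True
)
print(output_text)
```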