RC-Qwen2VL-2B

πŸ“‘ arXiv | πŸ€— Checkpoint | πŸ“ Dataset | GitHub | πŸš€ Personalized Conversation

Introduction

RC-Qwen2-VL is a model focused on Region-level Context-aware Multimodal Understanding (RCMU).

  • Objective: To solve the key challenge of integrating the visual content of specific objects in an image with their associated text for comprehensive understanding.
  • Method: Trained with the RCVIT (Region-level Context-aware Visual Instruction Tuning) method on the RCMU dataset, it uses bounding boxes to precisely link the visual content of objects with their associated text (see the sketch after this list).
  • Performance & Applications: It achieves outstanding performance on RCMU tasks and is successfully applied in advanced scenarios like multimodal RAG and personalized conversation.
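
The model's actual region-referencing format is defined in the paper and GitHub repository. As a rough illustration only, the sketch below shows one plausible way to serialize per-object bounding boxes and their associated text into a single prompt; the tag format, the normalized-coordinate convention, and the helper function are illustrative assumptions, not RC-Qwen2VL's real interface.

```python
# Hypothetical sketch: serializing region-level context into a prompt.
# The <bbox> tag format and normalized [0, 1] coordinates are
# illustrative assumptions, NOT RC-Qwen2VL's actual prompt format.

def build_region_context_prompt(question, objects):
    """objects: list of dicts with 'bbox' ([x1, y1, x2, y2], normalized
    to [0, 1]) and 'info' (free-form text tied to that region)."""
    lines = []
    for i, obj in enumerate(objects):
        x1, y1, x2, y2 = obj["bbox"]
        lines.append(
            f"Object {i}: <bbox>({x1:.2f},{y1:.2f}),({x2:.2f},{y2:.2f})</bbox> {obj['info']}"
        )
    return "\n".join(lines) + "\n" + question

prompt = build_region_context_prompt(
    "Who is shaking hands with the CEO?",
    [
        {"bbox": [0.10, 0.20, 0.45, 0.90], "info": "Alice Chen, CEO of Acme Corp."},
        {"bbox": [0.50, 0.25, 0.85, 0.95], "info": "Bob Park, head of research."},
    ],
)
print(prompt)
```

Refer to the paper and repository for the exact format used during RCVIT training.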

Requirements

Refer to Qwen2-VL for the full requirements. The code for Qwen2-VL is included in the latest Hugging Face Transformers, and we advise you to build from source with pip install git+https://github.com/huggingface/transformers; otherwise you may encounter the following error:

KeyError: 'qwen2_vl'

Quickstart:

We offer a toolkit to help you handle various types of visual input more conveniently, including base64-encoded data, URLs, and interleaved images and videos. You can install it with the following command:

pip install qwen-vl-utils
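
The snippet below is a minimal inference sketch following the standard Qwen2-VL quickstart, pointed at this checkpoint. It assumes the checkpoint ships its own processor configuration (if not, load the processor from Qwen/Qwen2-VL-2B-Instruct instead); the image URL and question are placeholders.

```python
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load this checkpoint; assumes a processor config is bundled with it.
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "weihongliang/RC-Qwen2VL-2b", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("weihongliang/RC-Qwen2VL-2b")

# Placeholder image and question; qwen-vl-utils also accepts local paths
# and base64-encoded images.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate and strip the prompt tokens from the output.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):]
    for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True
)
print(output_text)
```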