---
base_model:
- Qwen/Qwen2-VL-2B-Instruct
datasets:
- rp-yu/VPT_Datasets
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
pipeline_tag: image-text-to-text
---
# Introducing Visual Perception Token into Multimodal Large Language Model
This repository contains models from the paper *Introducing Visual Perception Token into Multimodal Large Language Model*. These models use Visual Perception Tokens to enhance the visual perception capabilities of multimodal large language models (MLLMs).
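Since the checkpoints are built on Qwen2-VL-2B-Instruct and tagged for the `image-text-to-text` pipeline, they can likely be loaded through the standard Qwen2-VL path in `transformers`. The sketch below is a minimal, unofficial example under that assumption: the repository ID, image path, and prompt are placeholders, and any VPT-specific preprocessing or generation steps should be confirmed against the paper's official code.

```python
# Minimal sketch, assuming the model loads like a standard Qwen2-VL checkpoint.
# Replace MODEL_ID with the actual repository ID of this model.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2-VL-2B-Instruct"  # placeholder: substitute the VPT checkpoint

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Build a single-image chat prompt.
image = Image.open("example.jpg")  # placeholder image path
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=128)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
```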