---
base_model:
- Qwen/Qwen2-VL-2B-Instruct
datasets:
- rp-yu/VPT_Datasets
language:
- en
library_name: transformers
license: apache-2.0
metrics:
- accuracy
pipeline_tag: image-text-to-text
---

# Introducing Visual Perception Token into Multimodal Large Language Model

This repository contains models based on the paper [Introducing Visual Perception Token into Multimodal Large Language Model](https://arxiv.org/abs/2502.17425). These models use Visual Perception Tokens to enhance the visual perception capabilities of multimodal large language models (MLLMs).

Code: https://github.com/yu-rp/VisualPerceptionToken
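## Usage

Below is a minimal inference sketch, assuming the checkpoint loads through the standard Qwen2-VL classes in `transformers` (the loading path of the base model listed above). The model id, image URL, and prompt are placeholders, not values from this repository; the Visual Perception Token control flow itself may require the project code linked above rather than plain `generate`.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

# Hypothetical placeholder id: replace with the actual checkpoint id of this repo.
model_id = "rp-yu/Qwen2-VL-2B-VPT"

model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_id)

# Build a chat-style prompt with one image placeholder and one question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Example image URL is a placeholder; use any local or remote image.
image = Image.open(
    requests.get("https://example.com/demo.jpg", stream=True).raw
)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
generated = model.generate(**inputs, max_new_tokens=128)
trimmed = generated[:, inputs.input_ids.shape[1]:]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```

For the full two-pass inference that actually emits and acts on Visual Perception Tokens, see the scripts in the GitHub repository linked above.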