Add project page, Github repo and paper

This PR improves the model card by adding links to:
- the project page: https://chat.qwenlm.ai
- the Github repository: https://github.com/QwenLM/Qwen2.5-VL
- the paper: https://hf.co/papers/2502.13923

README.md (CHANGED)

---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
language:
- en
library_name: transformers
license_name: qwen-research
license_link: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/blob/main/LICENSE
pipeline_tag: image-text-to-text
tags:
- multimodal
---

# Qwen2.5-VL-3B-Instruct

<img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
</a>

Official Repo: https://github.com/QwenLM/Qwen2.5-VL

This model is presented in the paper [Qwen2.5-VL Technical Report](https://huggingface.co/papers/2502.13923).

## Introduction

In the past five months since Qwen2-VL’s release, numerous developers have built new models on the Qwen2-VL vision-language models, providing us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.

We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:

```bash
# It's highly recommended to use the `[decord]` feature for faster video loading.
pip install qwen-vl-utils[decord]==0.0.8
```

If you are not using Linux, you might not be able to install `decord` from PyPI.
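
To illustrate what these input types look like in practice, here is a minimal sketch (the URL, file path, and base64 string are placeholders) of a message that interleaves them and of how `process_vision_info` gathers the visual inputs for the processor:

```python
from qwen_vl_utils import process_vision_info

# Placeholder inputs: any mix of URL, local file path, or base64 data URI can be used.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpeg"},    # URL
            {"type": "image", "image": "file:///path/to/your/image.png"},   # local file
            {"type": "image", "image": "data:image;base64,/9j/..."},        # base64
            {"type": "text", "text": "Describe these images."},
        ],
    }
]

# Collect the visual inputs in the format expected by the processor.
image_inputs, video_inputs = process_vision_info(messages)
```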

Here we show a code snippet demonstrating how to use the chat model with `transformers` and `qwen_vl_utils`:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
```

```python
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
```
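
A typical continuation after the template step looks like the sketch below; the generation arguments are illustrative, and `model`, `processor`, and `messages` are assumed to be defined as in the parts of the snippet not shown here:

```python
# Gather the vision inputs and build the model inputs.
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate, then strip the prompt tokens before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```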

### 🤖 ModelScope

We strongly advise users, especially those in mainland China, to use ModelScope. `snapshot_download` can help you solve issues concerning downloading checkpoints.
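
As a minimal sketch (assuming the `modelscope` package is installed), checkpoints can be fetched once with `snapshot_download` and then loaded from the local directory:

```python
from modelscope import snapshot_download

# Download the checkpoint from ModelScope and get the local directory path.
model_dir = snapshot_download("Qwen/Qwen2.5-VL-3B-Instruct")
print(model_dir)  # pass this path to from_pretrained(...) instead of the Hub name
```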

### More Usage Tips

For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
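
For example, a video is passed as a local file path in the message content; the snippet below is an illustrative sketch (the path is a placeholder, and the optional `fps` field is an assumption about the sampling rate accepted by `qwen-vl-utils`):

```python
# Placeholder path; videos are read from local files.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video.mp4", "fps": 1.0},
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
```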

The model supports a wide range of resolution inputs. The minimum and maximum number of pixels can be set on the processor:

```python
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
```

Besides, we provide two methods for fine-grained control over the image size input to the model:

1. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.

2. Define `min_pixels` and `max_pixels`: Images will be resized to maintain their aspect ratio within the range of `min_pixels` and `max_pixels`.

```python
# resized_height and resized_width
messages = [
    {
        "role": "user",
        # ...
    }
]
# min_pixels and max_pixels
messages = [
    {
        "role": "user",
        # ...
    }
]
```
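
For illustration, these per-image options are passed inside the image entry of the message content; the sketch below uses placeholder file paths and example values for the two methods above:

```python
# Method 1: exact dimensions (rounded to multiples of 28).
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",  # placeholder path
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Method 2: bound the pixel budget while keeping the aspect ratio.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",  # placeholder path
                "min_pixels": 50176,
                "max_pixels": 50176,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```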

#### Add ids for Multiple Image Inputs

By default, images and video content are directly included in the conversation. When handling multiple images, it's helpful to add labels to the images and videos for better reference. Users can control this behavior with the following settings:

<details>
<summary>Add vision ids</summary>

```python
conversation = [
    {
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": "Hello, how are you?"}],
    },
    {
        "role": "assistant",
        "content": "I'm doing well, thank you for asking. How can I assist you today?",
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Can you describe these images and video?"},
            {"type": "image"},
            {"type": "image"},
            {"type": "video"},
            {"type": "text", "text": "These are from my vacation."},
        ],
    },
    {
        "role": "assistant",
        "content": "I'd be happy to describe the images and video for you. Could you please provide more context about your vacation?",
    },
    {
        "role": "user",
        "content": "It was a trip to the mountains. Can you see the details in the images and video?",
    },
]

# default:
prompt_without_id = processor.apply_chat_template(
    conversation, add_generation_prompt=True
)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Hello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing well, thank you for asking. How can I assist you today?<|im_end|>\n<|im_start|>user\nCan you describe these images and video?<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|><|vision_start|><|video_pad|><|vision_end|>These are from my vacation.<|im_end|>\n<|im_start|>assistant\nI'd be happy to describe the images and video for you. Could you please provide more context about your vacation?<|im_end|>\n<|im_start|>user\nIt was a trip to the mountains. Can you see the details in the images and video?<|im_end|>\n<|im_start|>assistant\n'


# add ids
prompt_with_id = processor.apply_chat_template(
    conversation, add_generation_prompt=True, add_vision_id=True
)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nPicture 1: <|vision_start|><|image_pad|><|vision_end|>Hello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing well, thank you for asking. How can I assist you today?<|im_end|>\n<|im_start|>user\nCan you describe these images and video?Picture 2: <|vision_start|><|image_pad|><|vision_end|>Picture 3: <|vision_start|><|image_pad|><|vision_end|>Video 1: <|vision_start|><|video_pad|><|vision_end|>These are from my vacation.<|im_end|>\n<|im_start|>assistant\nI'd be happy to describe the images and video for you. Could you please provide more context about your vacation?<|im_end|>\n<|im_start|>user\nIt was a trip to the mountains. Can you see the details in the images and video?<|im_end|>\n<|im_start|>assistant\n'
```
</details>

#### Flash-Attention 2 to speed up generation

First, make sure to install the latest version of Flash Attention 2:

```bash
pip install -U flash-attn --no-build-isolation
```

Also, you should have hardware that is compatible with FlashAttention-2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.

To load and run a model using FlashAttention-2, simply add `attn_implementation="flash_attention_2"` when loading the model as follows:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```

### Processing Long Texts

The current `config.json` is set for context length up to 32,768 tokens. For supported frameworks, longer inputs can be enabled via rope scaling (YaRN) in `config.json`.

However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended for use.

At the same time, for long video inputs, since MRoPE itself is more economical with position ids, `max_position_embeddings` can be directly modified to a larger value, such as 64k.
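
A minimal sketch of one way to apply such an override without hand-editing `config.json` (an illustration and an assumption, not part of the original card; the ~64k value is an example):

```python
from transformers import AutoConfig, Qwen2_5_VLForConditionalGeneration

# Illustrative: raise the maximum position embeddings for long-video inputs (~64k).
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
config.max_position_embeddings = 65536

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```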