Add project page, Github repo and paper

This PR improves the model card by adding links to:
- the project page: https://chat.qwenlm.ai
- the Github repository: https://github.com/QwenLM/Qwen2.5-VL
- the paper: https://hf.co/papers/2502.13923

README.md (CHANGED)

---
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
language:
- en
library_name: transformers
license_name: qwen-research
license_link: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/blob/main/LICENSE
pipeline_tag: image-text-to-text
tags:
- multimodal
---

# Qwen2.5-VL-3B-Instruct

<img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
</a>

Official Repo: https://github.com/QwenLM/Qwen2.5-VL

This model is presented in the paper [Qwen2.5-VL Technical Report](https://huggingface.co/papers/2502.13923).

## Introduction

In the past five months since Qwen2-VL’s release, numerous developers have built new models on the Qwen2-VL vision-language models, providing us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.

We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:

```bash
# It's highly recommended to use the `[decord]` feature for faster video loading.
pip install qwen-vl-utils[decord]==0.0.8
```

If you are not using Linux, you might not be able to install `decord` from PyPI.
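
To illustrate what these input types look like in practice, here is a minimal sketch (the URL, file path, and base64 string are placeholders) of a message that interleaves them and of how `process_vision_info` gathers the visual inputs for the processor:

```python
from qwen_vl_utils import process_vision_info

# Placeholder inputs: any mix of URL, local file path, or base64 data URI can be used.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpeg"},    # URL
            {"type": "image", "image": "file:///path/to/your/image.png"},   # local file
            {"type": "image", "image": "data:image;base64,/9j/..."},        # base64
            {"type": "text", "text": "Describe these images."},
        ],
    }
]

# Collect the visual inputs in the format expected by the processor.
image_inputs, video_inputs = process_vision_info(messages)
```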

Here we show a code snippet demonstrating how to use the chat model with `transformers` and `qwen_vl_utils`:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# default: Load the model on the available device(s)
```

```python
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
```
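
A typical continuation after the template step looks like the sketch below; the generation arguments are illustrative, and `model`, `processor`, and `messages` are assumed to be defined as in the parts of the snippet not shown here:

```python
# Gather the vision inputs and build the model inputs.
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Inference: generate, then strip the prompt tokens before decoding.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```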

### 🤖 ModelScope

We strongly advise users, especially those in mainland China, to use ModelScope. `snapshot_download` can help you solve issues concerning downloading checkpoints.
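
As a minimal sketch (assuming the `modelscope` package is installed), checkpoints can be fetched once with `snapshot_download` and then loaded from the local directory:

```python
from modelscope import snapshot_download

# Download the checkpoint from ModelScope and get the local directory path.
model_dir = snapshot_download("Qwen/Qwen2.5-VL-3B-Instruct")
print(model_dir)  # pass this path to from_pretrained(...) instead of the Hub name
```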

### More Usage Tips

For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
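
For example, a video is passed as a local file path in the message content; the snippet below is an illustrative sketch (the path is a placeholder, and the optional `fps` field is an assumption about the sampling rate accepted by `qwen-vl-utils`):

```python
# Placeholder path; videos are read from local files.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video.mp4", "fps": 1.0},
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
```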

The model supports a wide range of resolution inputs. The minimum and maximum number of pixels can be set on the processor:

```python
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
```

Besides, we provide two methods for fine-grained control over the image size input to the model:

1. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.

2. Define `min_pixels` and `max_pixels`: Images will be resized to maintain their aspect ratio within the range of `min_pixels` and `max_pixels`.

```python
# resized_height and resized_width
messages = [
    {
        "role": "user",
        # ...
    }
]
# min_pixels and max_pixels
messages = [
    {
        "role": "user",
        # ...
    }
]
```
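
For illustration, these per-image options are passed inside the image entry of the message content; the sketch below uses placeholder file paths and example values for the two methods above:

```python
# Method 1: exact dimensions (rounded to multiples of 28).
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",  # placeholder path
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Method 2: bound the pixel budget while keeping the aspect ratio.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",  # placeholder path
                "min_pixels": 50176,
                "max_pixels": 50176,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```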

#### Add ids for Multiple Image Inputs

By default, images and video content are directly included in the conversation. When handling multiple images, it's helpful to add labels to the images and videos for better reference. Users can control this behavior with the following settings:

<details>
<summary>Add vision ids</summary>

```python
conversation = [
    {
        "role": "user",
        "content": [{"type": "image"}, {"type": "text", "text": "Hello, how are you?"}],
    },
    {
        "role": "assistant",
        "content": "I'm doing well, thank you for asking. How can I assist you today?",
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Can you describe these images and video?"},
            {"type": "image"},
            {"type": "image"},
            {"type": "video"},
            {"type": "text", "text": "These are from my vacation."},
        ],
    },
    {
        "role": "assistant",
        "content": "I'd be happy to describe the images and video for you. Could you please provide more context about your vacation?",
    },
    {
        "role": "user",
        "content": "It was a trip to the mountains. Can you see the details in the images and video?",
    },
]

# default:
prompt_without_id = processor.apply_chat_template(
    conversation, add_generation_prompt=True
)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Hello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing well, thank you for asking. How can I assist you today?<|im_end|>\n<|im_start|>user\nCan you describe these images and video?<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|><|vision_start|><|video_pad|><|vision_end|>These are from my vacation.<|im_end|>\n<|im_start|>assistant\nI'd be happy to describe the images and video for you. Could you please provide more context about your vacation?<|im_end|>\n<|im_start|>user\nIt was a trip to the mountains. Can you see the details in the images and video?<|im_end|>\n<|im_start|>assistant\n'


# add ids
prompt_with_id = processor.apply_chat_template(
    conversation, add_generation_prompt=True, add_vision_id=True
)
# Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nPicture 1: <|vision_start|><|image_pad|><|vision_end|>Hello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing well, thank you for asking. How can I assist you today?<|im_end|>\n<|im_start|>user\nCan you describe these images and video?Picture 2: <|vision_start|><|image_pad|><|vision_end|>Picture 3: <|vision_start|><|image_pad|><|vision_end|>Video 1: <|vision_start|><|video_pad|><|vision_end|>These are from my vacation.<|im_end|>\n<|im_start|>assistant\nI'd be happy to describe the images and video for you. Could you please provide more context about your vacation?<|im_end|>\n<|im_start|>user\nIt was a trip to the mountains. Can you see the details in the images and video?<|im_end|>\n<|im_start|>assistant\n'
```
</details>

#### Flash-Attention 2 to speed up generation

First, make sure to install the latest version of Flash Attention 2:

```bash
pip install -U flash-attn --no-build-isolation
```

Also, you should have hardware that is compatible with FlashAttention-2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.

To load and run a model using FlashAttention-2, simply add `attn_implementation="flash_attention_2"` when loading the model as follows:

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```

### Processing Long Texts

The current `config.json` is set for context length up to 32,768 tokens. For supported frameworks, longer inputs can be enabled via rope scaling (YaRN) in `config.json`.

However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended for use.

At the same time, for long video inputs, since MRoPE itself is more economical with position ids, `max_position_embeddings` can be directly modified to a larger value, such as 64k.
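
A minimal sketch of one way to apply such an override without hand-editing `config.json` (an illustration and an assumption, not part of the original card; the ~64k value is an example):

```python
from transformers import AutoConfig, Qwen2_5_VLForConditionalGeneration

# Illustrative: raise the maximum position embeddings for long-video inputs (~64k).
config = AutoConfig.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")
config.max_position_embeddings = 65536

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct",
    config=config,
    torch_dtype="auto",
    device_map="auto",
)
```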