nielsr HF Staff committed on
Commit fac9dbf · verified · 1 Parent(s): 1b989f2

Add project page, Github repo and paper


This PR improves the model card by adding links to:

- the project page: https://chat.qwenlm.ai
- the Github repository: https://github.com/QwenLM/Qwen2.5-VL
- the paper: https://hf.co/papers/2502.13923

Files changed (1)
1. README.md +90 -47
README.md CHANGED
@@ -1,15 +1,14 @@
-
  ---
- license_name: qwen-research
- license_link: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/blob/main/LICENSE
  language:
  - en
  pipeline_tag: image-text-to-text
  tags:
  - multimodal
- library_name: transformers
- base_model:
- - Qwen/Qwen2.5-VL-3B-Instruct
  ---

  # Qwen2.5-VL-3B-Instruct
@@ -17,6 +16,10 @@ base_model:
  <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
  </a>

  ## Introduction

  In the past five months since Qwen2-VL’s release, numerous developers have built new models on top of the Qwen2-VL vision-language models, providing us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.
@@ -125,7 +128,7 @@ KeyError: 'qwen2_5_vl'
  We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:

  ```bash
- # It's highly recommanded to use `[decord]` feature for faster video loading.
  pip install qwen-vl-utils[decord]==0.0.8
  ```
 
@@ -136,7 +139,7 @@ If you are not using Linux, you might not be able to install `decord` from PyPI.
  Here is a code snippet showing how to use the chat model with `transformers` and `qwen_vl_utils`:

  ```python
- from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
  from qwen_vl_utils import process_vision_info

  # default: Load the model on the available device(s)
@@ -293,7 +296,6 @@ messages = [
      }
  ]

- #In Qwen 2.5 VL, frame rate information is also input into the model to align with absolute time.
  # Preparation for inference
  text = processor.apply_chat_template(
      messages, tokenize=False, add_generation_prompt=True
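For illustration, here is a sketch of a video message that carries frame-rate information alongside the frames, in the message format used by `qwen_vl_utils`; the file path and `fps` value are assumptions, not values taken from this diff:

```python
# A video message; `fps` conveys the sampling frame rate so the model can
# reason about absolute time (illustrative values).
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video.mp4",
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]
```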
@@ -382,7 +384,6 @@ print(output_texts)
  ### 🤖 ModelScope
  We strongly advise users, especially those in mainland China, to use ModelScope. `snapshot_download` can help you solve issues concerning downloading checkpoints.

-
  ### More Usage Tips

  For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
@@ -428,18 +429,18 @@ The model supports a wide range of resolution inputs. By default, it uses the na
  min_pixels = 256 * 28 * 28
  max_pixels = 1280 * 28 * 28
  processor = AutoProcessor.from_pretrained(
-     "Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
  )
  ```

  In addition, we provide two methods for fine-grained control over the image size input to the model:

- 1. Define min_pixels and max_pixels: Images will be resized to maintain their aspect ratio within the range of min_pixels and max_pixels.
-
- 2. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.

  ```python
- # min_pixels and max_pixels
  messages = [
      {
          "role": "user",

@@ -454,7 +455,7 @@ messages = [
          ],
      }
  ]
- # resized_height and resized_width
  messages = [
      {
          "role": "user",

@@ -471,6 +472,78 @@ messages = [
  ]
  ```

  ### Processing Long Texts

  The current `config.json` is set for context length up to 32,768 tokens.
@@ -494,34 +567,4 @@ For supported frameworks, you could add the following to `config.json` to enable

  However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended for use.

- At the same time, for long video inputs, since MRoPE itself is more economical with ids, the max_position_embeddings can be directly modified to a larger value, such as 64k.
-
-
-
- ## Citation
-
- If you find our work helpful, feel free to give us a cite.
-
- ```
- @misc{qwen2.5-VL,
-     title = {Qwen2.5-VL},
-     url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
-     author = {Qwen Team},
-     month = {January},
-     year = {2025}
- }
-
- @article{Qwen2VL,
-     title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
-     author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
-     journal={arXiv preprint arXiv:2409.12191},
-     year={2024}
- }
-
- @article{Qwen-VL,
-     title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
-     author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
-     journal={arXiv preprint arXiv:2308.12966},
-     year={2023}
- }
- ```

  ---
+ base_model:
+ - Qwen/Qwen2.5-VL-3B-Instruct
  language:
  - en
+ library_name: transformers
+ license_name: qwen-research
+ license_link: https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct/blob/main/LICENSE
  pipeline_tag: image-text-to-text
  tags:
  - multimodal
  ---

  # Qwen2.5-VL-3B-Instruct

  <img alt="Chat" src="https://img.shields.io/badge/%F0%9F%92%9C%EF%B8%8F%20Qwen%20Chat%20-536af5" style="display: inline-block; vertical-align: middle;"/>
  </a>

+ Official Repo: https://github.com/QwenLM/Qwen2.5-VL
+
+ This model is presented in the paper [Qwen2.5-VL Technical Report](https://huggingface.co/papers/2502.13923).
+
  ## Introduction

  In the past five months since Qwen2-VL’s release, numerous developers have built new models on top of the Qwen2-VL vision-language models, providing us with valuable feedback. During this period, we focused on building more useful vision-language models. Today, we are excited to introduce the latest addition to the Qwen family: Qwen2.5-VL.
 
  We offer a toolkit to help you handle various types of visual input more conveniently, as if you were using an API. This includes base64, URLs, and interleaved images and videos. You can install it using the following command:

  ```bash
+ # It's highly recommended to use the `[decord]` feature for faster video loading.
  pip install qwen-vl-utils[decord]==0.0.8
  ```

  Here is a code snippet showing how to use the chat model with `transformers` and `qwen_vl_utils`:

  ```python
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
  from qwen_vl_utils import process_vision_info

  # default: Load the model on the available device(s)
 
      }
  ]

  # Preparation for inference
  text = processor.apply_chat_template(
      messages, tokenize=False, add_generation_prompt=True
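For orientation, here is a minimal end-to-end sketch of how `Qwen2_5_VLForConditionalGeneration`, `AutoProcessor`, and `process_vision_info` are typically combined; the model ID, prompt, image URL, and generation settings below are illustrative assumptions, not values taken from this diff:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load model and processor (illustrative model ID and settings).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# A single-image chat message (the image URL is a placeholder).
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/demo.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the text prompt, collect vision inputs, then run generation.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt"
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```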
 
  ### 🤖 ModelScope
  We strongly advise users, especially those in mainland China, to use ModelScope. `snapshot_download` can help you solve issues concerning downloading checkpoints.
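For reference, a minimal sketch of fetching the checkpoint with ModelScope's `snapshot_download` and pointing `transformers` at the local directory; this assumes the `modelscope` package is installed (`pip install modelscope`) and that the model ID on ModelScope mirrors the Hugging Face one:

```python
from modelscope import snapshot_download
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor

# Download the checkpoint from ModelScope instead of the Hugging Face Hub.
model_dir = snapshot_download("Qwen/Qwen2.5-VL-3B-Instruct")

# Load from the local snapshot directory as usual.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_dir)
```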

  ### More Usage Tips

  For input images, we support local files, base64, and URLs. For videos, we currently only support local files.
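For illustration, a sketch of the three image input forms inside a message; the local path, URL, and base64 payload are placeholders:

```python
# Three ways to pass an image in a message (placeholder values).
local_image = {"type": "image", "image": "file:///path/to/your/image.jpg"}
remote_image = {"type": "image", "image": "http://example.com/your/image.jpg"}
base64_image = {"type": "image", "image": "data:image;base64,/9j/..."}

# Videos: currently local files only, e.g.
# {"type": "video", "video": "file:///path/to/video.mp4"}

messages = [
    {
        "role": "user",
        "content": [
            local_image,  # or remote_image / base64_image
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
```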
 
  min_pixels = 256 * 28 * 28
  max_pixels = 1280 * 28 * 28
  processor = AutoProcessor.from_pretrained(
+     "Qwen/Qwen2.5-VL-7B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
  )
  ```

  In addition, we provide two methods for fine-grained control over the image size input to the model:

+ 1. Specify exact dimensions: Directly set `resized_height` and `resized_width`. These values will be rounded to the nearest multiple of 28.
+
+ 2. Define min_pixels and max_pixels: Images will be resized to maintain their aspect ratio within the range of min_pixels and max_pixels.

  ```python
+ # resized_height and resized_width
  messages = [
      {
          "role": "user",

          ],
      }
  ]
+ # min_pixels and max_pixels
  messages = [
      {
          "role": "user",

  ]
  ```
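For illustration, a sketch of how the two controls attach to an image entry inside `content`; the file paths, dimensions, and pixel budgets are example values:

```python
# Example values only; both controls attach to the image entry in "content".
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                # Option 1: exact size (rounded to multiples of 28).
                "resized_height": 280,
                "resized_width": 420,
            },
            {
                "type": "image",
                "image": "file:///path/to/another/image.jpg",
                # Option 2: bound the pixel budget, keeping the aspect ratio.
                "min_pixels": 64 * 28 * 28,
                "max_pixels": 256 * 28 * 28,
            },
            {"type": "text", "text": "Compare these two images."},
        ],
    }
]
```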

+ #### Add ids for Multiple Image Inputs
+ By default, images and video content are directly included in the conversation. When handling multiple images, it's helpful to add labels to the images and videos for better reference. Users can control this behavior with the following settings:
+ <details>
+ <summary>Add vision ids</summary>
+
+ ```python
+ conversation = [
+     {
+         "role": "user",
+         "content": [{"type": "image"}, {"type": "text", "text": "Hello, how are you?"}],
+     },
+     {
+         "role": "assistant",
+         "content": "I'm doing well, thank you for asking. How can I assist you today?",
+     },
+     {
+         "role": "user",
+         "content": [
+             {"type": "text", "text": "Can you describe these images and video?"},
+             {"type": "image"},
+             {"type": "image"},
+             {"type": "video"},
+             {"type": "text", "text": "These are from my vacation."},
+         ],
+     },
+     {
+         "role": "assistant",
+         "content": "I'd be happy to describe the images and video for you. Could you please provide more context about your vacation?",
+     },
+     {
+         "role": "user",
+         "content": "It was a trip to the mountains. Can you see the details in the images and video?",
+     },
+ ]
+
+ # default:
+ prompt_without_id = processor.apply_chat_template(
+     conversation, add_generation_prompt=True
+ )
+ # Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|vision_end|>Hello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing well, thank you for asking. How can I assist you today?<|im_end|>\n<|im_start|>user\nCan you describe these images and video?<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|><|vision_start|><|video_pad|><|vision_end|>These are from my vacation.<|im_end|>\n<|im_start|>assistant\nI'd be happy to describe the images and video for you. Could you please provide more context about your vacation?<|im_end|>\n<|im_start|>user\nIt was a trip to the mountains. Can you see the details in the images and video?<|im_end|>\n<|im_start|>assistant\n'
+
+ # add ids
+ prompt_with_id = processor.apply_chat_template(
+     conversation, add_generation_prompt=True, add_vision_id=True
+ )
+ # Expected output: '<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nPicture 1: <|vision_start|><|image_pad|><|vision_end|>Hello, how are you?<|im_end|>\n<|im_start|>assistant\nI'm doing well, thank you for asking. How can I assist you today?<|im_end|>\n<|im_start|>user\nCan you describe these images and video?Picture 2: <|vision_start|><|image_pad|><|vision_end|>Picture 3: <|vision_start|><|image_pad|><|vision_end|>Video 1: <|vision_start|><|video_pad|><|vision_end|>These are from my vacation.<|im_end|>\n<|im_start|>assistant\nI'd be happy to describe the images and video for you. Could you please provide more context about your vacation?<|im_end|>\n<|im_start|>user\nIt was a trip to the mountains. Can you see the details in the images and video?<|im_end|>\n<|im_start|>assistant\n'
+ ```
+ </details>
+
+ #### Flash-Attention 2 to speed up generation
+
+ First, make sure to install the latest version of Flash Attention 2:
+
+ ```bash
+ pip install -U flash-attn --no-build-isolation
+ ```
+
+ Also, your hardware should be compatible with Flash-Attention 2. Read more about it in the official documentation of the [flash attention repository](https://github.com/Dao-AILab/flash-attention). FlashAttention-2 can only be used when a model is loaded in `torch.float16` or `torch.bfloat16`.
+
+ To load and run a model using Flash Attention-2, simply add `attn_implementation="flash_attention_2"` when loading the model as follows:
+
+ ```python
+ import torch
+ from transformers import Qwen2_5_VLForConditionalGeneration
+
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     "Qwen/Qwen2.5-VL-7B-Instruct",
+     torch_dtype=torch.bfloat16,
+     attn_implementation="flash_attention_2",
+ )
+ ```
+
  ### Processing Long Texts

  The current `config.json` is set for context length up to 32,768 tokens.

  However, it should be noted that this method has a significant impact on the performance of temporal and spatial localization tasks, and is therefore not recommended for use.

+ At the same time, for long video inputs, since MRoPE itself is more economical with ids, the max_position_embeddings can be directly modified to a larger value, such as 64k.
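As a sketch of the `max_position_embeddings` adjustment mentioned above: the checkpoint path and the 64k target value are illustrative assumptions, and the YaRN-specific keys referenced earlier are not reproduced here (take them from the full model card):

```python
import json

# Raise the context window for long-video inputs by editing config.json in a
# local copy of the checkpoint (path and target value are examples).
config_path = "path/to/Qwen2.5-VL-3B-Instruct/config.json"
with open(config_path) as f:
    config = json.load(f)

config["max_position_embeddings"] = 65536  # "64k", up from 32768

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
```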