# Qwen2.5-VL-7B-Instruct

This version of Qwen2.5-VL-7B-Instruct has been converted to run on the Axera NPU using w8a16 quantization.


Compatible with Pulsar2 version: 3.4

Convert tools links:

- If you are interested in model conversion, you can try to export the axmodel from the original repo: https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct
- Pulsar2 Link, How to Convert LLM from Huggingface to axmodel
- AXera NPU HOST LLM Runtime
- AXera NPU AXCL LLM Runtime

## Supported Platforms

### Image Process

| Chips | input size | image num | image encoder | ttft (320 tokens) | w8a16 | DDR | Flash |
|-------|------------|-----------|---------------|-------------------|-------|-----|-------|
| AX650 | 448*448 | 1 | 760 ms | 3500 ms | 2.0 tokens/sec | 10.0 GiB | 9.8 GiB |

### Video Process

| Chips | input size | image num | image encoder | ttft (512 tokens) | w8a16 | DDR | Flash |
|-------|------------|-----------|---------------|-------------------|-------|-----|-------|
| AX650 | 308*308 | 8 | 1500 ms | 5080 ms | 2.0 tokens/sec | 10.0 GiB | 9.8 GiB |

The DDR column refers to the CMM memory that will be consumed. Make sure the CMM memory allocation on the development board is larger than this value.
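As a back-of-envelope check on the numbers above, total response latency is roughly image-encode time plus ttft plus decode time. This is only a sketch, assuming ttft does not already include the image-encode stage:

```python
# Rough latency estimate from the AX650 image-process row above.
# Assumption: ttft does NOT already include the image-encode stage.
IMAGE_ENCODE_MS = 760.0      # image encoder, one 448*448 image
TTFT_MS = 3500.0             # time to first token (320-token prompt)
DECODE_TOK_PER_S = 2.0       # w8a16 decode throughput

def estimate_latency_s(new_tokens: int) -> float:
    """Seconds from request to last token when generating `new_tokens` tokens."""
    return (IMAGE_ENCODE_MS + TTFT_MS) / 1000.0 + new_tokens / DECODE_TOK_PER_S

print(estimate_latency_s(100))  # a 100-token answer takes roughly a minute
```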

## How to use

Download all files from this repository to the device.

If you are using an AX650 board:

```
(base) axera@dell:~/lhj/Qwen2.5-VL-7B-Instruct$ tree -L 2
.
├── images
├── main_axcl_x86
├── post_config.json
├── Qwen2.5-VL-7B-Instruct-AX650-chunk_prefill_1280
│   ├── model.embed_tokens.weight.bfloat16.bin
│   ├── Qwen2.5-VL-7B-Instruct_vision.axmodel
│   ├── qwen2_5_vl_p128_l0_together.axmodel
......
│   └── qwen2_5_vl_post.axmodel
├── qwen2_5_vl_7b_tokenizer
├── qwen2_tokenizer_images.py
├── qwen2_tokenizer_video_308.py
├── README.md
├── run_qwen2_5vl_image.sh
├── run_qwen2_5vl_video.sh
└── video
```

### Prepare the tokenizer server

Install transformers:

```
pip install transformers==4.41.1 jinja2
```

## Demo Run

### Image understanding demo

Start the tokenizer server for the image understanding demo:

```
python3 qwen2_tokenizer_images.py --port 12345
```

Then run the image understanding demo:

- input text: `What are these attractions? Please give their names in Chinese and English`
- input image:

```
(base) axera@dell:~/lhj/Qwen2.5-VL-7B-Instruct$ bash run_qwen2_5vl_image.sh 
[I][                            Init][ 162]: LLM init start

[I][                            Init][ 267]: IMAGE_CONTEXT_TOKEN: 151655, IMAGE_START_TOKEN: 151652
[I][                            Init][ 328]: image encoder output float32
[I][                            Init][ 456]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> What are these attractions? Please give their names in Chinese and English
image >> images/attractions
images/attractions/recoAll_attractions_1.jpg
images/attractions/recoAll_attractions_2.jpg
images/attractions/recoAll_attractions_3.jpg
images/attractions/recoAll_attractions_4.jpg
[I][                          Encode][ 552]: image encode time : 3014.224121 ms, size : 4
[I][                          Encode][ 594]: input_ids size:1064
[I][                          Encode][ 602]: offset 15
[I][                          Encode][ 602]: offset 273
[I][                          Encode][ 602]: offset 531
[I][                          Encode][ 602]: offset 789
[I][                          Encode][ 624]: out_embed size:3813376
[I][                          Encode][ 626]: position_ids size:7982
[I][                             Run][ 645]: input token num : 1064, prefill_split_num : 9
[I][                             Run][ 679]: input_num_token:128
[I][                             Run][ 679]: input_num_token:128
[I][                             Run][ 679]: input_num_token:128
[I][                             Run][ 679]: input_num_token:128
[I][                             Run][ 679]: input_num_token:128
[I][                             Run][ 679]: input_num_token:128
[I][                             Run][ 679]: input_num_token:128
[I][                             Run][ 679]: input_num_token:128
[I][                             Run][ 679]: input_num_token:40
[I][                             Run][ 816]: ttft: 15817.47 ms
1. **金字塔 (Pyramids)**  
   - **英文**: Pyramids  
   - **位置**: 埃及 (Egypt)

2. **长城 (Great Wall of China)**  
   - **英文**: Great Wall of China  
   - **位置**: 中国 (China)

3. **自由女神像 (Statute of Liberty)**  
   - **英文**: Statue of Liberty  
   - **位置**: 美国 (United States)

4. **兵马俑 (Terracotta Army)**  
   - **英文**: Terracotta Army  
   - **位置**: 中国 (China)

[N][                             Run][ 969]: hit eos,avg 2.05 token/s
```
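The log above shows the 1064-token prompt being prefilled in 9 chunks of at most 128 tokens (`prefill_split_num : 9`), which is what chunked prefill does (the model directory name ends in `chunk_prefill_1280`, the prefill-length cap). A minimal sketch of that split, with `prefill_chunks` as a hypothetical helper name:

```python
def prefill_chunks(num_tokens: int, chunk: int = 128) -> list[int]:
    """Split a prompt into fixed-size prefill chunks; the last chunk keeps the remainder."""
    full, rem = divmod(num_tokens, chunk)
    return [chunk] * full + ([rem] if rem else [])

# 1064 prompt tokens with a 128-token prefill window, as in the log:
chunks = prefill_chunks(1064)
print(len(chunks), chunks)  # 9 chunks: eight of 128 tokens plus one of 40
```

The same arithmetic explains the video demo below: a 509-token prompt splits into 4 chunks (128, 128, 128, 125).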

### Video understanding demo

Pre-process the frames of the video file into 308x308 images before running the demo.
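How you produce the 308x308 frames is up to you (ffmpeg or Pillow are typical choices). Purely as an illustration, here is a dependency-free nearest-neighbour resize over a row-major pixel grid; `resize_nearest` is a hypothetical helper, not part of this repo:

```python
def resize_nearest(pixels, out_w: int = 308, out_h: int = 308):
    """Nearest-neighbour resize of a row-major 2D grid of pixel values."""
    in_h, in_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]

# e.g. downscale a 640x480 frame (here just coordinate tuples) to 308x308
frame = [[(y, x) for x in range(640)] for y in range(480)]
small = resize_nearest(frame)
```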

Start the tokenizer server for the video understanding demo:

```
python qwen2_tokenizer_video_308.py --port 12345
```

Then run the video understanding demo:
```
(base) axera@dell:~/lhj/Qwen2.5-VL-7B-Instruct$ bash run_qwen2_5vl_video.sh 
[I][                            Init][ 162]: LLM init start
[I][                            Init][ 267]: IMAGE_CONTEXT_TOKEN: 151656, IMAGE_START_TOKEN: 151652
[I][                            Init][ 328]: image encoder output float32

[I][                            Init][ 340]: max_token_len : 2047
[I][                            Init][ 343]: kv_cache_size : 512, kv_cache_num: 2047
[I][                            Init][ 351]: prefill_token_num : 128
[I][                            Init][ 355]: grp: 1, prefill_max_token_num : 1
[I][                            Init][ 355]: grp: 2, prefill_max_token_num : 128
[I][                            Init][ 355]: grp: 3, prefill_max_token_num : 256
[I][                            Init][ 355]: grp: 4, prefill_max_token_num : 384
[I][                            Init][ 355]: grp: 5, prefill_max_token_num : 512
[I][                            Init][ 355]: grp: 6, prefill_max_token_num : 640
[I][                            Init][ 355]: grp: 7, prefill_max_token_num : 768
[I][                            Init][ 355]: grp: 8, prefill_max_token_num : 896
[I][                            Init][ 355]: grp: 9, prefill_max_token_num : 1024
[I][                            Init][ 355]: grp: 10, prefill_max_token_num : 1152
[I][                            Init][ 355]: grp: 11, prefill_max_token_num : 1280
[I][                            Init][ 359]: prefill_max_token_num : 1280
[I][                     load_config][ 282]: load config: 
{
    "enable_repetition_penalty": false,
    "enable_temperature": true,
    "enable_top_k_sampling": true,
    "enable_top_p_sampling": false,
    "penalty_window": 30,
    "repetition_penalty": 2,
    "temperature": 0.1,
    "top_k": 10,
    "top_p": 0.8
}

[I][                            Init][ 456]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
prompt >> 描述这个视频的内容
image >> video
video/frame_0000.jpg
video/frame_0008.jpg
video/frame_0016.jpg
video/frame_0024.jpg
video/frame_0032.jpg
video/frame_0040.jpg
video/frame_0048.jpg
video/frame_0056.jpg
[I][                          Encode][ 528]: pixel_values,size:4
[I][                          Encode][ 554]: image encode time : 1546.058960 ms, size : 4
[I][                          Encode][ 596]: input_ids size:509
[I][                          Encode][ 604]: offset 15
[I][                          Encode][ 620]: img_embed.size:4, 433664
[I][                          Encode][ 625]: offset:136
[I][                          Encode][ 625]: offset:257
[I][                          Encode][ 625]: offset:378
[I][                          Encode][ 634]: out_embed size:1824256
[I][                          Encode][ 636]: position_ids size:509
[I][                             Run][ 655]: input token num : 509, prefill_split_num : 4
[I][                             Run][ 689]: input_num_token:128
[I][                             Run][ 689]: input_num_token:128
[I][                             Run][ 689]: input_num_token:128
[I][                             Run][ 689]: input_num_token:125
[I][                             Run][ 826]: ttft: 5081.97 ms
这张图片展示了两只土拨鼠在户外的山地环境中进行互动。它们似乎在进行一种类似打斗的行为,可能是在争夺领地或展示攻击性。背景是蓝天和山脉,环境看起来非常自然和开阔。土拨鼠的毛色主要是棕色和灰色,带有白色的斑纹。它们的姿势和动作显示出它们正在积极地互动。

[N][                             Run][ 979]: hit eos,avg 2.08 token/s
```

The prompt 描述这个视频的内容 means "Describe the content of this video". The model's answer translates to: "This image shows two marmots interacting in an outdoor mountain environment. They seem to be engaged in a fight-like behavior, possibly competing for territory or displaying aggression. The background is blue sky and mountains, and the setting looks very natural and open. The marmots' fur is mainly brown and gray with white markings. Their postures and movements show that they are actively interacting."
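The `post_config.json` shown in the log enables top-k sampling (`top_k: 10`) with `temperature: 0.1`, while top-p sampling and the repetition penalty are disabled. A minimal sketch of that decode-time sampling scheme (not the runtime's actual implementation):

```python
import math
import random

def sample_top_k(logits, k=10, temperature=0.1, rng=random):
    """Pick a token id: keep the k highest logits, temperature-scale, softmax, sample."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    scaled = [logits[i] / temperature for i in top]
    m = max(scaled)                             # subtract max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    r = rng.random() * sum(weights)
    acc = 0.0
    for token_id, w in zip(top, weights):
        acc += w
        if r <= acc:
            return token_id
    return top[-1]
```

With a temperature as low as 0.1 the softmax is sharply peaked, so decoding is close to greedy, which matches the fairly deterministic answers in the demos above.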