---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
license: apache-2.0
---

## Model Details
This is an example model demonstrating how to run a vision-language model in the AutoRound format on vLLM. Most layers are quantized to 4-bit, while some visual modules are quantized to 8-bit precision (see the generation script below for the exact layer configuration).


## Run The Model


The vLLM changes from https://github.com/vllm-project/vllm/pull/21802 are required.

~~~bash
vllm serve Intel/Qwen2.5-VL-7B-Instruct-int4-mixed-AutoRound --dtype bfloat16 --port 8001 --max-model-len 10000
~~~

~~~bash
curl --noproxy '*'   http://localhost:8001/v1/chat/completions   -H "Content-Type: application/json"   -d '{
    "model": "Intel/Qwen2.5-VL-7B-Instruct-int4-mixed-AutoRound",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
            }
          },
          {
            "type": "text",
            "text": "请描述这张图"
          }
        ]
      }
    ],
    "max_tokens": 512
  }'
~~~
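
The same request can also be sent from Python with the `openai` client. This is a minimal sketch, not part of the original recipe: the base URL and the `"EMPTY"` API key placeholder are assumptions that must match the `vllm serve` command above.

~~~python
from openai import OpenAI

# Assumption: the vLLM server started above is reachable on port 8001;
# vLLM does not check the API key by default, so a placeholder is fine.
client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Intel/Qwen2.5-VL-7B-Instruct-int4-mixed-AutoRound",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
                    },
                },
                {"type": "text", "text": "Please describe this image"},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
~~~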



## Generate the model

~~~python
import torch
from auto_round import AutoRoundMLLM
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor

model_name = "Qwen/Qwen2.5-VL-7B-Instruct"

# Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Mixed-precision layer config: keep the visual MLP projections at 16-bit,
# quantize the remaining visual linear layers to 8-bit, and leave everything
# else (the language model) at the default 4-bit.
layer_config = {}
for n, m in model.named_modules():
    if "visual" in n:
        if not isinstance(m, torch.nn.Linear):
            continue
        if "mlp.gate_proj" in n or "mlp.down_proj" in n or "mlp.up_proj" in n:
            layer_config[n] = {"bits": 16}
        else:
            layer_config[n] = {"bits": 8}

autoround = AutoRoundMLLM(
    model, tokenizer, processor=processor, iters=200, group_size=128, layer_config=layer_config
)
autoround.quantize_and_save("./Qwen2.5-VL-7B-Instruct-autoround")
~~~
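
To sanity-check the exported checkpoint, you can point vLLM's offline API at the output directory. This is a rough sketch, not part of the original recipe: it assumes a vLLM build that includes the PR referenced above, and it reuses the output path from `quantize_and_save`.

~~~python
from vllm import LLM, SamplingParams

# Assumption: the output directory matches the quantize_and_save call above,
# and max_model_len mirrors the serve command earlier in this card.
llm = LLM(model="./Qwen2.5-VL-7B-Instruct-autoround",
          dtype="bfloat16", max_model_len=10000)
outputs = llm.generate(
    ["Describe what a vision-language model does."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
~~~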