---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
license: apache-2.0
---
## Model Details
This is an example model demonstrating how to run the AutoRound format for a vision-language model on vLLM. Most layers are quantized to INT4, while some visual modules are kept at 8-bit precision (see the generation recipe below).
## Run the Model
Running this model requires vLLM with the changes from https://github.com/vllm-project/vllm/pull/21802.
~~~bash
vllm serve Intel/Qwen2.5-VL-7B-Instruct-int4-mixed-AutoRound --dtype bfloat16 --port 8001 --max-model-len 10000
~~~
~~~bash
curl --noproxy '*' http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Intel/Qwen2.5-VL-7B-Instruct-int4-mixed-AutoRound",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
            }
          },
          {
            "type": "text",
            "text": "Please describe this image."
          }
        ]
      }
    ],
    "max_tokens": 512
  }'
~~~
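The server can also be queried from Python through its OpenAI-compatible API. The snippet below is a minimal sketch, assuming the `openai` client package is installed and the server was started on port 8001 as shown above; the API key is a placeholder, since vLLM does not require one by default.
~~~python
from openai import OpenAI

# Point the client at the local vLLM OpenAI-compatible server.
client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Intel/Qwen2.5-VL-7B-Instruct-int4-mixed-AutoRound",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
                    },
                },
                {"type": "text", "text": "Please describe this image."},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
~~~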
## Generate the Model
~~~python
import torch
from auto_round import AutoRoundMLLM
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, AutoTokenizer

model_name = "Qwen/Qwen2.5-VL-7B-Instruct"

# Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Mixed-precision recipe: keep the visual MLP projections in 16-bit,
# quantize the remaining visual Linear layers to 8-bit, and leave all
# other layers at the AutoRound default (4-bit).
layer_config = {}
for n, m in model.named_modules():
    if "visual" in n:
        if not isinstance(m, torch.nn.Linear):
            continue
        if "mlp.gate_proj" in n or "mlp.down_proj" in n or "mlp.up_proj" in n:
            layer_config[n] = {"bits": 16}
        else:
            layer_config[n] = {"bits": 8}

autoround = AutoRoundMLLM(
    model,
    tokenizer,
    processor=processor,
    iters=200,
    group_size=128,
    layer_config=layer_config,
)
autoround.quantize_and_save("./Qwen2.5-VL-7B-Instruct-autoround")
~~~
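As a quick sanity check on the recipe before quantizing, the per-layer bit-width assignments can be summarized directly from the `layer_config` dictionary built above (a minimal sketch, not part of the original recipe):
~~~python
from collections import Counter

# Count how many visual Linear layers were assigned each bit-width override.
bit_counts = Counter(cfg["bits"] for cfg in layer_config.values())
print(bit_counts)

# List each overridden layer with its assigned precision.
for name, cfg in sorted(layer_config.items()):
    print(f"{name}: {cfg['bits']}-bit")
~~~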