## Model Details

This is an example model demonstrating how to run the AutoRound format for a vision-language model on vLLM. The model is quantized to INT4 with a mixed-precision recipe: most linear layers in the vision tower are quantized to 8-bit, while the vision MLP projections are kept in 16-bit (see the recipe under "Generate the Model" below).

## Run the Model

vLLM PR https://github.com/vllm-project/vllm/pull/21802 is required to run this model.

Start the server:

```bash
vllm serve Intel/Qwen2.5-VL-7B-Instruct-int4-mixed-AutoRound --dtype bfloat16 --port 8001 --max-model-len 10000
```

Then send a chat completion request:

```bash
curl --noproxy '*' http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Intel/Qwen2.5-VL-7B-Instruct-int4-mixed-AutoRound",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
            }
          },
          {
            "type": "text",
            "text": "Please describe this image."
          }
        ]
      }
    ],
    "max_tokens": 512
  }'
```
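
The same request can also be sent from Python with the `openai` client. This is a minimal sketch assuming the server started above is listening on port 8001; the `api_key` value is an arbitrary placeholder, since vLLM does not validate it by default.

```python
from openai import OpenAI

# Point the client at the local vLLM OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Intel/Qwen2.5-VL-7B-Instruct-int4-mixed-AutoRound",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
                    },
                },
                {"type": "text", "text": "Please describe this image."},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```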

## Generate the Model

```python
import torch
from auto_round import AutoRoundMLLM
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor

model_name = "Qwen/Qwen2.5-VL-7B-Instruct"

# default: load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Mixed-precision recipe for the vision tower: keep the MLP projections in
# 16-bit, quantize the remaining visual linear layers to 8-bit, and let every
# other layer fall back to the default 4-bit configuration.
layer_config = {}
for n, m in model.named_modules():
    if "visual" in n:
        if not isinstance(m, torch.nn.Linear):
            continue
        if "mlp.gate_proj" in n or "mlp.down_proj" in n or "mlp.up_proj" in n:
            layer_config[n] = {"bits": 16}
        else:
            layer_config[n] = {"bits": 8}

autoround = AutoRoundMLLM(
    model,
    tokenizer,
    processor=processor,
    iters=200,
    group_size=128,
    layer_config=layer_config,
)
autoround.quantize_and_save("./Qwen2.5-VL-7B-Instruct-autoround")
```
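
Once quantization finishes, the exported directory can be smoke-tested with vLLM's offline Python API instead of the HTTP server. This is a minimal sketch assuming the same patched vLLM build (PR #21802 above); `LLM.chat` accepts the same OpenAI-style multimodal messages used in the curl example.

```python
from vllm import LLM

# Load the freshly quantized local checkpoint; parameters mirror the
# `vllm serve` command above.
llm = LLM(
    model="./Qwen2.5-VL-7B-Instruct-autoround",
    dtype="bfloat16",
    max_model_len=10000,
)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
                },
            },
            {"type": "text", "text": "Please describe this image."},
        ],
    }
]

# LLM.chat applies the model's chat template before generation.
outputs = llm.chat(messages)
print(outputs[0].outputs[0].text)
```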