---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
license: apache-2.0
---

## Model Details

This is an example model demonstrating how to run the AutoRound format for a visual language model on vLLM. Some visual modules have been quantized to 8-bit precision.

## Run the Model

Running this model on vLLM requires the changes in https://github.com/vllm-project/vllm/pull/21802.

~~~bash
vllm serve Intel/Qwen2.5-VL-7B-Instruct-int4-mixed-AutoRound --dtype bfloat16 --port 8001 --max-model-len 10000
~~~

~~~bash
curl --noproxy '*' http://localhost:8001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Intel/Qwen2.5-VL-7B-Instruct-int4-mixed-AutoRound",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
            }
          },
          {
            "type": "text",
            "text": "Describe this image."
          }
        ]
      }
    ],
    "max_tokens": 512
  }'
~~~

An equivalent request through the OpenAI-compatible Python client is sketched at the end of this card.

## Generate the Model

~~~python
import torch
from auto_round import AutoRoundMLLM
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor

model_name = "Qwen/Qwen2.5-VL-7B-Instruct"

# Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Mixed-precision layer config for the vision tower: keep the MLP
# projections in 16-bit and quantize the remaining linear layers to 8-bit.
# Layers not listed here (the language model) use the default 4-bit setting.
layer_config = {}
for n, m in model.named_modules():
    if "visual" in n:
        if not isinstance(m, torch.nn.Linear):
            continue
        if "mlp.gate_proj" in n or "mlp.down_proj" in n or "mlp.up_proj" in n:
            layer_config[n] = {"bits": 16}
        else:
            layer_config[n] = {"bits": 8}

autoround = AutoRoundMLLM(
    model,
    tokenizer,
    processor=processor,
    iters=200,
    group_size=128,
    layer_config=layer_config,
)
autoround.quantize_and_save("./Qwen2.5-VL-7B-Instruct-autoround")
~~~
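After quantization, the exported checkpoint can be sanity-checked directly with transformers before serving it. The following is a minimal sketch, assuming `auto-round` and `qwen-vl-utils` are installed and that the output directory matches the one used above; the image URL and prompt are only examples.

~~~python
# Minimal sanity check of the quantized checkpoint (a sketch, not part of
# the original recipe). Assumes auto-round and qwen-vl-utils are installed.
from auto_round import AutoRoundConfig  # registers the AutoRound format with transformers
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

quantized_dir = "./Qwen2.5-VL-7B-Instruct-autoround"  # output dir from the step above
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    quantized_dir, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(quantized_dir)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Generate and decode only the newly produced tokens.
generated_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
~~~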
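For programmatic access, the curl request from the "Run the Model" section can also be issued through the OpenAI-compatible Python client. This is a minimal sketch, assuming the vLLM server above is listening on localhost:8001; the `api_key` value is a placeholder, since vLLM does not check it unless started with `--api-key`.

~~~python
# Same request as the curl example, via the OpenAI-compatible client
# (pip install openai). Sketch only; server details are assumptions.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="Intel/Qwen2.5-VL-7B-Instruct-int4-mixed-AutoRound",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
                },
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
~~~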