---
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
license: apache-2.0
---

## Model Details

This is an example model demonstrating how to run the AutoRound format for a vision-language model on vLLM. Some of the visual modules have been quantized to 8-bit precision.

## Run The Model

The vLLM PR https://github.com/vllm-project/vllm/pull/21802 is required.

~~~bash
vllm serve Intel/Qwen2.5-VL-7B-Instruct-int4-mixed-AutoRound --dtype bfloat16 --port 8001 --max-model-len 10000

curl --noproxy '*' http://localhost:8001/v1/chat/completions -H "Content-Type: application/json" -d '{
    ...
    ],
    "max_tokens": 512
}'
~~~
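Since vLLM's server is OpenAI-compatible, the same endpoint can also be queried from Python. Below is a minimal sketch (not from the original card) assuming the server above is running on port 8001; the image URL and prompt are illustrative placeholders.

~~~python
from openai import OpenAI

# vLLM exposes an OpenAI-compatible API; the key is unused but must be set.
client = OpenAI(base_url="http://localhost:8001/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Intel/Qwen2.5-VL-7B-Instruct-int4-mixed-AutoRound",
    messages=[
        {
            "role": "user",
            "content": [
                # Placeholder image URL; substitute your own.
                {"type": "image_url", "image_url": {"url": "https://example.com/demo.jpg"}},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
~~~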

## Generate the model

~~~python
import torch
from auto_round import AutoRoundMLLM
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor

model_name = "Qwen/Qwen2.5-VL-7B-Instruct"

# Load the model on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

# Mixed-precision recipe: keep the visual MLP projections at 16-bit,
# quantize the remaining visual Linear layers to 8-bit, and leave every
# other layer at the 4-bit default.
layer_config = {}
for n, m in model.named_modules():
    if "visual" in n:
        if not isinstance(m, torch.nn.Linear):
            continue
        if "mlp.gate_proj" in n or "mlp.down_proj" in n or "mlp.up_proj" in n:
            layer_config[n] = {"bits": 16}
        else:
            layer_config[n] = {"bits": 8}

autoround = AutoRoundMLLM(
    model,
    tokenizer,
    processor=processor,
    iters=200,
    group_size=128,
    layer_config=layer_config,
)
autoround.quantize_and_save("./Qwen2.5-VL-7B-Instruct-autoround")
~~~
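A common motivation for such mixed recipes is to protect the accuracy-sensitive vision tower while the language model weights carry the 4-bit compression. As a quick sanity check before launching the (fairly long) quantization run, you can tally how many layers each precision was assigned; this is an illustrative snippet, not part of the original recipe:

~~~python
from collections import Counter

# Tally the per-layer bit widths collected above; layers not listed in
# layer_config fall back to the 4-bit default.
print(Counter(cfg["bits"] for cfg in layer_config.values()))
~~~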