---
tags:
- w4a16
- int4
- vllm
- vision
license: apache-2.0
license_link: https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md
language:
- en
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
library_name: transformers
---

# Llama-3.2-11B-Vision-Instruct-quantized.w4a16

## Model Overview
- **Model Architecture:** Llama-3.2-11B-Vision-Instruct
  - **Input:** Vision-Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT4
  - **Activation quantization:** FP16
- **Release Date:** 1/31/2025
- **Version:** 1.0
- **Model Developers:** Neural Magic

Quantized version of [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct).

### Model Optimizations

This model was obtained by quantizing the weights of [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) to the INT4 data type, ready for inference with vLLM >= 0.5.2.

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from transformers import AutoProcessor
from vllm.assets.image import ImageAsset
from vllm import LLM, SamplingParams

# prepare model
model_id = "neuralmagic/Llama-3.2-11B-Vision-Instruct-quantized.w4a16"
llm = LLM(
    model=model_id,
    max_model_len=4096,
    max_num_seqs=16,
    limit_mm_per_prompt={"image": 1},
)
processor = AutoProcessor.from_pretrained(model_id)

# prepare inputs
question = "What is the content of this image?"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": f"{question}"},
        ],
    },
]
prompt = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=False
)
image = ImageAsset("cherry_blossom").pil_image.convert("RGB")
inputs = {
    "prompt": prompt,
    "multi_modal_data": {
        "image": image
    },
}

# generate response
print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.2, max_tokens=64))
print(f"PROMPT : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
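For example, once a local server is running, the quantized model can be queried with the standard `openai` client. The sketch below is illustrative: the serve command, the default endpoint `http://localhost:8000/v1`, and the image URL are placeholder assumptions rather than fixed requirements.

```python
# Start the server first, for example:
#   vllm serve neuralmagic/Llama-3.2-11B-Vision-Instruct-quantized.w4a16 --max-model-len 4096
from openai import OpenAI

# A local vLLM server does not validate the API key, but the client requires one.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="neuralmagic/Llama-3.2-11B-Vision-Instruct-quantized.w4a16",
    messages=[
        {
            "role": "user",
            "content": [
                # Placeholder URL; substitute any reachable image.
                {"type": "image_url", "image_url": {"url": "https://example.com/cherry_blossom.jpg"}},
                {"type": "text", "text": "What is the content of this image?"},
            ],
        }
    ],
    max_tokens=64,
    temperature=0.2,
)
print(response.choices[0].message.content)
```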
## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below, as part of a multimodal announcement blog.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.tracing import TraceableMllamaForConditionalGeneration

# Load model.
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model = TraceableMllamaForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype="auto"
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Oneshot arguments
DATASET_ID = "flickr30k"
DATASET_SPLIT = {"calibration": "test[:512]"}
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Define a oneshot data collator for multimodal inputs.
def data_collator(batch):
    assert len(batch) == 1
    return {key: torch.tensor(value) for key, value in batch[0].items()}

# Recipe: quantize Linear weights to INT4 (W4A16), keeping the LM head,
# multimodal projector, and vision tower in their original precision.
recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W4A16",
        ignore=["re:.*lm_head", "re:multi_modal_projector.*", "re:vision_model.*"],
    ),
]

# Perform oneshot
oneshot(
    model=model,
    tokenizer=model_id,
    dataset=DATASET_ID,
    splits=DATASET_SPLIT,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    trust_remote_code_model=True,
    data_collator=data_collator,
)
```

## License

Use of Llama 3.2 is governed by the [Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) (a custom, commercial license agreement).