---
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
license: llama3.2
library_name: transformers
tags:
- autoround
- intel
- gptq
- woq
- meta
- pytorch
- llama
- llama-3
model_name: Llama 3.2 11B Vision Instruct
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
inference: false
model_creator: meta-llama
pipeline_tag: text-generation
prompt_template: '{prompt} '
quantized_by: fbaldassarri
---

## Model Information

Converted version of [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) to [OpenVINO](https://github.com/openvinotoolkit/openvino) Intermediate Representation (IR) for inference on CPU devices.

The model consists of two parts:

- **Image Encoder** (`openvino_vision_encoder.bin`), which encodes input images into the LLM cross-attention states space;
- **Language Model** (`openvino_language_model.bin`), which generates the answer based on the cross-attention states provided by the Image Encoder and the input tokens.

To reduce memory consumption, weight compression has been applied using the [Neural Network Compression Framework (NNCF)](https://github.com/openvinotoolkit/nncf), which provides 4-bit/8-bit mixed weight quantization as a compression method primarily designed to optimize LLMs.

Note: the compressed language model can be found as `llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.bin`/`.xml`:

- 4 bits (INT4)
- group size = 64
- asymmetric quantization
- AWQ method

Finally, an INT8-quantized version of the Image Encoder only can be found as `openvino_vision_encoder_int8.bin`/`.xml`.

## Replication Recipe

### Step 1: Install Requirements

I suggest installing the requirements into a dedicated python-virtualenv or conda environment.
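For example (the environment name below is illustrative):

```shell
# Create and activate a dedicated virtual environment
python3 -m venv ov-mllama-env
source ov-mllama-env/bin/activate
```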
```bash
pip install -q "torch>=2.1" "torchvision" "Pillow" "tqdm" "datasets>=2.14.6" "gradio>=4.36" "nncf>=2.13.0" --extra-index-url https://download.pytorch.org/whl/cpu
pip install -q "transformers>=4.45" --extra-index-url https://download.pytorch.org/whl/cpu
pip install -Uq --pre "openvino>2024.4.0" --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
```

### Step 2: Convert the Model to OpenVINO Intermediate Representation (IR)

```python
from pathlib import Path

from ov_mllama_helper import convert_mllama

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model_dir = Path(model_id.split("/")[-1]) / "OpenVino"

convert_mllama(model_id, model_dir)
```

### Step 3: INT4 Compression

```python
from ov_mllama_compression import compress, compression_widgets_helper

# Interactive widgets for selecting the compression scenario (run in a notebook)
compression_scenario, compress_args = compression_widgets_helper()
compression_scenario

compression_kwargs = {key: value.value for key, value in compress_args.items()}

language_model_path = compress(model_dir, **compression_kwargs)
```

### Step 4: INT8 Image Encoder Optimization

```python
from ov_mllama_compression import vision_encoder_selection_widget

# `device` is the notebook's device-selection widget (e.g. device.value == "CPU")
vision_encoder_options = vision_encoder_selection_widget(device.value)
vision_encoder_options

import gc

import nncf
import openvino as ov
from transformers import AutoProcessor

from data_preprocessing import prepare_dataset_vision

processor = AutoProcessor.from_pretrained(model_dir)
core = ov.Core()

fp_vision_encoder_path = model_dir / "openvino_vision_encoder.xml"
int8_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8.xml")
int8_wc_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8_wc.xml")

# Calibration set of 100 samples for post-training quantization
calibration_data = prepare_dataset_vision(processor, 100)
ov_model = core.read_model(fp_vision_encoder_path)

calibration_dataset = nncf.Dataset(calibration_data)

quantized_model = nncf.quantize(
    model=ov_model,
    calibration_dataset=calibration_dataset,
    model_type=nncf.ModelType.TRANSFORMER,
    advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.6),
)

ov.save_model(quantized_model, int8_vision_encoder_path)

# Free memory held by the full-precision and quantized graphs
del quantized_model
del ov_model
del calibration_dataset
del calibration_data
gc.collect()

vision_encoder_path = int8_vision_encoder_path
```

## License

[Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE)

## Disclaimer

This quantized model comes with no warranty. It has been developed only for research purposes.
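To build intuition for the INT4 asymmetric, group-wise scheme described above (group size = 64), the following standalone sketch quantizes a toy weight vector group by group. It illustrates only the arithmetic of asymmetric 4-bit quantization; it is not NNCF's actual implementation, and all names are illustrative.

```python
import random

GROUP_SIZE = 64
LEVELS = 15  # INT4 asymmetric: 16 levels, q in [0, 15]

def quantize_group(weights):
    """Asymmetric quantization of one group: q = round((w - w_min) / scale)."""
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / LEVELS or 1.0  # guard against a constant group
    q = [round((w - w_min) / scale) for w in weights]
    return q, scale, w_min

def dequantize_group(q, scale, w_min):
    return [qi * scale + w_min for qi in q]

random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(256)]  # 4 groups of 64

reconstructed = []
for start in range(0, len(weights), GROUP_SIZE):
    group = weights[start:start + GROUP_SIZE]
    q, scale, w_min = quantize_group(group)
    assert all(0 <= qi <= 15 for qi in q)  # each value fits in 4 bits
    reconstructed.extend(dequantize_group(q, scale, w_min))

# Per-group scale/zero-point keeps the rounding error within scale / 2
max_err = max(abs(a - b) for a, b in zip(weights, reconstructed))
print(f"max reconstruction error: {max_err:.4f}")
```

Because each group stores its own scale and minimum (the asymmetric zero point), the worst-case rounding error is bounded by half the group's scale, which is why smaller group sizes trade a little extra metadata for better accuracy.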