---
language:
  - en
  - de
  - fr
  - it
  - pt
  - hi
  - es
  - th
license: llama3.2
library_name: transformers
tags:
  - autoround
  - intel
  - gptq
  - woq
  - meta
  - pytorch
  - llama
  - llama-3
model_name: Llama 3.2 11B Vision Instruct
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
inference: false
model_creator: meta-llama
pipeline_tag: text-generation
prompt_template: '{prompt} '
quantized_by: fbaldassarri
---

## Model Information

This is a converted version of meta-llama/Llama-3.2-11B-Vision-Instruct to OpenVINO Intermediate Representation (IR) format for inference on CPU devices.

The model consists of two parts (see the sketch after this list):

- Image Encoder (`openvino_vision_encoder.bin`), which encodes input images into the LLM cross-attention states space;
- Language Model (`openvino_language_model.bin`), which generates answers based on the cross-attention states provided by the Image Encoder and on the input tokens.
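For illustration, here is a minimal sketch that loads both IR parts with the OpenVINO runtime and prints their inputs; the model directory path is an assumption, adjust it to wherever the `.xml`/`.bin` pairs were downloaded:

```python
# Minimal sketch: inspect the two IR parts with the OpenVINO runtime.
# The model directory below is an assumed local path.
from pathlib import Path

import openvino as ov

model_dir = Path("Llama-3.2-11B-Vision-Instruct") / "OpenVino"
core = ov.Core()

vision_encoder = core.read_model(model_dir / "openvino_vision_encoder.xml")
language_model = core.read_model(model_dir / "openvino_language_model.xml")

# The vision encoder maps pixel values to cross-attention states;
# the language model consumes those states together with the input token ids.
for name, model in [("vision encoder", vision_encoder), ("language model", language_model)]:
    print(name)
    for model_input in model.inputs:
        print("  input:", model_input.any_name, model_input.partial_shape)
```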

Then, to reduce memory consumption, weight compression has been applied using the Neural Network Compression Framework (NNCF), which provides 4-bit/8-bit mixed weight quantization as a compression method primarily designed to optimize LLMs.

Note: the compressed language model can be found as `llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.bin`/`.xml`, compressed with the following settings (a direct `nncf.compress_weights` sketch with equivalent parameters follows this list):

- 4 bits (INT4)
- group size = 64
- asymmetrical quantization
- AWQ method
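As a rough illustration of these settings, here is a hedged sketch of a direct `nncf.compress_weights` call with equivalent parameters; the actual recipe uses the `compress()` helper shown in Step 3, which also prepares the calibration dataset required by the data-aware options (AWQ, scale estimation, activation-variance sensitivity):

```python
# Hedged sketch of the 4-bit settings above via nncf.compress_weights.
# Paths are assumptions; Step 3 below uses the ov_mllama_compression.compress() helper instead.
import nncf
import openvino as ov

core = ov.Core()
ov_model = core.read_model("openvino_language_model.xml")  # assumed path to the FP language model

compressed_model = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,  # 4-bit, asymmetrical quantization
    group_size=64,
    ratio=1.0,
    all_layers=True,
    # The data-aware options below additionally need dataset=nncf.Dataset(calibration_data):
    # awq=True,
    # scale_estimation=True,
    # sensitivity_metric=nncf.SensitivityMetric.MAX_ACTIVATION_VARIANCE,
)
ov.save_model(
    compressed_model,
    "llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.xml",
)
```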

Finally, an INT8-quantized version of the Image Encoder alone can be found as `openvino_vision_encoder_int8.bin`/`.xml`.

## Replication Recipe

### Step 1: Install Requirements

I suggest installing the requirements into a dedicated Python virtualenv or conda environment.

```bash
pip install -q "torch>=2.1" "torchvision" "Pillow" "tqdm" "datasets>=2.14.6" "gradio>=4.36" "nncf>=2.13.0" --extra-index-url https://download.pytorch.org/whl/cpu

pip install -q "transformers>=4.45" --extra-index-url https://download.pytorch.org/whl/cpu

pip install -Uq --pre "openvino>2024.4.0" --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
```
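Optionally, a quick sanity check (a minimal sketch) that the expected toolchain versions were picked up:

```python
# Print the installed versions of the key packages used below.
import nncf
import openvino as ov
import transformers

print("openvino:", ov.__version__)                # expected: newer than 2024.4.0
print("nncf:", nncf.__version__)                  # expected: >= 2.13.0
print("transformers:", transformers.__version__)  # expected: >= 4.45
```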

### Step 2: Convert the model to OpenVINO Intermediate Representation (IR)

```python
from pathlib import Path

from ov_mllama_helper import convert_mllama

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model_dir = Path(model_id.split("/")[-1]) / "OpenVino"

convert_mllama(model_id, model_dir)
```
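After the conversion finishes, a small optional check of which IR files were produced and how large their weight files are:

```python
# List the exported IR files and the size of their weight (.bin) counterparts.
for xml_file in sorted(model_dir.glob("*.xml")):
    bin_file = xml_file.with_suffix(".bin")
    size_mb = bin_file.stat().st_size / (1024 * 1024) if bin_file.exists() else 0
    print(f"{xml_file.name}: {size_mb:.1f} MB of weights")
```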

### Step 3: INT4 Compression

```python
from ov_mllama_compression import compress, compression_widgets_helper

compression_scenario, compress_args = compression_widgets_helper()
compression_scenario

compression_kwargs = {key: value.value for key, value in compress_args.items()}

language_model_path = compress(model_dir, **compression_kwargs)
```

### Step 4: INT8 Image Encoder Optimization

```python
import gc

import nncf
import openvino as ov
from transformers import AutoProcessor

from data_preprocessing import prepare_dataset_vision
from ov_mllama_compression import vision_encoder_selection_widget

# "device" is expected to come from a device-selection widget defined earlier in the notebook
vision_encoder_options = vision_encoder_selection_widget(device.value)
vision_encoder_options

processor = AutoProcessor.from_pretrained(model_dir)
core = ov.Core()

fp_vision_encoder_path = model_dir / "openvino_vision_encoder.xml"
int8_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8.xml")
int8_wc_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8_wc.xml")

# Prepare a calibration dataset and run post-training quantization of the vision encoder
calibration_data = prepare_dataset_vision(processor, 100)
ov_model = core.read_model(fp_vision_encoder_path)
calibration_dataset = nncf.Dataset(calibration_data)

quantized_model = nncf.quantize(
    model=ov_model,
    calibration_dataset=calibration_dataset,
    model_type=nncf.ModelType.TRANSFORMER,
    advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.6),
)
ov.save_model(quantized_model, int8_vision_encoder_path)

del quantized_model
del ov_model
del calibration_dataset
del calibration_data
gc.collect()

vision_encoder_path = int8_vision_encoder_path
```
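For reference, a hedged sketch (an assumption, not spelled out in the recipe above) of the weight-only INT8 variant suggested by the otherwise unused `int8_wc_vision_encoder_path`:

```python
# Hedged sketch: weight-only INT8 compression of the vision encoder.
# Unlike nncf.quantize above, this needs no calibration data.
ov_model = core.read_model(fp_vision_encoder_path)
wc_model = nncf.compress_weights(ov_model, mode=nncf.CompressWeightsMode.INT8_ASYM)
ov.save_model(wc_model, int8_wc_vision_encoder_path)
```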

## License

Llama 3.2 Community License

## Disclaimer

This quantized model comes with no warranty. It has been developed only for research purposes.