---
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
license: llama3.2
library_name: transformers
tags:
- autoround
- intel
- gptq
- woq
- meta
- pytorch
- llama
- llama-3
model_name: Llama 3.2 11B Vision Instruct
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
inference: false
model_creator: meta-llama
pipeline_tag: text-generation
prompt_template: '{prompt}
'
quantized_by: fbaldassarri
---
## Model Information
Converted version of [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) to [OpenVINO](https://github.com/openvinotoolkit/openvino) Intermediate Representation (IR) for inference on CPU devices.
The model consists of two parts:
- **Image Encoder** (openvino_vision_encoder.bin), which encodes input images into the LLM cross-attention states space;
- **Language Model** (openvino_language_model.bin), which generates the answer based on the cross-attention states provided by the Image Encoder and the input tokens.
To reduce memory consumption, weight compression has been applied using the [Neural Network Compression Framework (NNCF)](https://github.com/openvinotoolkit/nncf), which provides 4-bit/8-bit mixed weight quantization as a compression method primarily designed to optimize LLMs.
Note: the compressed language model can be found as llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.bin/.xml, compressed with:
- 4 bits (INT4)
- group size = 64
- asymmetric quantization
- AWQ method
Finally, an INT8-quantized version of the Image Encoder only can be found as openvino_vision_encoder_int8.bin/.xml.
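For reference, the settings encoded in the compressed file name map roughly onto NNCF's weight-compression API. The following is a minimal sketch of that mapping, assuming the language model IR has already been exported (Step 2 below) and that a calibration dataset of language-model inputs (`calibration_data`) is available; the actual recipe uses the `compress` helper shown in Step 3.
```
import nncf
import openvino as ov
from pathlib import Path

model_dir = Path("Llama-3.2-11B-Vision-Instruct/OpenVino")  # assumed layout, see Step 2
core = ov.Core()
lm_model = core.read_model(model_dir / "openvino_language_model.xml")

# calibration_data: a list of language-model input dicts; its preparation is
# omitted here (the Step 3 helper prepares it internally)
compressed_lm = nncf.compress_weights(
    lm_model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,                            # 4-bit asymmetric quantization
    group_size=64,                                                      # group size = 64
    ratio=1.0,                                                          # "r10": all eligible layers in INT4
    all_layers=True,                                                    # "all_layers"
    awq=True,                                                           # AWQ method (requires a dataset)
    scale_estimation=True,                                              # "scale"
    sensitivity_metric=nncf.SensitivityMetric.MAX_ACTIVATION_VARIANCE,  # "max_activation_variance"
    dataset=nncf.Dataset(calibration_data),
)
ov.save_model(compressed_lm, model_dir / "llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.xml")
```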
## Replication Recipe
### Step 1 Install Requirements
I suggest installing the requirements into a dedicated Python virtualenv or conda environment.
```
pip install -q "torch>=2.1" "torchvision" "Pillow" "tqdm" "datasets>=2.14.6" "gradio>=4.36" "nncf>=2.13.0" --extra-index-url https://download.pytorch.org/whl/cpu
pip install -q "transformers>=4.45" --extra-index-url https://download.pytorch.org/whl/cpu
pip install -Uq --pre "openvino>2024.4.0" --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
```
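Optionally, run a quick sanity check that the environment provides the expected packages (a minimal sketch; exact versions depend on when you run the install):
```
import openvino as ov
import nncf
import transformers

# Print installed versions to confirm the environment matches the requirements above
print("openvino:", ov.get_version())
print("nncf:", nncf.__version__)
print("transformers:", transformers.__version__)
```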
### Step 2 Convert the model to OpenVINO Intermediate Representation (IR)
```
from pathlib import Path

# Conversion helper, assumed to be available locally (e.g. from the OpenVINO notebooks mllama example)
from ov_mllama_helper import convert_mllama

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model_dir = Path(model_id.split("/")[-1]) / "OpenVino"

# Download the original model and export both parts (Image Encoder and Language Model) to OpenVINO IR
convert_mllama(model_id, model_dir)
```
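As a quick check (a sketch, assuming the file layout described in the Model Information section above), the conversion should have produced both IR parts:
```
# Verify that the two IR parts described above were produced
for name in ["openvino_vision_encoder.xml", "openvino_language_model.xml"]:
    print(name, "found:", (model_dir / name).exists())
```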
### Step 3 INT4 Compression
```
# Compression helpers, assumed to be available locally alongside ov_mllama_helper
from ov_mllama_compression import compress
from ov_mllama_compression import compression_widgets_helper

# Interactive widget for selecting the compression scenario (display it when running in a notebook)
compression_scenario, compress_args = compression_widgets_helper()
compression_scenario

# Collect the selected options and compress the language model weights to INT4
compression_kwargs = {key: value.value for key, value in compress_args.items()}
language_model_path = compress(model_dir, **compression_kwargs)
```
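The returned `language_model_path` should point at the compressed language model IR (the llm_int4_... file named in the Model Information section above); a quick check:
```
from pathlib import Path

# Confirm the compressed language model IR was written
print(language_model_path)
print("found:", Path(language_model_path).exists())
```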
### Step 4 INT8 Image Encoder Optimization
```
from transformers import AutoProcessor
import nncf
import openvino as ov
import gc

# Dataset helper, assumed to be available locally alongside the other notebook helpers
from data_preprocessing import prepare_dataset_vision

# Optional widget for picking the vision-encoder precision; "device" is the
# device-selection widget from the notebook (pass e.g. "CPU" when running as a script)
from ov_mllama_compression import vision_encoder_selection_widget
vision_encoder_options = vision_encoder_selection_widget(device.value)
vision_encoder_options

processor = AutoProcessor.from_pretrained(model_dir)
core = ov.Core()

fp_vision_encoder_path = model_dir / "openvino_vision_encoder.xml"
int8_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8.xml")
# Alternative output path for a weight-compression-only variant (not used below)
int8_wc_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8_wc.xml")

# Prepare 100 calibration samples and run post-training INT8 quantization of the vision encoder
calibration_data = prepare_dataset_vision(processor, 100)
ov_model = core.read_model(fp_vision_encoder_path)
calibration_dataset = nncf.Dataset(calibration_data)
quantized_model = nncf.quantize(
    model=ov_model,
    calibration_dataset=calibration_dataset,
    model_type=nncf.ModelType.TRANSFORMER,
    advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.6),
)
ov.save_model(quantized_model, int8_vision_encoder_path)

# Free the memory used by the calibration data and intermediate models
del quantized_model
del ov_model
del calibration_dataset
del calibration_data
gc.collect()

vision_encoder_path = int8_vision_encoder_path
```
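Once the recipe completes, the compressed IR files can be compiled for CPU inference. Below is a minimal sketch assuming the file names and layout from the steps above; the full multimodal generation loop (image preprocessing, cross-attention state handling, token sampling) follows the notebook helpers used in the previous steps.
```
import openvino as ov
from pathlib import Path

model_dir = Path("Llama-3.2-11B-Vision-Instruct/OpenVino")  # assumed layout from Step 2
core = ov.Core()

# Compile the INT8 vision encoder and the INT4 language model for CPU inference
vision_encoder = core.compile_model(model_dir / "openvino_vision_encoder_int8.xml", "CPU")
language_model = core.compile_model(
    model_dir / "llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.xml", "CPU"
)
```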
## License
[Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE)
## Disclaimer
This quantized model comes with no warranty. It has been developed only for research purposes.