---
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
license: llama3.2
library_name: transformers
tags:
- autoround
- intel
- gptq
- woq
- meta
- pytorch
- llama
- llama-3
model_name: Llama 3.2 11B Vision Instruct
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
inference: false
model_creator: meta-llama
pipeline_tag: text-generation
prompt_template: '{prompt}
'
quantized_by: fbaldassarri
---
## Model Information
Converted version of [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) to the [OpenVINO](https://github.com/openvinotoolkit/openvino) Intermediate Representation (IR) for inference on CPU devices.
The model consists of two parts:
- **Image Encoder** (openvino_vision_encoder.bin), which encodes input images into the LLM cross-attention states space;
- **Language Model** (openvino_language_model.bin), which generates answers based on the cross-attention states provided by the Image Encoder and the input tokens.
Then, to reduce memory consumption, weight compression was applied using the [Neural Network Compression Framework (NNCF)](https://github.com/openvinotoolkit/nncf), which provides 4-bit/8-bit mixed weight quantization as a compression method primarily designed to optimize LLMs.
Note: the compressed language model can be found as llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.bin/.xml, with the following settings:
- 4 bits (INT4)
- group size = 64
- asymmetric quantization
- AWQ method
Finally, an INT8-quantized version of the Image Encoder alone can be found as openvino_vision_encoder_int8.bin/.xml.
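As a minimal usage sketch (assuming the file names listed above and an OpenVINO runtime installed), the compressed IR files can be loaded and compiled for CPU like this; wiring the two parts together for end-to-end generation follows the helper classes used in the replication recipe below:
```
import openvino as ov

core = ov.Core()

# Load the INT4-compressed language model and the INT8 image encoder
# (file names as listed above; adjust the paths to your local model directory)
language_model = core.read_model(
    "llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.xml"
)
vision_encoder = core.read_model("openvino_vision_encoder_int8.xml")

compiled_lm = core.compile_model(language_model, "CPU")
compiled_ve = core.compile_model(vision_encoder, "CPU")
```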
## Replication Recipe
### Step 1 Install Requirements
I suggest installing the requirements into a dedicated Python virtualenv or conda environment.
```
pip install -q "torch>=2.1" "torchvision" "Pillow" "tqdm" "datasets>=2.14.6" "gradio>=4.36" "nncf>=2.13.0" --extra-index-url https://download.pytorch.org/whl/cpu
pip install -q "transformers>=4.45" --extra-index-url https://download.pytorch.org/whl/cpu
pip install -Uq --pre "openvino>2024.4.0" --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
```
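As a quick sanity check of the environment (a minimal sketch; the packages are those installed above):
```
import openvino as ov
import nncf
import transformers

# Print the installed versions to confirm the requirements were picked up
print("openvino:", ov.get_version())
print("nncf:", nncf.__version__)
print("transformers:", transformers.__version__)
```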
### Step 2 Convert the model to OpenVINO Intermediate Representation (IR)
```
from pathlib import Path
# Helper shipped with the OpenVINO notebooks recipe; exports both model parts to IR
from ov_mllama_helper import convert_mllama

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model_dir = Path(model_id.split("/")[-1]) / "OpenVino"
# Downloads the original weights and converts them to IR files under model_dir
convert_mllama(model_id, model_dir)
```
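After conversion, the IR files for both parts should be present in model_dir; a quick check (a sketch, assuming the conversion above completed):
```
# List the generated OpenVINO IR files (expect openvino_language_model.xml
# and openvino_vision_encoder.xml, each with a matching .bin weight file)
for ir_file in sorted(model_dir.glob("*.xml")):
    print(ir_file.name)
```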
### Step 3 INT4 Compression
```
# Compression helpers shipped with the recipe; the widget selects the
# 4-bit compression scenario interactively in a notebook
from ov_mllama_compression import compress, compression_widgets_helper

compression_scenario, compress_args = compression_widgets_helper()
compression_scenario  # renders the selection widget in a notebook

# Collect the selected options and run 4-bit weight compression on the LLM part
compression_kwargs = {key: value.value for key, value in compress_args.items()}
language_model_path = compress(model_dir, **compression_kwargs)
```
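For a non-interactive run, an equivalent sketch calls NNCF's weight compression directly (a hypothetical stand-in for compress(); the AWQ and activation-variance options used for the published checkpoint additionally require a calibration dataset, omitted here for brevity):
```
import nncf
import openvino as ov

core = ov.Core()
ov_model = core.read_model(model_dir / "openvino_language_model.xml")

# 4-bit asymmetric weight-only quantization with per-group scales,
# matching the INT4 / group-size-64 settings described above
compressed_model = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    group_size=64,
    all_layers=True,
)
ov.save_model(compressed_model, model_dir / "openvino_language_model_int4.xml")
```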
### Step 4 INT8 Image Encoder Optimization
```
from transformers import AutoProcessor
import nncf
import openvino as ov
import gc

from ov_mllama_compression import vision_encoder_selection_widget
from data_preprocessing import prepare_dataset_vision

# Widget for picking the vision-encoder optimization scenario; "CPU" stands in
# for the device-selection widget value (device.value) used in the notebook
vision_encoder_options = vision_encoder_selection_widget("CPU")
vision_encoder_options  # renders the widget in a notebook

processor = AutoProcessor.from_pretrained(model_dir)
core = ov.Core()

fp_vision_encoder_path = model_dir / "openvino_vision_encoder.xml"
int8_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8.xml")
# Path for the weight-compression-only variant of the encoder
int8_wc_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8_wc.xml")

# Prepare 100 calibration samples for post-training quantization
calibration_data = prepare_dataset_vision(processor, 100)
ov_model = core.read_model(fp_vision_encoder_path)
calibration_dataset = nncf.Dataset(calibration_data)

# Full INT8 post-training quantization with SmoothQuant for transformer models
quantized_model = nncf.quantize(
    model=ov_model,
    calibration_dataset=calibration_dataset,
    model_type=nncf.ModelType.TRANSFORMER,
    advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.6),
)
ov.save_model(quantized_model, int8_vision_encoder_path)

# Free the memory used by the calibration pass
del quantized_model
del ov_model
del calibration_dataset
del calibration_data
gc.collect()

vision_encoder_path = int8_vision_encoder_path
```
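To gauge the footprint reduction, the weight-file sizes of the FP and INT8 encoders can be compared (a sketch, assuming both models were saved as above):
```
# Compare weight-file sizes before and after INT8 quantization
for xml_path in (fp_vision_encoder_path, int8_vision_encoder_path):
    bin_path = xml_path.with_suffix(".bin")
    print(f"{bin_path.name}: {bin_path.stat().st_size / 2**20:.1f} MiB")
```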
## License
[Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE)
## Disclaimer
This quantized model comes with no warranty. It has been developed only for research purposes.