|
--- |
|
language: |
|
- en |
|
- de |
|
- fr |
|
- it |
|
- pt |
|
- hi |
|
- es |
|
- th |
|
license: llama3.2 |
|
library_name: transformers |
|
tags: |
|
- autoround |
|
- intel |
|
- gptq |
|
- woq |
|
- meta |
|
- pytorch |
|
- llama |
|
- llama-3 |
|
model_name: Llama 3.2 11B Vision Instruct |
|
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct |
|
inference: false |
|
model_creator: meta-llama |
|
pipeline_tag: text-generation |
|
prompt_template: '{prompt} |
|
' |
|
quantized_by: fbaldassarri |
|
--- |
|
|
|
## Model Information |
|
|
|
Converted version of [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) to the [OpenVINO](https://github.com/openvinotoolkit/openvino) Intermediate Representation (IR) format for inference on CPU devices.
|
|
|
The model consists of two parts:
|
|
|
- **Image Encoder** (openvino_vision_encoder.bin), which encodes input images into the LLM's cross-attention state space;

- **Language Model** (openvino_language_model.bin), which generates answers based on the cross-attention states provided by the Image Encoder and the input tokens.
|
|
|
Then, to reduce memory consumption, weight compression was applied using the [Neural Network Compression Framework (NNCF)](https://github.com/openvinotoolkit/nncf), which provides 4-bit/8-bit mixed weight quantization, a compression method primarily designed to optimize LLMs.
|
|
|
Note: the compressed language model can be found as llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.bin/.xml, produced with the following settings:
|
|
|
- 4 bits (INT4) |
|
- group size = 64 |
|
- Asymmetrical Quantization |
|
- method AWQ |
|
|
|
Finally, an INT8-quantized version of the Image Encoder alone can be found as openvino_vision_encoder_int8.bin/.xml.
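
As a minimal usage sketch (not part of the original conversion flow; file names are taken from the descriptions above), the two IR parts could be loaded for CPU inference with the OpenVINO runtime as follows. Note that full text generation additionally requires the tokenizer/processor and the cross-attention wiring between the two models, which is not shown here:

```
import openvino as ov

core = ov.Core()

# Image Encoder: INT8-quantized variant
vision_encoder = core.compile_model("openvino_vision_encoder_int8.xml", "CPU")

# Language Model: INT4 AWQ-compressed variant
language_model = core.compile_model(
    "llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.xml",
    "CPU",
)
```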
|
|
|
|
|
## Replication Recipe |
|
|
|
### Step 1 Install Requirements |
|
|
|
I suggest installing the requirements into a dedicated Python virtualenv or conda environment.
|
|
|
``` |
|
pip install -q "torch>=2.1" "torchvision" "Pillow" "tqdm" "datasets>=2.14.6" "gradio>=4.36" "nncf>=2.13.0" --extra-index-url https://download.pytorch.org/whl/cpu |
|
|
|
pip install -q "transformers>=4.45" --extra-index-url https://download.pytorch.org/whl/cpu |
|
|
|
pip install -Uq --pre "openvino>2024.4.0" --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly |
|
``` |
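
Optionally, you can verify the environment by printing the installed package versions:

```
python -c "import openvino as ov, nncf, transformers; print(ov.__version__, nncf.__version__, transformers.__version__)"
```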
|
|
|
### Step 2 Convert the model to OpenVINO Intermediate Representation (IR)
|
|
|
``` |
|
from pathlib import Path |
|
# ov_mllama_helper (and the ov_mllama_compression / data_preprocessing modules used below)
# is assumed to be available locally, e.g. the helper scripts shipped with the OpenVINO notebooks for this model
from ov_mllama_helper import convert_mllama
|
model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct" |
|
model_dir = Path(model_id.split("/")[-1]) / "OpenVino" |
|
convert_mllama(model_id, model_dir) |
|
``` |
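
After conversion, the model directory should contain the IR (.xml/.bin) pairs for both parts. A quick, optional check, assuming the paths used above:

```
from pathlib import Path

model_dir = Path("Llama-3.2-11B-Vision-Instruct") / "OpenVino"

# List the generated OpenVINO IR files
for xml_file in sorted(model_dir.glob("*.xml")):
    print(xml_file.name)
```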
|
|
|
### Step 3 INT4 Compression |
|
|
|
``` |
|
from ov_mllama_compression import compress |
|
from ov_mllama_compression import compression_widgets_helper |
|
compression_scenario, compress_args = compression_widgets_helper() |
|
compression_scenario |
|
compression_kwargs = {key: value.value for key, value in compress_args.items()} |
|
language_model_path = compress(model_dir, **compression_kwargs) |
|
``` |
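
For reference, the widget-driven `compress` helper above can be approximated by calling NNCF weight compression directly. The following is a minimal sketch whose parameters are inferred from the published file name (INT4 asymmetric, ratio 1.0, group size 64, max-activation-variance sensitivity, AWQ, scale estimation, all layers); the exact arguments used by the helper are an assumption:

```
import nncf
import openvino as ov

core = ov.Core()
ov_model = core.read_model(model_dir / "openvino_language_model.xml")

# calibration_dataset: an nncf.Dataset of calibration samples, required by
# AWQ and scale estimation (preparation not shown here)
compressed_model = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    ratio=1.0,
    group_size=64,
    sensitivity_metric=nncf.SensitivityMetric.MAX_ACTIVATION_VARIANCE,
    awq=True,
    scale_estimation=True,
    all_layers=True,
    dataset=calibration_dataset,
)
ov.save_model(
    compressed_model,
    model_dir / "llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.xml",
)
```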
|
|
|
### Step 4 INT8 Image Encoder Optimization
|
|
|
``` |
|
from ov_mllama_compression import vision_encoder_selection_widget |
|
# `device` is assumed to be a device-selection widget whose value is the target device string (e.g. "CPU")
vision_encoder_options = vision_encoder_selection_widget(device.value)
|
vision_encoder_options |
|
from transformers import AutoProcessor |
|
import nncf |
|
import openvino as ov |
|
import gc |
|
from data_preprocessing import prepare_dataset_vision |
|
processor = AutoProcessor.from_pretrained(model_dir) |
|
core = ov.Core() |
|
fp_vision_encoder_path = model_dir / "openvino_vision_encoder.xml" |
|
int8_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8.xml") |
|
int8_wc_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8_wc.xml") |
|
calibration_data = prepare_dataset_vision(processor, 100) |
|
ov_model = core.read_model(fp_vision_encoder_path) |
|
calibration_dataset = nncf.Dataset(calibration_data) |
|
quantized_model = nncf.quantize( |
|
model=ov_model, |
|
calibration_dataset=calibration_dataset, |
|
model_type=nncf.ModelType.TRANSFORMER, |
|
advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.6), |
|
) |
|
ov.save_model(quantized_model, int8_vision_encoder_path) |
|
del quantized_model |
|
del ov_model |
|
del calibration_dataset |
|
del calibration_data |
|
gc.collect() |
|
vision_encoder_path = int8_vision_encoder_path |
|
``` |
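
The `int8_wc_vision_encoder_path` defined above refers to a weight-only INT8 variant of the Image Encoder. A minimal sketch (an assumption, not part of the original recipe) of how it could be produced with NNCF weight compression instead of full static quantization:

```
# Weight-only INT8 compression of the Image Encoder (optional, assumed variant)
ov_model = core.read_model(fp_vision_encoder_path)
int8_wc_model = nncf.compress_weights(ov_model, mode=nncf.CompressWeightsMode.INT8_ASYM)
ov.save_model(int8_wc_model, int8_wc_vision_encoder_path)
```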
|
|
|
## License |
|
|
|
[Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE) |
|
|
|
## Disclaimer |
|
|
|
This quantized model comes with no warranty. It has been developed only for research purposes.
|
|
|
|