---
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
license: llama3.2
library_name: transformers
tags:
- autoround
- intel
- gptq
- woq
- meta
- pytorch
- llama
- llama-3
model_name: Llama 3.2 11B Vision Instruct
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
inference: false
model_creator: meta-llama
pipeline_tag: text-generation
prompt_template: '{prompt}
  '
quantized_by: fbaldassarri
---

## Model Information

Converted version of [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) to [OpenVINO](https://github.com/openvinotoolkit/openvino) Intermediate Representation (IR) for inference on CPU devices.

The model consists of two parts:

- **Image Encoder** (openvino_vision_encoder.bin), which encodes input images into the LLM's cross-attention state space;
- **Language Model** (openvino_language_model.bin), which generates an answer based on the cross-attention states provided by the Image Encoder and the input tokens.

Then, to reduce memory consumption, weight-compression optimization was applied using the [Neural Network Compression Framework (NNCF)](https://github.com/openvinotoolkit/nncf), which provides 4-bit/8-bit mixed weight quantization as a compression method primarily designed to optimize LLMs.

Note: the compressed language model can be found as llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.bin/.xml, produced with the following settings (the sketch after this list shows how they map onto NNCF's API):

- 4 bits (INT4)
- group size = 64
- asymmetric quantization
- AWQ method
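
For reference, a minimal sketch of how these settings correspond to NNCF's weight-compression API. This is not the exact script used to produce this model: `calibration_data` is a placeholder you must fill with real model inputs, and the `r10` token in the file name presumably denotes ratio = 1.0:

```python
import nncf
import openvino as ov

core = ov.Core()
# Load the language model IR produced by the conversion step (Step 2 below).
ov_model = core.read_model("openvino_language_model.xml")

calibration_data = []  # placeholder: supply real model inputs here

# INT4 asymmetric weights, group size 64, AWQ + scale estimation on all layers,
# ranked by the max-activation-variance sensitivity metric.
compressed_model = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    group_size=64,
    ratio=1.0,
    all_layers=True,
    awq=True,
    scale_estimation=True,
    sensitivity_metric=nncf.SensitivityMetric.MAX_ACTIVATION_VARIANCE,
    dataset=nncf.Dataset(calibration_data),
)
ov.save_model(compressed_model, "llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.xml")
```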

Finally, an INT8-quantized version of the Image Encoder alone can be found as openvino_vision_encoder_int8.bin/.xml.
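
As a quick smoke test that the downloaded artifacts load, both IR files can be read and compiled directly with the OpenVINO runtime (the paths below are assumptions; running the full multimodal pipeline additionally requires the helper class used in the replication recipe):

```python
import openvino as ov

core = ov.Core()
# Adjust the paths to wherever the IR files were downloaded.
vision_encoder = core.compile_model("openvino_vision_encoder_int8.xml", "CPU")
language_model = core.compile_model("llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.xml", "CPU")
print(len(vision_encoder.inputs), len(language_model.inputs))
```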

## Replication Recipe

### Step 1 Install Requirements

I suggest installing the requirements into a dedicated Python virtualenv or conda environment. The ov_mllama_helper, ov_mllama_compression, and data_preprocessing modules imported in the steps below are the helper scripts that ship with OpenVINO's Llama-3.2-Vision notebook (openvino_notebooks repository).

```
pip install -q "torch>=2.1" "torchvision" "Pillow" "tqdm" "datasets>=2.14.6" "gradio>=4.36" "nncf>=2.13.0" --extra-index-url https://download.pytorch.org/whl/cpu

pip install -q "transformers>=4.45" --extra-index-url https://download.pytorch.org/whl/cpu

pip install -Uq --pre "openvino>2024.4.0" --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
```
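
To confirm the environment is usable, a quick optional version check:

```python
import nncf
import openvino as ov
import transformers

# Versions should satisfy the pins from the pip commands above.
print(ov.__version__, nncf.__version__, transformers.__version__)
```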

### Step 2 Convert the Model to OpenVINO Intermediate Representation (IR)

```python
from pathlib import Path

from ov_mllama_helper import convert_mllama

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model_dir = Path(model_id.split("/")[-1]) / "OpenVino"

# Export both model parts (Image Encoder and Language Model) to OpenVINO IR.
convert_mllama(model_id, model_dir)
```
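
A quick optional sanity check, assuming the conversion emits the file names listed in Model Information above:

```python
# Both IR parts should now exist under model_dir.
for name in ["openvino_vision_encoder.xml", "openvino_language_model.xml"]:
    assert (model_dir / name).exists(), f"missing {name}"
```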

### Step 3 INT4 Compression

```python
from ov_mllama_compression import compress, compression_widgets_helper

# Notebook-oriented widgets for picking the compression scenario.
compression_scenario, compress_args = compression_widgets_helper()
compression_scenario

# Collect the selected widget values and run INT4 weight compression,
# which returns the path of the compressed language model.
compression_kwargs = {key: value.value for key, value in compress_args.items()}
language_model_path = compress(model_dir, **compression_kwargs)
```

### Step 4 INT8 Image Encoder Optimization

```python
import gc

import nncf
import openvino as ov
from transformers import AutoProcessor

from data_preprocessing import prepare_dataset_vision
from ov_mllama_compression import vision_encoder_selection_widget

# Notebook-oriented widget; `device` is the device-selection widget created
# earlier in the notebook this recipe is based on.
vision_encoder_options = vision_encoder_selection_widget(device.value)
vision_encoder_options

processor = AutoProcessor.from_pretrained(model_dir)
core = ov.Core()

fp_vision_encoder_path = model_dir / "openvino_vision_encoder.xml"
int8_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8.xml")
int8_wc_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8_wc.xml")

# Build a 100-sample calibration set and run post-training INT8 quantization.
calibration_data = prepare_dataset_vision(processor, 100)
ov_model = core.read_model(fp_vision_encoder_path)
calibration_dataset = nncf.Dataset(calibration_data)
quantized_model = nncf.quantize(
    model=ov_model,
    calibration_dataset=calibration_dataset,
    model_type=nncf.ModelType.TRANSFORMER,
    advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.6),
)
ov.save_model(quantized_model, int8_vision_encoder_path)

# Release memory before further steps.
del quantized_model
del ov_model
del calibration_dataset
del calibration_data
gc.collect()

vision_encoder_path = int8_vision_encoder_path
```
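
For completeness, a hypothetical end-to-end inference sketch. The `OVMLlamaForConditionalGeneration` class name and its constructor arguments are assumptions about the `ov_mllama_helper` interface, not a documented API, and `example.jpg` is a placeholder:

```python
# Hypothetical usage sketch: class name and constructor arguments are
# assumptions about ov_mllama_helper, not a documented API.
from PIL import Image
from transformers import AutoProcessor

from ov_mllama_helper import OVMLlamaForConditionalGeneration

processor = AutoProcessor.from_pretrained(model_dir)
model = OVMLlamaForConditionalGeneration(
    model_dir,
    device="CPU",
    language_model_name=language_model_path.name,  # INT4 model from Step 3
    image_encoder_name=vision_encoder_path.name,   # INT8 encoder from Step 4
)

image = Image.open("example.jpg")  # placeholder image
messages = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "Describe this image."}]},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```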

## License

[Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE)

## Disclaimer

This quantized model comes with no warranty. It has been developed only for research purposes.