Commit 925ac1b (verified) · Parent: 299b1dd

fbaldassarri committed: README.md update

Files changed (1): README.md (+129 −3)

---
language:
- en
- de
- fr
- it
- pt
- hi
- es
- th
license: llama3.2
library_name: transformers
tags:
- autoround
- intel
- gptq
- woq
- meta
- pytorch
- llama
- llama-3
model_name: Llama 3.2 11B Vision Instruct
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
inference: false
model_creator: meta-llama
pipeline_tag: text-generation
prompt_template: '{prompt}
'
quantized_by: fbaldassarri
---

## Model Information

Converted version of [meta-llama/Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct) to the [OpenVINO](https://github.com/openvinotoolkit/openvino) Intermediate Representation (IR) for inference on CPU devices.

The model consists of two parts:

- **Image Encoder** (openvino_vision_encoder.bin), which encodes input images into the LLM cross-attention states space;
- **Language Model** (openvino_language_model.bin), which generates answers based on the cross-attention states provided by the Image Encoder and on the input tokens.

Then, to reduce memory consumption, weight-compression optimization was applied using the [Neural Network Compression Framework (NNCF)](https://github.com/openvinotoolkit/nncf), which provides 4-bit/8-bit mixed weight quantization as a compression method primarily designed to optimize LLMs.

Note: the compressed language model can be found as llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.bin/.xml, compressed with:

- 4 bits (INT4)
- group size = 64
- asymmetric quantization
- AWQ method

Finally, an INT8-quantized version of the Image Encoder alone can be found as openvino_vision_encoder_int8.bin/.xml.
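
As a quick check that the published IR files load on CPU, they can be read and compiled directly with the OpenVINO runtime. This is a minimal sketch, assuming the repository files have been downloaded into the current directory:

```python
from pathlib import Path

import openvino as ov

model_dir = Path(".")  # assumption: repo files were downloaded here
core = ov.Core()

# Compile the INT8 image encoder and the INT4-compressed language model for CPU
vision_encoder = core.compile_model(model_dir / "openvino_vision_encoder_int8.xml", "CPU")
language_model = core.compile_model(
    model_dir / "llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.xml", "CPU"
)

# Inspect the inputs each compiled model expects
print([model_input.any_name for model_input in vision_encoder.inputs])
print([model_input.any_name for model_input in language_model.inputs])
```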

## Replication Recipe

### Step 1: Install Requirements

I suggest installing the requirements in a dedicated Python virtualenv or conda environment.

```bash
pip install -q "torch>=2.1" "torchvision" "Pillow" "tqdm" "datasets>=2.14.6" "gradio>=4.36" "nncf>=2.13.0" --extra-index-url https://download.pytorch.org/whl/cpu
pip install -q "transformers>=4.45" --extra-index-url https://download.pytorch.org/whl/cpu
pip install -Uq --pre "openvino>2024.4.0" --extra-index-url https://storage.openvinotoolkit.org/simple/wheels/nightly
```
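
Before moving on, it is worth confirming that the environment picked up the expected versions; a minimal sketch:

```python
import nncf
import openvino as ov
import transformers

# openvino should report a build newer than 2024.4.0 for the nightly wheel above
print("openvino:", ov.get_version())
print("nncf:", nncf.__version__)
print("transformers:", transformers.__version__)
```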

### Step 2: Convert the Model to OpenVINO Intermediate Representation (IR)

Note: `ov_mllama_helper`, `ov_mllama_compression`, and `data_preprocessing` used below are helper scripts assumed to be available in the working directory.

```python
from pathlib import Path

from ov_mllama_helper import convert_mllama

model_id = "meta-llama/Llama-3.2-11B-Vision-Instruct"
model_dir = Path(model_id.split("/")[-1]) / "OpenVino"

# Download the original model and export both parts to OpenVINO IR
convert_mllama(model_id, model_dir)
```
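
After conversion, the two IR parts described above should appear in `model_dir`; a quick sanity check:

```python
# openvino_vision_encoder.xml and openvino_language_model.xml are expected
for xml_file in sorted(model_dir.glob("*.xml")):
    print(xml_file.name)
```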

### Step 3: INT4 Compression

```python
from ov_mllama_compression import compress, compression_widgets_helper

# Choose the compression scenario interactively (Jupyter widgets)
compression_scenario, compress_args = compression_widgets_helper()
compression_scenario

# Collect the selected widget values and run the weight compression
compression_kwargs = {key: value.value for key, value in compress_args.items()}
language_model_path = compress(model_dir, **compression_kwargs)
```
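
The `compress` helper above is interactive. For a non-interactive run, the settings encoded in the compressed model's filename map onto NNCF's weight-compression API roughly as follows; this is a sketch, assuming a text calibration set `calibration_data` is available (AWQ and scale estimation are data-aware), and the filename-to-parameter mapping is my reading of the naming convention:

```python
import nncf
import openvino as ov

core = ov.Core()
ov_model = core.read_model(model_dir / "openvino_language_model.xml")

# INT4 asymmetric, group size 64, ratio 1.0, AWQ + scale estimation on all
# layers, max-activation-variance sensitivity metric
compressed_model = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    group_size=64,
    ratio=1.0,
    awq=True,
    scale_estimation=True,
    all_layers=True,
    sensitivity_metric=nncf.SensitivityMetric.MAX_ACTIVATION_VARIANCE,
    dataset=nncf.Dataset(calibration_data),  # data-aware methods need samples
)
ov.save_model(
    compressed_model,
    model_dir / "llm_int4_asym_r10_gs64_max_activation_variance_awq_scale_all_layers.xml",
)
```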

### Step 4: INT8 Image Encoder Optimization

```python
import gc

import nncf
import openvino as ov
from transformers import AutoProcessor

from data_preprocessing import prepare_dataset_vision
from ov_mllama_compression import vision_encoder_selection_widget

# `device` is a device-selection widget (e.g. from the notebook utilities);
# its .value is the target device name such as "CPU"
vision_encoder_options = vision_encoder_selection_widget(device.value)
vision_encoder_options

processor = AutoProcessor.from_pretrained(model_dir)
core = ov.Core()

fp_vision_encoder_path = model_dir / "openvino_vision_encoder.xml"
int8_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8.xml")
int8_wc_vision_encoder_path = model_dir / fp_vision_encoder_path.name.replace(".xml", "_int8_wc.xml")

# Prepare 100 calibration samples and quantize the vision encoder to INT8
calibration_data = prepare_dataset_vision(processor, 100)
ov_model = core.read_model(fp_vision_encoder_path)
calibration_dataset = nncf.Dataset(calibration_data)
quantized_model = nncf.quantize(
    model=ov_model,
    calibration_dataset=calibration_dataset,
    model_type=nncf.ModelType.TRANSFORMER,
    advanced_parameters=nncf.AdvancedQuantizationParameters(smooth_quant_alpha=0.6),
)
ov.save_model(quantized_model, int8_vision_encoder_path)

# Free memory held by the calibration data and intermediate models
del quantized_model
del ov_model
del calibration_dataset
del calibration_data
gc.collect()

vision_encoder_path = int8_vision_encoder_path
```
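
As a final smoke test, the quantized encoder can be compiled for CPU and its inputs inspected; a minimal sketch:

```python
# Compile the INT8 encoder and print its input names and shapes
compiled_encoder = core.compile_model(vision_encoder_path, "CPU")
for model_input in compiled_encoder.inputs:
    print(model_input.any_name, model_input.partial_shape)
```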

## License

[Llama 3.2 Community License](https://github.com/meta-llama/llama-models/blob/main/models/llama3_2/LICENSE)

## Disclaimer

This quantized model comes with no warranty. It has been developed only for research purposes.