---
library_name: pytorch
license: apache-2.0
pipeline_tag: text-generation
tags:
- llm
- generative_ai
- quantized
- android

---

![](https://qaihub-public-assets.s3.us-west-2.amazonaws.com/qai-hub-models/models/mistral_7b_instruct_v0_3_quantized/web-assets/model_demo.png)

# Mistral-7B-Instruct-v0_3: Optimized for Mobile Deployment
## State-of-the-art large language model useful on a variety of language understanding and generation tasks

The Mistral-7B-Instruct-v0.3 Large Language Model (LLM) is an instruct fine-tuned version of the Mistral-7B-v0.3.

This model is an implementation of Mistral-7B-Instruct-v0_3 found [here](https://github.com/mistralai/mistral-inference).
This repository provides scripts to run Mistral-7B-Instruct-v0_3 on Qualcomm® devices.
More details on model performance across various devices can be found
[here](https://aihub.qualcomm.com/models/mistral_7b_instruct_v0_3_quantized).

### Model Details

- **Model Type:** Text generation
- **Model Stats:**
  - Number of parameters: 7.3B
  - Precision: w8a16
  - Number of key-value heads: 8
  - Model splits: the Prompt Processor and Token Generator are each split into 4 parts, and each corresponding Prompt Processor and Token Generator part shares weights.
  - Max context length: 4096
  - Prompt processor model size: 4.17 GB
  - Prompt processor input: 128 tokens + KVCache initialized with pad token
  - Prompt processor output: 128 output tokens + KVCache for token generator
  - Token generator model size: 4.17 GB
  - Token generator input: 1 input token + past KVCache
  - Token generator output: 1 output token + KVCache for next iteration
  - Decoding length: 4096
  - Use: Initiate the conversation with the prompt processor, then use the token generator for subsequent iterations (see the sketch below).
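
The prompt-processor / token-generator split mirrors the usual two-stage LLM
inference loop. Below is a minimal, illustrative sketch of that flow, assuming
hypothetical `prompt_processor` and `token_generator` callables that stand in
for the four compiled parts of each stage; it is not the packaged API.

```python
# Illustrative two-stage generation loop; names, shapes, and eos_id are hypothetical.
def generate(prompt_tokens, prompt_processor, token_generator,
             max_new_tokens=4096, eos_id=2):
    # Stage 1: the prompt processor consumes the (padded) 128-token prompt
    # window and returns output tokens plus an initialized KV cache.
    out_tokens, kv_cache = prompt_processor(prompt_tokens)
    generated = list(out_tokens)

    # Stage 2: the token generator consumes one token at a time, threading
    # the KV cache from each iteration into the next.
    next_token = generated[-1]
    for _ in range(max_new_tokens):
        next_token, kv_cache = token_generator(next_token, kv_cache)
        if next_token == eos_id:
            break
        generated.append(next_token)
    return generated
```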

| Model | Device | Chipset | Target Runtime | Response Rate (Tokens/Second) | Time To First Token Range (Seconds) | Tiny MMLU | Model Download |
|---|---|---|---|---|---|---|---|
| Mistral-7B-Instruct-v0_3 | Snapdragon 8 Elite QRD | Snapdragon® 8 Elite | QNN | 10.73 | 0.18 - 5.79 | 58.85% | Use Export Script |

## Deploying Mistral-7B-Instruct-v0.3 on-device
Please follow [this tutorial](https://github.com/quic/ai-hub-apps/tree/main/tutorials/llama)
to compile QNN binaries and generate bundle assets to run [ChatApp on Windows](https://github.com/quic/ai-hub-apps/tree/main/apps/windows/cpp/ChatApp) and on Android powered by QNN-Genie.

## Installation

This model can be installed as a Python package via pip.

```bash
pip install qai-hub-models
```

## Configure Qualcomm® AI Hub to run this model on a cloud-hosted device

Sign in to [Qualcomm® AI Hub](https://app.aihub.qualcomm.com/) with your
Qualcomm® ID. Once signed in, navigate to `Account -> Settings -> API Token`.

With this API token, you can configure your client to run models on
cloud-hosted devices.
```bash
qai-hub configure --api_token API_TOKEN
```
Navigate to [docs](https://app.aihub.qualcomm.com/docs/) for more information.
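
To confirm the token is configured correctly, you can list the available
cloud-hosted devices from Python. A minimal sketch using the `qai_hub` client:

```python
import qai_hub as hub

# If the API token is configured, this prints a few of the cloud-hosted
# devices available for profiling and inference jobs.
for device in hub.get_devices()[:5]:
    print(device.name)
```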

## Demo on-device

The package contains a simple end-to-end demo that downloads pre-trained
weights and runs this model on a sample input.

```bash
python -m qai_hub_models.models.mistral_7b_instruct_v0_3_quantized.demo
```

The above demo runs a reference implementation of pre-processing, model
inference, and post-processing.

**NOTE**: To run this in a Jupyter Notebook or a Google Colab-like
environment, add the following to your cell instead.
```
%run -m qai_hub_models.models.mistral_7b_instruct_v0_3_quantized.demo
```

### Run model on a cloud-hosted device

In addition to the demo, you can also run the model on a cloud-hosted Qualcomm®
device. This script does the following:
* Runs a performance check on a cloud-hosted device.
* Downloads compiled assets that can be deployed on-device for Android.
* Checks accuracy between PyTorch and on-device outputs.

```bash
python -m qai_hub_models.models.mistral_7b_instruct_v0_3_quantized.export
```
```
Profiling Results
------------------------------------------------------------

Device : Snapdragon 8 Elite QRD (15)
Runtime : QNN
Response Rate (Tokens/Second): 10.73
Time to First Token (Seconds): 0.18 - 5.79
```

## How does this work?

This [export script](https://aihub.qualcomm.com/models/mistral_7b_instruct_v0_3_quantized/qai_hub_models/models/Mistral-7B-Instruct-v0_3/export.py)
leverages [Qualcomm® AI Hub](https://aihub.qualcomm.com/) to optimize, validate, and deploy this model
on-device. Let's go through each step below in detail:

Step 1: **Upload compiled model**

Upload the compiled models from `qai_hub_models.models.mistral_7b_instruct_v0_3_quantized` to AI Hub.
```python
import qai_hub as hub
from qai_hub_models.models.mistral_7b_instruct_v0_3_quantized import Model

# Load the pre-compiled model and upload each of its eight parts
# (four prompt-processor parts and four token-generator parts).
model = Model.from_precompiled()

model_promptprocessor_part1 = hub.upload_model(model.prompt_processor_part1.get_target_model_path())
model_promptprocessor_part2 = hub.upload_model(model.prompt_processor_part2.get_target_model_path())
model_promptprocessor_part3 = hub.upload_model(model.prompt_processor_part3.get_target_model_path())
model_promptprocessor_part4 = hub.upload_model(model.prompt_processor_part4.get_target_model_path())
model_tokengenerator_part1 = hub.upload_model(model.token_generator_part1.get_target_model_path())
model_tokengenerator_part2 = hub.upload_model(model.token_generator_part2.get_target_model_path())
model_tokengenerator_part3 = hub.upload_model(model.token_generator_part3.get_target_model_path())
model_tokengenerator_part4 = hub.upload_model(model.token_generator_part4.get_target_model_path())
```

Step 2: **Performance profiling on cloud-hosted device**

After uploading the compiled models in step 1, they can be profiled on-device.
Note that this script runs the models on a device automatically
provisioned in the cloud. Once a job is submitted, you can navigate to the
provided job URL to view a variety of on-device performance metrics.
```python
# Submit a profile job for each model part on the same cloud-hosted device.
device = hub.Device("Samsung Galaxy S23")
profile_job_promptprocessor_part1 = hub.submit_profile_job(
    model=model_promptprocessor_part1,
    device=device,
)
profile_job_promptprocessor_part2 = hub.submit_profile_job(
    model=model_promptprocessor_part2,
    device=device,
)
profile_job_promptprocessor_part3 = hub.submit_profile_job(
    model=model_promptprocessor_part3,
    device=device,
)
profile_job_promptprocessor_part4 = hub.submit_profile_job(
    model=model_promptprocessor_part4,
    device=device,
)
profile_job_tokengenerator_part1 = hub.submit_profile_job(
    model=model_tokengenerator_part1,
    device=device,
)
profile_job_tokengenerator_part2 = hub.submit_profile_job(
    model=model_tokengenerator_part2,
    device=device,
)
profile_job_tokengenerator_part3 = hub.submit_profile_job(
    model=model_tokengenerator_part3,
    device=device,
)
profile_job_tokengenerator_part4 = hub.submit_profile_job(
    model=model_tokengenerator_part4,
    device=device,
)
```
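
After the jobs complete, the profiling data can be downloaded for offline
inspection. A minimal sketch for one part (the exact layout of the returned
dictionary may vary with the `qai-hub` version):

```python
# Block until the first prompt-processor job finishes, then fetch its
# profiling data as a plain Python dictionary.
profile_job_promptprocessor_part1.wait()
profile = profile_job_promptprocessor_part1.download_profile()
print(profile.keys())
```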

Step 3: **Verify on-device accuracy**

To verify the accuracy of the model on-device, you can run on-device inference
on sample input data on the same cloud-hosted device.
```python
# Run inference on sample inputs for each model part and download the
# on-device outputs for comparison against the PyTorch reference.
input_data_promptprocessor_part1 = model.prompt_processor_part1.sample_inputs()
inference_job_promptprocessor_part1 = hub.submit_inference_job(
    model=model_promptprocessor_part1,
    device=device,
    inputs=input_data_promptprocessor_part1,
)
on_device_output_promptprocessor_part1 = inference_job_promptprocessor_part1.download_output_data()

input_data_promptprocessor_part2 = model.prompt_processor_part2.sample_inputs()
inference_job_promptprocessor_part2 = hub.submit_inference_job(
    model=model_promptprocessor_part2,
    device=device,
    inputs=input_data_promptprocessor_part2,
)
on_device_output_promptprocessor_part2 = inference_job_promptprocessor_part2.download_output_data()

input_data_promptprocessor_part3 = model.prompt_processor_part3.sample_inputs()
inference_job_promptprocessor_part3 = hub.submit_inference_job(
    model=model_promptprocessor_part3,
    device=device,
    inputs=input_data_promptprocessor_part3,
)
on_device_output_promptprocessor_part3 = inference_job_promptprocessor_part3.download_output_data()

input_data_promptprocessor_part4 = model.prompt_processor_part4.sample_inputs()
inference_job_promptprocessor_part4 = hub.submit_inference_job(
    model=model_promptprocessor_part4,
    device=device,
    inputs=input_data_promptprocessor_part4,
)
on_device_output_promptprocessor_part4 = inference_job_promptprocessor_part4.download_output_data()

input_data_tokengenerator_part1 = model.token_generator_part1.sample_inputs()
inference_job_tokengenerator_part1 = hub.submit_inference_job(
    model=model_tokengenerator_part1,
    device=device,
    inputs=input_data_tokengenerator_part1,
)
on_device_output_tokengenerator_part1 = inference_job_tokengenerator_part1.download_output_data()

input_data_tokengenerator_part2 = model.token_generator_part2.sample_inputs()
inference_job_tokengenerator_part2 = hub.submit_inference_job(
    model=model_tokengenerator_part2,
    device=device,
    inputs=input_data_tokengenerator_part2,
)
on_device_output_tokengenerator_part2 = inference_job_tokengenerator_part2.download_output_data()

input_data_tokengenerator_part3 = model.token_generator_part3.sample_inputs()
inference_job_tokengenerator_part3 = hub.submit_inference_job(
    model=model_tokengenerator_part3,
    device=device,
    inputs=input_data_tokengenerator_part3,
)
on_device_output_tokengenerator_part3 = inference_job_tokengenerator_part3.download_output_data()

input_data_tokengenerator_part4 = model.token_generator_part4.sample_inputs()
inference_job_tokengenerator_part4 = hub.submit_inference_job(
    model=model_tokengenerator_part4,
    device=device,
    inputs=input_data_tokengenerator_part4,
)
on_device_output_tokengenerator_part4 = inference_job_tokengenerator_part4.download_output_data()
```
With the model output, you can compute metrics such as PSNR or relative error,
or spot-check it against the expected output (see the sketch below).
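
As a concrete example, here is a minimal PSNR check. `reference_output` stands
for the corresponding output computed locally with PyTorch, and the output
name `"output_0"` is illustrative:

```python
import numpy as np

def psnr(reference: np.ndarray, observed: np.ndarray, eps: float = 1e-10) -> float:
    """Peak signal-to-noise ratio in dB; higher means closer agreement."""
    mse = float(np.mean((reference - observed) ** 2))
    peak = float(np.max(np.abs(reference)))
    return 20.0 * np.log10(peak / (np.sqrt(mse) + eps))

# `download_output_data()` returns a dict mapping output names to lists of
# arrays; the name "output_0" and `reference_output` are placeholders.
# observed = on_device_output_promptprocessor_part1["output_0"][0]
# print(psnr(reference_output, observed))
```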

**Note**: This on-device profiling and inference requires access to Qualcomm®
AI Hub. [Sign up for access](https://myaccount.qualcomm.com/signup).


## Deploying compiled model to Android

The models can be deployed using multiple runtimes:
- TensorFlow Lite (`.tflite` export): [This
tutorial](https://www.tensorflow.org/lite/android/quickstart) provides a
guide to deploy the `.tflite` model in an Android application.

- QNN (`.so` / `.bin` export): This [sample
app](https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-50/sample_app.html)
provides instructions on how to use the `.so` shared library or `.bin` context binary in an Android application.

## View on Qualcomm® AI Hub
Get more details on Mistral-7B-Instruct-v0_3's performance across various devices [here](https://aihub.qualcomm.com/models/mistral_7b_instruct_v0_3_quantized).
Explore all available models on [Qualcomm® AI Hub](https://aihub.qualcomm.com/).


## License
* The license for the original implementation of Mistral-7B-Instruct-v0_3 can be found [here](https://github.com/mistralai/mistral-inference/blob/main/LICENSE).
* The license for the compiled assets for on-device deployment can be found [here](https://github.com/mistralai/mistral-inference/blob/main/LICENSE).


## References
* [Mistral 7B](https://arxiv.org/abs/2310.06825)
* [Source Model Implementation](https://github.com/mistralai/mistral-inference)

## Community
* Join [our AI Hub Slack community](https://aihub.qualcomm.com/community/slack) to collaborate, post questions, and learn more about on-device AI.
* For questions or feedback, please [reach out to us](mailto:[email protected]).

## Usage and Limitations

This model may not be used for or in connection with any of the following applications:

- Accessing essential private and public services and benefits;
- Administration of justice and democratic processes;
- Assessing or recognizing the emotional state of a person;
- Biometric and biometrics-based systems, including categorization of persons based on sensitive characteristics;
- Education and vocational training;
- Employment and workers management;
- Exploitation of the vulnerabilities of persons resulting in harmful behavior;
- General purpose social scoring;
- Law enforcement;
- Management and operation of critical infrastructure;
- Migration, asylum and border control management;
- Predictive policing;
- Real-time remote biometric identification in public spaces;
- Recommender systems of social media platforms;
- Scraping of facial images (from the internet or otherwise); and/or
- Subliminal manipulation.