cicdatopea committed · commit 6d59500 · verified · 1 parent: ba0a182

Upload README.md with huggingface_hub

Files changed (1): README.md (+394, -0)

README.md (added):
<div align="center">

AutoRound
===========================
<h3> Advanced Quantization Algorithm for LLMs</h3>

[![python](https://img.shields.io/badge/python-3.9%2B-blue)](https://github.com/intel/auto-round)
[![version](https://img.shields.io/badge/release-0.4.5-green)](https://github.com/intel/auto-round)
[![license](https://img.shields.io/badge/license-Apache%202-9C27B0)](https://github.com/intel/auto-round/blob/main/LICENSE)
<a href="https://huggingface.co/OPEA">
<img alt="Model Checkpoints" src="https://img.shields.io/badge/%F0%9F%A4%97%20HF-Models-F57C00">
</a>
---
<div align="left">

AutoRound is an advanced quantization algorithm for low-bit LLM/VLM inference, tailored to a wide range of models. It adopts signed gradient descent to fine-tune the rounding values and min-max clipping values of weights in just 200 steps, which competes impressively with recent methods while introducing no additional inference overhead and keeping tuning costs low. The image below presents an overview of AutoRound. Check out our paper on [arXiv](https://arxiv.org/pdf/2309.05516) for more details, and find quantized models in several Hugging Face Spaces, e.g. [OPEA](https://huggingface.co/OPEA), [Kaitchup](https://huggingface.co/kaitchup) and [fbaldassarri](https://huggingface.co/fbaldassarri).
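
For intuition, the toy example below sketches the core idea in a few lines of PyTorch. It is an illustrative sketch only, not the library's actual implementation: it uses a per-tensor scale instead of per-group scales, omits the min-max tuning, and simply learns a rounding offset `v` with signed gradient descent so that the fake-quantized weights reproduce a layer's output.

```python
# Toy sketch of the idea above (not AutoRound's actual code): learn a rounding
# offset v in [-0.5, 0.5] with signed gradient descent to minimize the
# reconstruction error of a layer's output. Per-tensor scale for brevity;
# AutoRound uses per-group scales and also tunes the min/max clipping values.
import torch

def ste_round(t):
    # straight-through estimator: round in the forward pass, identity gradient
    return (torch.round(t) - t).detach() + t

def fake_quant(w, v, bits=4):
    qmax = 2 ** bits - 1
    scale = (w.max() - w.min()).clamp(min=1e-5) / qmax
    zp = torch.round(-w.min() / scale)
    q = torch.clamp(ste_round(w / scale + v) + zp, 0, qmax)
    return (q - zp) * scale

w = torch.randn(64, 64)          # layer weight
x = torch.randn(16, 64)          # calibration activations
v = torch.zeros_like(w, requires_grad=True)

lr = 1.0 / 200                   # matches the default lr = 1/iters
for _ in range(200):             # ~200 tuning steps
    loss = torch.nn.functional.mse_loss(x @ fake_quant(w, v).T, x @ w.T)
    loss.backward()
    with torch.no_grad():
        v -= lr * v.grad.sign()  # signed gradient descent update
        v.clamp_(-0.5, 0.5)      # keep the rounding offset within ±0.5
        v.grad.zero_()
```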

<div align="center">

![](docs/imgs/autoround_overview.png)

<div align="left">

## What's New

* [2025/01] We provide experimental support for the GGUF q4_0 and q4_1 formats.
* [2024/11] We provide experimental support for VLM quantization; please check out the [README](./auto_round/mllm/README.md).
* [2024/11] We share some tips and tricks for LLM & VLM quantization; please check out [this blog](https://medium.com/@NeuralCompressor/10-tips-for-quantizing-llms-and-vlms-with-autoround-923e733879a7).

## Installation

### Install from PyPI

```bash
# GPU
pip install auto-round[gpu]

# CPU
pip install auto-round[cpu]

# HPU
pip install auto-round-lib
```

<details>
<summary>Build from Source</summary>

```bash
# GPU
pip install .[gpu]

# CPU
pip install .[cpu]

# HPU
python setup.py install lib
```

</details>

## Model Quantization

### Basic Usage (Gaudi2/CPU/GPU)

A user guide detailing the full list of supported arguments is provided by calling `auto-round -h` in the terminal. Set the desired format(s) in `format`; exporting to multiple formats is supported. Please check out the [step-by-step instructions](./docs/step_by_step.md) for more details about the calibration dataset and evaluation.

```bash
auto-round \
--model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--format "auto_gptq,auto_awq,auto_round" \
--disable_eval \
--output_dir ./tmp_autoround
```

We provide two other recipes, one for best accuracy and one for fast tuning with low memory. Details are below.
<details>
<summary>Other Recipes</summary>

```bash
## best accuracy, 3X slower; low_gpu_mem_usage could save ~20GB but is ~30% slower
auto-round-best \
--model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--low_gpu_mem_usage \
--disable_eval
```

```bash
## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
auto-round-fast \
--model facebook/opt-125m \
--bits 4 \
--group_size 128 \
--disable_eval
```

</details>

### API Usage (Gaudi2/CPU/GPU)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

bits, group_size, sym = 4, 128, True
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, sym=sym)

## best accuracy, 3X slower; low_gpu_mem_usage could save ~20GB but is ~30% slower
# autoround = AutoRound(model, tokenizer, nsamples=512, iters=1000, low_gpu_mem_usage=True, bits=bits, group_size=group_size, sym=sym)

## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
# autoround = AutoRound(model, tokenizer, nsamples=128, iters=200, seqlen=512, batch_size=4, bits=bits, group_size=group_size, sym=sym)

autoround.quantize()
output_dir = "./tmp_autoround"
## format = 'auto_round' (default for versions > 0.3.0), 'auto_gptq', or 'auto_awq'
autoround.save_quantized(output_dir, format='auto_round', inplace=True)
```

<details>
<summary>Detailed Hyperparameters</summary>

- `model`: The PyTorch model to be quantized.

- `tokenizer`: An optional tokenizer for processing input data. If None, a dataset must be provided.

- `bits (int)`: Number of bits for quantization (default is 4).

- `group_size (int)`: Size of the quantization group (default is 128).

- `sym (bool)`: Whether to use symmetric quantization (default is True).

- `enable_quanted_input (bool)`: Whether to use the output of the previous quantized block as the input of the current block for tuning (default is True).

- `enable_minmax_tuning (bool)`: Whether to enable weight min-max tuning (default is True).

- `iters (int)`: Number of tuning iterations (default is 200).

- `lr (float)`: The learning rate for the rounding values (default is None; it will be set to 1.0/iters automatically).

- `minmax_lr (float)`: The learning rate for min-max tuning (default is None; it will be set to lr automatically).

- `nsamples (int)`: Number of samples for tuning (default is 128).

- `seqlen (int)`: Sequence length of the tuning data (default is 2048).

- `batch_size (int)`: Batch size for training (default is 8).

- `scale_dtype (str)`: The data type of the quantization scale (default is "float16"); different kernels offer different choices.

- `amp (bool)`: Whether to use automatic mixed precision (default is True).

- `nblocks (int)`: Number of blocks packed together for joint tuning (default is 1).

- `gradient_accumulate_steps (int)`: Number of gradient accumulation steps (default is 1).

- `low_gpu_mem_usage (bool)`: Whether to save GPU memory at the cost of ~20% more tuning time (default is False).

- `dataset Union[str, list, tuple, torch.utils.data.DataLoader]`: The dataset name for tuning (default is "NeelNanda/pile-10k"). Local JSON files and combinations of datasets are supported, e.g. "./tmp.json,NeelNanda/pile-10k:train, mbpp:train+validation+test".

- `layer_config (dict)`: Configuration for per-layer weight quantization (default is None), mainly for mixed bits or mixed precision; see the sketch after this list.

- `device`: The device to be used for tuning. The default is 'auto', allowing for automatic detection.

</details>
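
As a concrete illustration of `layer_config` and the combined `dataset` string described above, here is a hedged sketch that keeps one projection layer at 8 bits while quantizing the rest to 4 bits. The layer name and the per-layer keys (`bits`, `group_size`, `sym`) are assumptions for facebook/opt-125m and may need adjusting for your model and AutoRound version.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# hypothetical per-layer override: keep this projection at 8 bits
layer_config = {
    "model.decoder.layers.0.self_attn.k_proj": {"bits": 8, "group_size": 128, "sym": True},
}

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    sym=True,
    layer_config=layer_config,
    # a local JSON file combined with a Hugging Face dataset split
    dataset="./tmp.json,NeelNanda/pile-10k:train",
)
autoround.quantize()
autoround.save_quantized("./tmp_autoround_mixed", format="auto_round", inplace=True)
```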

### API Usage for VLMs

**This feature is experimental and may be subject to changes**, including potential bug fixes, API modifications, or adjustments to default hyperparameters.

By default, AutoRoundMLLM only quantizes the text module of VLMs and uses `NeelNanda/pile-10k` for calibration. To quantize the entire model, you can enable `quant_nontext_module` by setting it to True, though support for this feature is limited. For more information, please refer to the AutoRoundMLLM [readme](./auto_round/mllm/README.md).

```python
from auto_round import AutoRoundMLLM
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, AutoTokenizer

## load the model
model_name = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

## quantize the model
bits, group_size, sym = 4, 128, True
autoround = AutoRoundMLLM(model, tokenizer, processor,
                          bits=bits, group_size=group_size, sym=sym)
autoround.quantize()

## save the quantized model, set format='auto_gptq' or 'auto_awq' to use other formats
output_dir = "./tmp_autoround"
autoround.save_quantized(output_dir, format='auto_round', inplace=True)
```
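
To experiment with quantizing the non-text modules as well, the call above can be extended as in the hedged sketch below, which assumes `quant_nontext_module` is accepted directly as a constructor keyword and reuses the objects loaded in the example above; as noted, support for this path is limited.

```python
# Hedged sketch: also quantize the vision/non-text modules (limited support).
# Assumes quant_nontext_module is a constructor keyword of AutoRoundMLLM and
# reuses model, tokenizer, processor, bits, group_size, sym from the example above.
autoround = AutoRoundMLLM(model, tokenizer, processor,
                          bits=bits, group_size=group_size, sym=sym,
                          quant_nontext_module=True)
autoround.quantize()
```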

#### Export Formats

**AutoRound Format**: This format is well-suited for CPU and HPU devices, 2-bit quantization, and mixed-precision inference. **[2,4] bits are supported**. However, it has not yet gained widespread community adoption.

**AutoGPTQ Format**: This format is well-suited for symmetric quantization on CUDA devices and is widely adopted by the community; **[2,3,4,8] bits are supported**. However, **the asymmetric kernel has issues** that can cause considerable accuracy drops, particularly at 2 bits and with small models.

**AutoAWQ Format**: This format is well-suited for asymmetric 4-bit quantization on CUDA devices and is widely adopted within the community; **only 4-bit quantization is supported**.

**GGUF Format**: This format is well-suited for CPU devices and is widely adopted by the community; **only q4_0 and q4_1 (W4G32) are supported in our repo**.
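
Because `save_quantized` takes the target format as an argument, the same tuned model can be written out in several of the formats above. Below is a minimal sketch, reusing the `autoround` object from the API example and assuming `inplace=False` leaves the tuned weights untouched between exports.

```python
# Hedged sketch: export one tuned model to several formats after quantize().
# inplace=False is assumed to keep the tuned weights intact between exports.
for fmt in ("auto_round", "auto_gptq", "auto_awq"):
    autoround.save_quantized(f"./tmp_autoround_{fmt}", format=fmt, inplace=False)
```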

### Quantization Costs

Testing was conducted on an Nvidia A100 80GB GPU using the nightly version of PyTorch 2.6.0.dev20241029+cu124. Please note that data loading and packing costs have been excluded from the evaluation. **We enable torch.compile for Torch 2.6, but not for 2.5, due to issues we encountered.**

To optimize GPU memory usage, in addition to activating `low_gpu_mem_usage`, you can set `gradient_accumulate_steps=8` and `batch_size=1`, though this may increase tuning time.
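
In API terms, such a memory-constrained run could look like the hedged sketch below; the argument names follow the hyperparameter list above, and `model`/`tokenizer` come from the API example.

```python
# Hedged sketch of a memory-constrained tuning run (argument names follow the
# hyperparameter list above; model and tokenizer come from the API example).
autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    sym=True,
    low_gpu_mem_usage=True,        # trade ~20% more tuning time for lower GPU memory
    gradient_accumulate_steps=8,   # accumulate gradients instead of a large batch
    batch_size=1,
)
autoround.quantize()
```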

The 3B and 14B results were measured on Qwen 2.5 models, the 8X7B result on Mixtral, and the remaining results on LLaMA 3.1.

| Torch version / Config W4G128                                                              | 3B            | 8B             | 14B            | 70B             | 8X7B           |
|--------------------------------------------------------------------------------------------|---------------|----------------|----------------|-----------------|----------------|
| 2.6 with torch compile                                                                     | 7min<br/>10GB | 12min<br/>18GB | 23min<br/>22GB | 120min<br/>42GB | 28min<br/>46GB |
| 2.6 with torch compile <br/> low_gpu_mem_usage=True                                        | 12min<br/>6GB | 19min<br/>10GB | 33min<br/>11GB | 140min<br/>25GB | 38min<br/>36GB |
| 2.6 with torch compile <br/> low_gpu_mem_usage=True <br/> gradient_accumulate_steps=8,bs=1 | 15min<br/>3GB | 25min<br/>6GB  | 45min<br/>7GB  | 187min<br/>19GB | 75min<br/>36GB |
| 2.5 w/o torch compile                                                                      | 8min<br/>10GB | 16min<br/>20GB | 30min<br/>25GB | 140min<br/>49GB | 50min<br/>49GB |

## Model Inference

Please run the quantization code first.

### AutoRound format

**CPU**: `pip install intel-extension-for-pytorch` (much higher speed on Intel CPUs) or `pip install intel-extension-for-transformers`.

**HPU**: a docker image with the Gaudi Software Stack is recommended. More details can be found in the [Gaudi Guide](https://docs.habana.ai/en/latest/).

**CUDA**: no extra steps are needed for symmetric quantization; for asymmetric quantization, auto-round needs to be installed from source.

#### CPU/HPU/CUDA

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRoundConfig

backend = "auto"  ## cpu, hpu, cuda
quantization_config = AutoRoundConfig(
    backend=backend
)
quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path,
                                             device_map=backend.split(':')[0],
                                             quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```

<br>
<details>
<summary>Evaluation</summary>

```bash
auto-round --model saved_quantized_model \
--eval \
--task lambada_openai \
--eval_bs 1
```

</details>

### AutoGPTQ/AutoAWQ format

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path,
                                             device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```

## Support List

AutoRound supports essentially all major large language models.

Please note that an asterisk (*) indicates third-party quantized models, which may lack accuracy data and use a different recipe. We greatly appreciate their efforts and encourage more users to share their models, as we cannot release most of the models ourselves.

| Model | Supported |
|-------------------------------------------|-----------|
| nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Llama-3.1-Nemotron-70B-Instruct-HF-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Llama-3.1-Nemotron-70B-Instruct-HF-int4-sym-inc) |
| meta-llama/Llama-3.2-90B-Vision-Instruct | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Llama-3.2-90B-Vision-Instruct-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Llama-3.2-90B-Vision-Instruct-int4-sym-inc) |
| Qwen/QwQ-32B-Preview | [model-opea-int4-sym-autoround-mixed](https://huggingface.co/OPEA/QwQ-32B-Preview-int4-sym-mixed-inc), [model-opea-int4-sym-autoawq-mixed](https://huggingface.co/OPEA/QwQ-32B-Preview-int4-sym-mixed-awq-inc) |
| THUDM/cogvlm2-llama3-chat-19B | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/cogvlm2-llama3-chat-19B-int4-sym-inc) |
| Qwen/Qwen2-VL-Instruct | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Qwen2-VL-7B-Instruct-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Qwen2-VL-7B-Instruct-int4-sym-inc) |
| meta-llama/Llama-3.2-11B-Vision | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Llama-3.2-11B-Vision-Instruct-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Llama-3.2-11B-Vision-Instruct-int4-sym-inc) |
| microsoft/Phi-3.5-vision-instruct | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Phi-3.5-vision-instruct-int4-sym-inc), [model-opea-int4-sym-gptq](https://huggingface.co/OPEA/Phi-3.5-vision-instruct-int4-sym-inc) |
| liuhaotian/llava-v1.5-7b | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/llava-v1.5-7b-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/llava-v1.5-7b-int4-sym-inc) |
| Qwen/Qwen2.5-7B-Instruct | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Qwen2.5-7B-Instruct-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Qwen2.5-7B-Instruct-int4-sym-inc), [model-kaitchup-autogptq-int4*](https://huggingface.co/kaitchup/Qwen2.5-7B-Instruct-AutoRound-GPTQ-asym-4bit), [recipe](./docs/Qwen2.5-7B-Instruct-sym.md) |
| Qwen/Qwen2.5-14B-Instruct | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Qwen2.5-14B-Instruct-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Qwen2.5-14B-Instruct-int4-sym-inc) |
| Qwen/Qwen2.5-32B-Instruct | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Qwen2.5-32B-Instruct-int4-sym-inc) |
| Qwen/Qwen2.5-Coder-32B-Instruct | [model-kaitchup-autogptq-int4*](https://huggingface.co/kaitchup/Qwen2.5-Coder-32B-Instruct-AutoRound-GPTQ-4bit) |
| Qwen/Qwen2.5-72B-Instruct | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Qwen2.5-72B-Instruct-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Qwen2.5-72B-Instruct-int4-sym-inc), [model-kaitchup-autogptq-int4*](https://huggingface.co/kaitchup/Qwen2.5-72B-Instruct-AutoRound-GPTQ-4bit), [model-kaitchup-autogptq-int2*](https://huggingface.co/kaitchup/Qwen2.5-72B-Instruct-AutoRound-GPTQ-2bit), [recipe](./docs/Qwen2.5-72B-Instruct-sym.md) |
| meta-llama/Meta-Llama-3.1-70B-Instruct | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Meta-Llama-3.1-70B-Instruct-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Meta-Llama-3.1-70B-Instruct-int4-sym-inc), [model-opea-int4-asym-autoround](https://huggingface.co/OPEA/Meta-Llama-3.1-70B-Instruct-int4-asym-inc) |
| meta-llama/Meta-Llama-3.1-8B-Instruct | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Meta-Llama-3.1-8B-Instruct-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Meta-Llama-3.1-8B-Instruct-int4-sym-inc), [model-kaitchup-autogptq-int4*](https://huggingface.co/kaitchup/Meta-Llama-3.1-8B-Instruct-autoround-gptq-4bit-asym), [model-kaitchup-autogptq-sym-int4*](https://huggingface.co/kaitchup/Meta-Llama-3.1-8B-Instruct-autoround-gptq-4bit-sym), [recipe](https://huggingface.co/Intel/Meta-Llama-3.1-8B-Instruct-int4-inc) |
| meta-llama/Meta-Llama-3.1-8B | [model-kaitchup-autogptq-sym-int4*](https://huggingface.co/kaitchup/Meta-Llama-3.1-8B-autoround-gptq-4bit-sym) |
| Qwen/Qwen2-7B | [model-autoround-sym-int4](https://huggingface.co/Intel/Qwen2-7B-int4-inc), [model-autogptq-sym-int4](https://huggingface.co/Intel/Qwen2-7B-int4-inc) |
| THUDM/glm-4-9b-chat | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/glm-4-9b-chat-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/glm-4-9b-chat-int4-sym-inc) |
| Qwen/Qwen2-57B-A14B-Instruct | [model-autoround-sym-int4](https://huggingface.co/Intel/Qwen2-57B-A14B-Instruct-int4-inc), [model-autogptq-sym-int4](https://huggingface.co/Intel/Qwen2-57B-A14B-Instruct-int4-inc) |
| 01-ai/Yi-1.5-9B | [model-LnL-AI-autogptq-int4*](https://huggingface.co/LnL-AI/Yi-1.5-9B-4bit-gptq-autoround) |
| 01-ai/Yi-1.5-9B-Chat | [model-LnL-AI-autogptq-int4*](https://huggingface.co/LnL-AI/Yi-1.5-9B-Chat-4bit-gptq-autoround) |
| Intel/neural-chat-7b-v3-3 | [model-autogptq-int4](https://huggingface.co/Intel/neural-chat-7b-v3-3-int4-inc) |
| Intel/neural-chat-7b-v3-1 | [model-autogptq-int4](https://huggingface.co/Intel/neural-chat-7b-v3-1-int4-inc) |
| TinyLlama-1.1B-intermediate | [model-LnL-AI-autogptq-int4*](https://huggingface.co/LnL-AI/TinyLlama-1.1B-intermediate-step-1341k-3T-autoround-lm_head-symFalse) |
| mistralai/Mistral-7B-v0.1 | [model-autogptq-lmhead-int4](https://huggingface.co/Intel/Mistral-7B-v0.1-int4-inc-lmhead), [model-autogptq-int4](https://huggingface.co/Intel/Mistral-7B-v0.1-int4-inc) |
| google/gemma-2b | [model-autogptq-int4](https://huggingface.co/Intel/gemma-2b-int4-inc) |
| tiiuae/falcon-7b | [model-autogptq-int4-G64](https://huggingface.co/Intel/falcon-7b-int4-inc) |
| sapienzanlp/modello-italia-9b | [model-fbaldassarri-autogptq-int4*](https://huggingface.co/fbaldassarri/modello-italia-9b-autoround-w4g128-cpu) |
| microsoft/phi-2 | [model-autoround-sym-int4](https://huggingface.co/Intel/phi-2-int4-inc), [model-autogptq-sym-int4](https://huggingface.co/Intel/phi-2-int4-inc) |
| microsoft/Phi-3.5-mini-instruct | [model-kaitchup-autogptq-sym-int4*](https://huggingface.co/kaitchup/Phi-3.5-Mini-instruct-AutoRound-4bit) |
| mistralai/Mistral-7B-Instruct-v0.2 | [outdated-recipe](./docs/Mistral-7B-Instruct-v0.2-asym-recipe.md) |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | [outdated-recipe](./docs/Mixtral-8x7B-Instruct-v0.1-asym-recipe.md) |
| mistralai/Mixtral-8x7B-v0.1 | [outdated-recipe](./docs/Mixtral-8x7B-v0.1-asym-acc.md) |
| meta-llama/Meta-Llama-3-8B-Instruct | [outdated-recipe](./docs/Meta-Llama-3-8B-Instruct-asym-recipe.md) |
| google/gemma-7b | [outdated-recipe](./docs/gemma-7b-asym-recipe.md) |
| meta-llama/Llama-2-7b-chat-hf | [outdated-recipe](./docs/Llama-2-7b-chat-hf-asym-recipe.md) |
| baichuan-inc/Baichuan2-7B-Chat | [outdated-recipe](./docs/baichuan2-7b-cha-asym-recipe.md) |
| 01-ai/Yi-6B-Chat | [outdated-recipe](./docs/Yi-6B-Chat-asym-recipe.md) |
| facebook/opt-2.7b | [outdated-recipe](./docs/opt-2.7b-asym-recipe.md) |
| bigscience/bloom-3b | [outdated-recipe](./docs/bloom-3B-asym-recipe.md) |
| EleutherAI/gpt-j-6b | [outdated-recipe](./docs/gpt-j-6B-asym-recipe.md) |

## Integration

AutoRound has been integrated into multiple repositories.

[Intel Neural Compressor](https://github.com/intel/neural-compressor)

[ModelCloud/GPTQModel](https://github.com/ModelCloud/GPTQModel)

[pytorch/ao](https://github.com/pytorch/ao)

## Reference

If you find AutoRound useful for your research, please cite our paper:

```bibtex
@article{cheng2023optimize,
  title={Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}
```