<div align="center">

AutoRound
===========================
<h3> Advanced Quantization Algorithm for LLMs</h3>

[![python](https://img.shields.io/badge/python-3.9%2B-blue)](https://github.com/intel/auto-round)
[![version](https://img.shields.io/badge/release-0.4.5-green)](https://github.com/intel/auto-round)
[![license](https://img.shields.io/badge/license-Apache%202-9C27B0)](https://github.com/intel/auto-round/blob/main/LICENSE)
<a href="https://huggingface.co/OPEA">
<img alt="Model Checkpoints" src="https://img.shields.io/badge/%F0%9F%A4%97%20HF-Models-F57C00">
</a>
---
<div align="left">

AutoRound is an advanced quantization algorithm for low-bit LLM/VLM inference, tailored for a wide range of models. AutoRound adopts signed gradient descent to fine-tune the rounding values and min-max clipping values of weights in just 200 steps. It competes impressively against recent methods without introducing any additional inference overhead while keeping the tuning cost low. The image below presents an overview of AutoRound. Check out our paper on [arXiv](https://arxiv.org/pdf/2309.05516) for more details, and find quantized models on Hugging Face, e.g. [OPEA](https://huggingface.co/OPEA), [Kaitchup](https://huggingface.co/kaitchup) and [fbaldassarri](https://huggingface.co/fbaldassarri).
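
For intuition, the tuning can be sketched roughly as follows (notation ours; see the paper for the exact formulation). For each weight matrix $W$, AutoRound learns a per-weight rounding offset $v \in [-0.5, 0.5]$ and min-max clipping values that determine the scale $s$:

$$
\widetilde{W} = s \cdot \mathrm{clip}\left(\left\lfloor \frac{W}{s} \right\rceil + v,\ n,\ m\right)
$$

where $[n, m]$ is the integer range of the chosen bit width and $\lfloor\cdot\rceil$ denotes rounding to the nearest integer. Both $v$ and the min-max values are updated with signed gradient descent to minimize the block-wise output reconstruction error.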

<div align="center">

![](docs/imgs/autoround_overview.png)

<div align="left">

## What's New

* [2025/01] We provide experimental support for the GGUF q4_0 and q4_1 formats.
* [2024/11] We provide experimental support for VLM quantization; please check out the [README](./auto_round/mllm/README.md).
* [2024/11] We provide some tips and tricks for LLM & VLM quantization; please check out [this blog](https://medium.com/@NeuralCompressor/10-tips-for-quantizing-llms-and-vlms-with-autoround-923e733879a7).

## Installation

### Install from PyPI

```bash
# GPU
pip install auto-round[gpu]

# CPU
pip install auto-round[cpu]

# HPU
pip install auto-round-lib
```

<details>
<summary>Build from Source</summary>

```bash
# GPU
pip install .[gpu]

# CPU
pip install .[cpu]

# HPU
python setup.py install lib
```

</details>

## Model Quantization

### Basic Usage (Gaudi2/CPU/GPU)

A user guide detailing the full list of supported arguments is provided by calling `auto-round -h` in the terminal. Set the desired export format(s) via `--format`; exporting to multiple formats at once is supported. Please check out the [step-by-step instructions](./docs/step_by_step.md) for more details about the calibration dataset and evaluation.

```bash
auto-round \
    --model facebook/opt-125m \
    --bits 4 \
    --group_size 128 \
    --format "auto_gptq,auto_awq,auto_round" \
    --disable_eval \
    --output_dir ./tmp_autoround
```

We provide two additional recipes: one for the best accuracy and one for fast tuning with low memory usage. Details are below.
<details>
<summary>Other Recipes</summary>

```bash
## best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
auto-round-best \
    --model facebook/opt-125m \
    --bits 4 \
    --group_size 128 \
    --low_gpu_mem_usage \
    --disable_eval
```

```bash
## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
auto-round-fast \
    --model facebook/opt-125m \
    --bits 4 \
    --group_size 128 \
    --disable_eval
```

</details>

### API Usage (Gaudi2/CPU/GPU)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

from auto_round import AutoRound

bits, group_size, sym = 4, 128, True
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, sym=sym)

## the best accuracy, 3X slower, low_gpu_mem_usage could save ~20G but ~30% slower
# autoround = AutoRound(model, tokenizer, nsamples=512, iters=1000, low_gpu_mem_usage=True, bits=bits, group_size=group_size, sym=sym)

## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
# autoround = AutoRound(model, tokenizer, nsamples=128, iters=200, seqlen=512, batch_size=4, bits=bits, group_size=group_size, sym=sym)

autoround.quantize()
output_dir = "./tmp_autoround"
## format = 'auto_round' (default in versions > 0.3.0), 'auto_gptq', or 'auto_awq'
autoround.save_quantized(output_dir, format='auto_round', inplace=True)
```

<details>
<summary>Detailed Hyperparameters</summary>

- `model`: The PyTorch model to be quantized.

- `tokenizer`: An optional tokenizer for processing input data. If None, a dataset must be provided.

- `bits (int)`: Number of bits for quantization (default is 4).

- `group_size (int)`: Size of the quantization group (default is 128).

- `sym (bool)`: Whether to use symmetric quantization (default is True).

- `enable_quanted_input (bool)`: Whether to use the output of the previous quantized block as the input of the current block for tuning (default is True).

- `enable_minmax_tuning (bool)`: Whether to enable weight min-max tuning (default is True).

- `iters (int)`: Number of tuning iterations (default is 200).

- `lr (float)`: The learning rate for the rounding values (default is None; it will be set to 1.0/iters automatically).

- `minmax_lr (float)`: The learning rate for min-max tuning (default is None; it will be set to lr automatically).

- `nsamples (int)`: Number of samples for tuning (default is 128).

- `seqlen (int)`: Sequence length of the calibration data used for tuning (default is 2048).

- `batch_size (int)`: Batch size for training (default is 8).

- `scale_dtype (str)`: The data type of the quantization scale (default is "float16"); different kernels offer different choices.

- `amp (bool)`: Whether to use automatic mixed precision (default is True).

- `nblocks (int)`: Number of blocks packed together for tuning (default is 1).

- `gradient_accumulate_steps (int)`: Number of gradient accumulation steps (default is 1).

- `low_gpu_mem_usage (bool)`: Whether to save GPU memory at the cost of ~20% more tuning time (default is False).

- `dataset (str | list | tuple | torch.utils.data.DataLoader)`: The dataset name for tuning (default is "NeelNanda/pile-10k"). Local JSON files and combinations of datasets are supported, e.g. "./tmp.json,NeelNanda/pile-10k:train,mbpp:train+validation+test".

- `layer_config (dict)`: Per-layer configuration for weight quantization (default is None), mainly for mixed bits or mixed precision; see the sketch after this list.

- `device`: The device to be used for tuning. The default is 'auto', allowing for automatic detection.

</details>
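
For illustration, below is a minimal sketch of passing `layer_config` for mixed-bit quantization together with a combined `dataset` string. The layer name is just an example module path for `facebook/opt-125m`, and the exact per-layer keys should be double-checked against the [step-by-step instructions](./docs/step_by_step.md).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

## example layer_config: keep one sensitive layer at 8 bits while the rest of
## the model uses the global 4-bit / group_size=128 setting
layer_config = {
    "model.decoder.layers.0.self_attn.k_proj": {"bits": 8, "group_size": 128},
}

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    sym=True,
    layer_config=layer_config,
    ## combine a local json file with a Hugging Face dataset split
    dataset="./tmp.json,NeelNanda/pile-10k:train",
)
autoround.quantize()
autoround.save_quantized("./tmp_autoround", format="auto_round", inplace=True)
```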

### API Usage for VLMs

**This feature is experimental and may be subject to change**, including potential bug fixes, API modifications, or adjustments to default hyperparameters.

By default, AutoRoundMLLM only quantizes the text module of VLMs and uses `NeelNanda/pile-10k` for calibration. To quantize the entire model, you can enable `quant_nontext_module` by setting it to True (see the snippet after the example below), though support for this feature is limited. For more information, please refer to the AutoRoundMLLM [readme](./auto_round/mllm/README.md).

```python
from auto_round import AutoRoundMLLM
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor, AutoTokenizer

## load the model
model_name = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

## quantize the model
bits, group_size, sym = 4, 128, True
autoround = AutoRoundMLLM(model, tokenizer, processor,
                          bits=bits, group_size=group_size, sym=sym)
autoround.quantize()

# save the quantized model, set format='auto_gptq' or 'auto_awq' to use other formats
output_dir = "./tmp_autoround"
autoround.save_quantized(output_dir, format='auto_round', inplace=True)
```
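
To also quantize the non-text (e.g. vision) modules mentioned above, the constructor call would roughly become the following, reusing the objects defined in the example; keep in mind this path is experimental and support is limited:

```python
## experimental: quantize the non-text modules as well (limited support)
autoround = AutoRoundMLLM(model, tokenizer, processor,
                          quant_nontext_module=True,
                          bits=bits, group_size=group_size, sym=sym)
```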

#### Export Formats

**AutoRound Format**: This format is well-suited for CPU and HPU devices, 2-bit quantization, and mixed-precision inference. **[2,4] bits are supported**. However, it has not yet gained widespread community adoption.

**AutoGPTQ Format**: This format is well-suited for symmetric quantization on CUDA devices and is widely adopted by the community. **[2,3,4,8] bits are supported**. However, **the asymmetric kernel has issues** that can cause considerable accuracy drops, particularly at 2-bit quantization and with small models.

**AutoAWQ Format**: This format is well-suited for asymmetric 4-bit quantization on CUDA devices and is widely adopted within the community. **Only 4-bit quantization is supported**.

**GGUF Format**: This format is well-suited for CPU devices and is widely adopted by the community. **Only q4_0 and q4_1 (W4G32) are supported in our repo**.

### Quantization Costs

Testing was conducted on an Nvidia A100 80G using the nightly version of PyTorch 2.6.0.dev20241029+cu124. Please note that data loading and packing costs have been excluded from the evaluation. **We enable torch.compile for Torch 2.6, but not for 2.5, due to issues we encountered.**

To optimize GPU memory usage, in addition to activating `low_gpu_mem_usage`, you can set `gradient_accumulate_steps=8` and `batch_size=1`, though this may increase tuning time.

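As a minimal sketch, such a memory-saving configuration (mirroring the third row of the table below, and reusing `model` and `tokenizer` from the API usage example above) would look like this:

```python
## memory-saving configuration: slower, but fits in considerably less GPU memory
autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    sym=True,
    low_gpu_mem_usage=True,
    gradient_accumulate_steps=8,
    batch_size=1,
)
```
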
247
- The 3B and 14B models were evaluated on Qwen 2.5, the 8X7B model is Mixtral, while the remaining models utilized LLaMA
248
- 3.1.
249
-
250
- | Torch version/Config W4G128 | 3B | 8B | 14B | 70B | 8X7B |
251
- |---------------------------------------------------------------------------------------------|---------------|----------------|----------------|-----------------|----------------|
252
- | 2.6 with torch compile | 7min<br/>10GB | 12min<br/>18GB | 23min<br/>22GB | 120min<br/>42GB | 28min<br/>46GB |
253
- | 2.6 with torch compile <br/> low_gpu_mem_usage=True | 12min<br/>6GB | 19min<br/>10GB | 33min<br/>11GB | 140min<br/>25GB | 38min<br/>36GB |
254
- | 2.6 with torch compile <br/> low_gpu_mem_usage=True <br/> gradient_accumulate_steps=8,bs=1 | 15min<br/>3GB | 25min<br/>6GB | 45min<br/>7GB | 187min<br/>19GB | 75min<br/>36GB |
255
- | 2.5 w/o torch compile | 8min<br/>10GB | 16min<br/>20GB | 30min<br/>25GB | 140min<br/>49GB | 50min<br/>49GB |
256
-
## Model Inference

Please run the quantization code first.

### AutoRound format

**CPU**: `pip install intel-extension-for-pytorch` (much higher speed on Intel CPUs) or `pip install intel-extension-for-transformers`.

**HPU**: a docker image with the Gaudi Software Stack is recommended. More details can be found in the [Gaudi Guide](https://docs.habana.ai/en/latest/).

**CUDA**: no extra steps are needed for symmetric quantization; for asymmetric quantization, auto-round must be installed from source.

#### CPU/HPU/CUDA

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRoundConfig

backend = "auto"  ## or "cpu", "hpu", "cuda"
quantization_config = AutoRoundConfig(
    backend=backend
)
quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path,
                                             device_map=backend.split(':')[0],
                                             quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```

<br>
<details>
<summary>Evaluation</summary>

```bash
auto-round --model saved_quantized_model \
    --eval \
    --task lambada_openai \
    --eval_bs 1
```

</details>

### AutoGPTQ/AutoAWQ format

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path,
                                             device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```

## Support List

AutoRound supports essentially all major large language models.

Please note that an asterisk (*) indicates third-party quantized models, which may lack accuracy data and may use a different recipe. We greatly appreciate their efforts and encourage more users to share their models, as we cannot release most of the models ourselves.

| Model                                      | Supported                                                                                                                              |
|--------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------|
| nvidia/Llama-3.1-Nemotron-70B-Instruct-HF  | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Llama-3.1-Nemotron-70B-Instruct-HF-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Llama-3.1-Nemotron-70B-Instruct-HF-int4-sym-inc) |
| meta-llama/Llama-3.2-90B-Vision-Instruct   | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Llama-3.2-90B-Vision-Instruct-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Llama-3.2-90B-Vision-Instruct-int4-sym-inc) |
| Qwen/QwQ-32B-Preview                       | [model-opea-int4-sym-autoround-mixed](https://huggingface.co/OPEA/QwQ-32B-Preview-int4-sym-mixed-inc), [model-opea-int4-sym-autoawq-mixed](https://huggingface.co/OPEA/QwQ-32B-Preview-int4-sym-mixed-awq-inc) |
| THUDM/cogvlm2-llama3-chat-19B              | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/cogvlm2-llama3-chat-19B-int4-sym-inc) |
| Qwen/Qwen2-VL-Instruct                     | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Qwen2-VL-7B-Instruct-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Qwen2-VL-7B-Instruct-int4-sym-inc) |
| meta-llama/Llama-3.2-11B-Vision            | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Llama-3.2-11B-Vision-Instruct-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Llama-3.2-11B-Vision-Instruct-int4-sym-inc) |
| microsoft/Phi-3.5-vision-instruct          | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Phi-3.5-vision-instruct-int4-sym-inc), [model-opea-int4-sym-gptq](https://huggingface.co/OPEA/Phi-3.5-vision-instruct-int4-sym-inc) |
| liuhaotian/llava-v1.5-7b                   | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/llava-v1.5-7b-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/llava-v1.5-7b-int4-sym-inc) |
| Qwen/Qwen2.5-7B-Instruct                   | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Qwen2.5-7B-Instruct-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Qwen2.5-7B-Instruct-int4-sym-inc), [model-kaitchup-autogptq-int4*](https://huggingface.co/kaitchup/Qwen2.5-7B-Instruct-AutoRound-GPTQ-asym-4bit), [recipe](./docs/Qwen2.5-7B-Instruct-sym.md) |
| Qwen/Qwen2.5-14B-Instruct                  | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Qwen2.5-14B-Instruct-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Qwen2.5-14B-Instruct-int4-sym-inc) |
| Qwen/Qwen2.5-32B-Instruct                  | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Qwen2.5-32B-Instruct-int4-sym-inc) |
| Qwen/Qwen2.5-Coder-32B-Instruct            | [model-kaitchup-autogptq-int4*](https://huggingface.co/kaitchup/Qwen2.5-Coder-32B-Instruct-AutoRound-GPTQ-4bit) |
| Qwen/Qwen2.5-72B-Instruct                  | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Qwen2.5-72B-Instruct-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Qwen2.5-72B-Instruct-int4-sym-inc), [model-kaitchup-autogptq-int4*](https://huggingface.co/kaitchup/Qwen2.5-72B-Instruct-AutoRound-GPTQ-4bit), [model-kaitchup-autogptq-int2*](https://huggingface.co/kaitchup/Qwen2.5-72B-Instruct-AutoRound-GPTQ-2bit), [recipe](./docs/Qwen2.5-72B-Instruct-sym.md) |
| meta-llama/Meta-Llama-3.1-70B-Instruct     | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Meta-Llama-3.1-70B-Instruct-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Meta-Llama-3.1-70B-Instruct-int4-sym-inc), [model-opea-int4-asym-autoround](https://huggingface.co/OPEA/Meta-Llama-3.1-70B-Instruct-int4-asym-inc) |
| meta-llama/Meta-Llama-3.1-8B-Instruct      | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Meta-Llama-3.1-8B-Instruct-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Meta-Llama-3.1-8B-Instruct-int4-sym-inc), [model-kaitchup-autogptq-int4*](https://huggingface.co/kaitchup/Meta-Llama-3.1-8B-Instruct-autoround-gptq-4bit-asym), [model-kaitchup-autogptq-sym-int4*](https://huggingface.co/kaitchup/Meta-Llama-3.1-8B-Instruct-autoround-gptq-4bit-sym), [recipe](https://huggingface.co/Intel/Meta-Llama-3.1-8B-Instruct-int4-inc) |
| meta-llama/Meta-Llama-3.1-8B               | [model-kaitchup-autogptq-sym-int4*](https://huggingface.co/kaitchup/Meta-Llama-3.1-8B-autoround-gptq-4bit-sym) |
| Qwen/Qwen2-7B                              | [model-autoround-sym-int4](https://huggingface.co/Intel/Qwen2-7B-int4-inc), [model-autogptq-sym-int4](https://huggingface.co/Intel/Qwen2-7B-int4-inc) |
| THUDM/glm-4-9b-chat                        | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/glm-4-9b-chat-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/glm-4-9b-chat-int4-sym-inc) |
| Qwen/Qwen2-57B-A14B-Instruct               | [model-autoround-sym-int4](https://huggingface.co/Intel/Qwen2-57B-A14B-Instruct-int4-inc), [model-autogptq-sym-int4](https://huggingface.co/Intel/Qwen2-57B-A14B-Instruct-int4-inc) |
| 01-ai/Yi-1.5-9B                            | [model-LnL-AI-autogptq-int4*](https://huggingface.co/LnL-AI/Yi-1.5-9B-4bit-gptq-autoround) |
| 01-ai/Yi-1.5-9B-Chat                       | [model-LnL-AI-autogptq-int4*](https://huggingface.co/LnL-AI/Yi-1.5-9B-Chat-4bit-gptq-autoround) |
| Intel/neural-chat-7b-v3-3                  | [model-autogptq-int4](https://huggingface.co/Intel/neural-chat-7b-v3-3-int4-inc) |
| Intel/neural-chat-7b-v3-1                  | [model-autogptq-int4](https://huggingface.co/Intel/neural-chat-7b-v3-1-int4-inc) |
| TinyLlama-1.1B-intermediate                | [model-LnL-AI-autogptq-int4*](https://huggingface.co/LnL-AI/TinyLlama-1.1B-intermediate-step-1341k-3T-autoround-lm_head-symFalse) |
| mistralai/Mistral-7B-v0.1                  | [model-autogptq-lmhead-int4](https://huggingface.co/Intel/Mistral-7B-v0.1-int4-inc-lmhead), [model-autogptq-int4](https://huggingface.co/Intel/Mistral-7B-v0.1-int4-inc) |
| google/gemma-2b                            | [model-autogptq-int4](https://huggingface.co/Intel/gemma-2b-int4-inc) |
| tiiuae/falcon-7b                           | [model-autogptq-int4-G64](https://huggingface.co/Intel/falcon-7b-int4-inc) |
| sapienzanlp/modello-italia-9b              | [model-fbaldassarri-autogptq-int4*](https://huggingface.co/fbaldassarri/modello-italia-9b-autoround-w4g128-cpu) |
| microsoft/phi-2                            | [model-autoround-sym-int4](https://huggingface.co/Intel/phi-2-int4-inc), [model-autogptq-sym-int4](https://huggingface.co/Intel/phi-2-int4-inc) |
| microsoft/Phi-3.5-mini-instruct            | [model-kaitchup-autogptq-sym-int4*](https://huggingface.co/kaitchup/Phi-3.5-Mini-instruct-AutoRound-4bit) |
| mistralai/Mistral-7B-Instruct-v0.2         | [outdated-recipe](./docs/Mistral-7B-Instruct-v0.2-asym-recipe.md) |
| mistralai/Mixtral-8x7B-Instruct-v0.1       | [outdated-recipe](./docs/Mixtral-8x7B-Instruct-v0.1-asym-recipe.md) |
| mistralai/Mixtral-8x7B-v0.1                | [outdated-recipe](./docs/Mixtral-8x7B-v0.1-asym-acc.md) |
| meta-llama/Meta-Llama-3-8B-Instruct        | [outdated-recipe](./docs/Meta-Llama-3-8B-Instruct-asym-recipe.md) |
| google/gemma-7b                            | [outdated-recipe](./docs/gemma-7b-asym-recipe.md) |
| meta-llama/Llama-2-7b-chat-hf              | [outdated-recipe](./docs/Llama-2-7b-chat-hf-asym-recipe.md) |
| baichuan-inc/Baichuan2-7B-Chat             | [outdated-recipe](./docs/baichuan2-7b-cha-asym-recipe.md) |
| 01-ai/Yi-6B-Chat                           | [outdated-recipe](./docs/Yi-6B-Chat-asym-recipe.md) |
| facebook/opt-2.7b                          | [outdated-recipe](./docs/opt-2.7b-asym-recipe.md) |
| bigscience/bloom-3b                        | [outdated-recipe](./docs/bloom-3B-asym-recipe.md) |
| EleutherAI/gpt-j-6b                        | [outdated-recipe](./docs/gpt-j-6B-asym-recipe.md) |

## Integration

AutoRound has been integrated into multiple repositories.

[Intel Neural Compressor](https://github.com/intel/neural-compressor)

[ModelCloud/GPTQModel](https://github.com/ModelCloud/GPTQModel)

[pytorch/ao](https://github.com/pytorch/ao)

## Reference

If you find AutoRound useful for your research, please cite our paper:

```bibtex
@article{cheng2023optimize,
  title={Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}
```