<div align="center">

AutoRound
===========================
<h3>Advanced Quantization Algorithm for LLMs</h3>

[License](https://github.com/intel/auto-round/blob/main/LICENSE)
<a href="https://huggingface.co/OPEA">
<img alt="Model Checkpoints" src="https://img.shields.io/badge/%F0%9F%A4%97%20HF-Models-F57C00">
</a>

---
<div align="left">

AutoRound is an advanced quantization algorithm for low-bit LLM/VLM inference, tailored for a wide range of models. It adopts sign gradient descent to fine-tune the rounding values and min-max clipping values of weights in just 200 steps, competing impressively with recent methods while introducing no additional inference overhead and keeping tuning costs low. The image below presents an overview of AutoRound. Check out our paper on [arXiv](https://arxiv.org/pdf/2309.05516) for more details, and find quantized models on Hugging Face, e.g. under [OPEA](https://huggingface.co/OPEA), [Kaitchup](https://huggingface.co/kaitchup) and [fbaldassarri](https://huggingface.co/fbaldassarri).

<div align="center">

*(AutoRound overview figure)*

<div align="left">

## What's New

* [2025/01] We provide experimental support for the GGUF q4_0 and q4_1 formats.
* [2024/11] We provide experimental support for VLM quantization; please check out the [README](./auto_round/mllm/README.md).
* [2024/11] We share some tips and tricks for LLM & VLM quantization; please check out [this blog](https://medium.com/@NeuralCompressor/10-tips-for-quantizing-llms-and-vlms-with-autoround-923e733879a7).

## Installation

### Install from pypi

```bash
# GPU
pip install auto-round[gpu]

# CPU
pip install auto-round[cpu]

# HPU
pip install auto-round-lib
```

<details>
<summary>Build from Source</summary>

```bash
# GPU
pip install .[gpu]

# CPU
pip install .[cpu]

# HPU
python setup.py install lib
```

</details>
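
To confirm the package installed correctly, a quick sanity check such as the one below can help. It assumes the `auto_round` package exposes a `__version__` attribute, which may differ across releases.

```python
# Minimal installation check: import the package and print its version.
import auto_round

print(auto_round.__version__)
```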

## Model Quantization

### Basic Usage (Gaudi2/CPU/GPU)

A user guide detailing the full list of supported arguments is available by running `auto-round -h` in the terminal. Set the desired format(s) via `format`; exporting to multiple formats at once is supported. Please check out the [step-by-step instructions](./docs/step_by_step.md) for more details about the calibration dataset and evaluation.

```bash
auto-round \
    --model facebook/opt-125m \
    --bits 4 \
    --group_size 128 \
    --format "auto_gptq,auto_awq,auto_round" \
    --disable_eval \
    --output_dir ./tmp_autoround
```

We also provide two recipes, one for best accuracy and one for fast tuning with low memory; see the details below.
<details>
<summary>Other Recipes</summary>

```bash
## best accuracy, 3X slower; low_gpu_mem_usage saves ~20GB of VRAM but is ~30% slower
auto-round-best \
    --model facebook/opt-125m \
    --bits 4 \
    --group_size 128 \
    --low_gpu_mem_usage \
    --disable_eval
```

```bash
## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
auto-round-fast \
    --model facebook/opt-125m \
    --bits 4 \
    --group_size 128 \
    --disable_eval
```

</details>

### API Usage (Gaudi2/CPU/GPU)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

bits, group_size, sym = 4, 128, True
autoround = AutoRound(model, tokenizer, bits=bits, group_size=group_size, sym=sym)

## best accuracy, 3X slower; low_gpu_mem_usage saves ~20GB of VRAM but is ~30% slower
# autoround = AutoRound(model, tokenizer, nsamples=512, iters=1000, low_gpu_mem_usage=True, bits=bits, group_size=group_size, sym=sym)

## fast and low memory, 2-3X speedup, slight accuracy drop at W4G128
# autoround = AutoRound(model, tokenizer, nsamples=128, iters=200, seqlen=512, batch_size=4, bits=bits, group_size=group_size, sym=sym)

autoround.quantize()
output_dir = "./tmp_autoround"
## format: 'auto_round' (default in versions > 0.3.0), 'auto_gptq', 'auto_awq'
autoround.save_quantized(output_dir, format='auto_round', inplace=True)
```

<details>
<summary>Detailed Hyperparameters</summary>

- `model`: The PyTorch model to be quantized.

- `tokenizer`: An optional tokenizer for processing input data. If none is provided, a dataset must be supplied.

- `bits (int)`: Number of bits for quantization (default is 4).

- `group_size (int)`: Size of the quantization group (default is 128).

- `sym (bool)`: Whether to use symmetric quantization (default is True).

- `enable_quanted_input (bool)`: Whether to use the output of the previous quantized block as the input of the current block for tuning (default is True).

- `enable_minmax_tuning (bool)`: Whether to enable weight min-max tuning (default is True).

- `iters (int)`: Number of tuning iterations (default is 200).

- `lr (float)`: The learning rate for rounding values (default is None; it is set to 1.0/iters automatically).

- `minmax_lr (float)`: The learning rate for min-max tuning (default is None; it is set to lr automatically).

- `nsamples (int)`: Number of samples for tuning (default is 128).

- `seqlen (int)`: Sequence length of the tuning data (default is 2048).

- `batch_size (int)`: Batch size for tuning (default is 8).

- `scale_dtype (str)`: The data type of the quantization scale (default is "float16"); different kernels support different choices.

- `amp (bool)`: Whether to use automatic mixed precision (default is True).

- `nblocks (int)`: Number of blocks packed together for tuning (default is 1).

- `gradient_accumulate_steps (int)`: Number of gradient accumulation steps (default is 1).

- `low_gpu_mem_usage (bool)`: Whether to save GPU memory at the cost of ~20% more tuning time (default is False).

- `dataset (str, list, tuple, or torch.utils.data.DataLoader)`: The dataset for tuning (default is "NeelNanda/pile-10k"). Local JSON files and combinations of datasets are supported, e.g. "./tmp.json,NeelNanda/pile-10k:train,mbpp:train+validation+test".

- `layer_config (dict)`: Per-layer quantization configuration (default is None), mainly for mixed bits or mixed precision. See the sketch after this block for an example.

- `device`: The device to be used for tuning. The default is "auto", allowing for automatic detection.

</details>
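
The snippet below is a minimal sketch of how a mixed-precision run might be configured with `layer_config` and a combined `dataset` string. The per-layer keys shown ("bits", "group_size") and the `lm_head` override are assumptions based on the parameter descriptions above; the exact schema and layer names may differ between releases, so please verify against the version you have installed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from auto_round import AutoRound

model_name = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Hypothetical per-layer overrides: keep the LM head at 8 bits while the rest uses the global 4-bit setting.
layer_config = {
    "lm_head": {"bits": 8, "group_size": 32},
}

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    sym=True,
    layer_config=layer_config,
    # Combine a local JSON file with a Hugging Face dataset split, as described above.
    dataset="./tmp.json,NeelNanda/pile-10k:train",
)
autoround.quantize()
autoround.save_quantized("./tmp_autoround_mixed", format="auto_round", inplace=True)
```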

### API Usage for VLMs

**This feature is experimental and may be subject to change**, including potential bug fixes, API modifications, or adjustments to the default hyperparameters.

By default, AutoRoundMLLM only quantizes the text module of VLMs and uses `NeelNanda/pile-10k` for calibration. To quantize the entire model, you can enable `quant_nontext_module` by setting it to True, though support for this feature is limited. For more information, please refer to the AutoRoundMLLM [README](./auto_round/mllm/README.md).

```python
from transformers import AutoProcessor, AutoTokenizer, Qwen2VLForConditionalGeneration

from auto_round import AutoRoundMLLM

## load the model
model_name = "Qwen/Qwen2-VL-2B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name, trust_remote_code=True)

## quantize the model
bits, group_size, sym = 4, 128, True
autoround = AutoRoundMLLM(model, tokenizer, processor, bits=bits, group_size=group_size, sym=sym)
autoround.quantize()

## save the quantized model; set format='auto_gptq' or 'auto_awq' to use other formats
output_dir = "./tmp_autoround"
autoround.save_quantized(output_dir, format='auto_round', inplace=True)
```
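
If you do want to quantize the non-text (vision) modules as well, a sketch along the lines below should work, assuming `quant_nontext_module` is accepted as a keyword argument of `AutoRoundMLLM` as described above. Support is limited, so expect some layers or models to fail.

```python
# Hedged example: also quantize the non-text modules, not just the text module.
autoround = AutoRoundMLLM(
    model,
    tokenizer,
    processor,
    bits=4,
    group_size=128,
    sym=True,
    quant_nontext_module=True,  # experimental, limited support
)
autoround.quantize()
autoround.save_quantized("./tmp_autoround_full", format="auto_round", inplace=True)
```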

#### Export Formats

**AutoRound format**: This format is well-suited for CPU and HPU devices, 2-bit precision, and mixed-precision inference. **[2,4] bits are supported**. However, it has not yet gained widespread community adoption.

**AutoGPTQ format**: This format is well-suited for symmetric quantization on CUDA devices and is widely adopted by the community. **[2,3,4,8] bits are supported**. However, **the asymmetric kernel has issues** that can cause considerable accuracy drops, particularly at 2-bit quantization and with small models.

**AutoAWQ format**: This format is well-suited for asymmetric 4-bit quantization on CUDA devices and is widely adopted within the community. **Only 4-bit quantization is supported**.

**GGUF format**: This format is well-suited for CPU devices and is widely adopted by the community. **Only q4_0 and q4_1 (W4G32) are supported in our repo**.
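
Since the tuned rounding values are independent of how the weights are packed, the same `AutoRound` object can be exported to more than one of these formats. The sketch below assumes that `inplace=False` leaves the in-memory model untouched so a second export is possible; if your version behaves differently, export to your preferred format only once.

```python
# Hedged example: after autoround.quantize() has run, export the same tuned model to two formats.
autoround.save_quantized("./tmp_autoround", format="auto_round", inplace=False)
autoround.save_quantized("./tmp_autogptq", format="auto_gptq", inplace=True)
```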

### Quantization Costs

Testing was conducted on an Nvidia A100 80G using the nightly version of PyTorch 2.6.0.dev20241029+cu124. Please note that data loading and packing costs have been excluded from the evaluation. **We enable torch.compile for Torch 2.6, but not for 2.5, due to issues we encountered.**

To optimize GPU memory usage, in addition to activating `low_gpu_mem_usage`, you can set `gradient_accumulate_steps=8` and `batch_size=1`, though this may increase tuning time.

The 3B and 14B models were evaluated on Qwen 2.5, the 8X7B model is Mixtral, and the remaining models use LLaMA 3.1. Each cell reports tuning time and GPU memory usage.

| Torch version / Config (W4G128)                                                             | 3B            | 8B             | 14B            | 70B             | 8X7B           |
|---------------------------------------------------------------------------------------------|---------------|----------------|----------------|-----------------|----------------|
| 2.6 with torch compile                                                                       | 7min<br/>10GB | 12min<br/>18GB | 23min<br/>22GB | 120min<br/>42GB | 28min<br/>46GB |
| 2.6 with torch compile <br/> low_gpu_mem_usage=True                                          | 12min<br/>6GB | 19min<br/>10GB | 33min<br/>11GB | 140min<br/>25GB | 38min<br/>36GB |
| 2.6 with torch compile <br/> low_gpu_mem_usage=True <br/> gradient_accumulate_steps=8,bs=1   | 15min<br/>3GB | 25min<br/>6GB  | 45min<br/>7GB  | 187min<br/>19GB | 75min<br/>36GB |
| 2.5 w/o torch compile                                                                        | 8min<br/>10GB | 16min<br/>20GB | 30min<br/>25GB | 140min<br/>49GB | 50min<br/>49GB |
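
For reference, the most memory-frugal configuration in the table above corresponds roughly to the API call below. This is a sketch using only the documented arguments; defaults may shift between versions.

```python
# Low-memory tuning setup: trade extra tuning time for a smaller GPU footprint.
autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    sym=True,
    low_gpu_mem_usage=True,
    gradient_accumulate_steps=8,
    batch_size=1,
)
```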

## Model Inference

Please run the quantization code first.

### AutoRound format

**CPU**: `pip install intel-extension-for-pytorch` (much higher speed on Intel CPUs) or `pip install intel-extension-for-transformers`.

**HPU**: A docker image with the Gaudi Software Stack is recommended. More details can be found in the [Gaudi Guide](https://docs.habana.ai/en/latest/).

**CUDA**: No extra steps are needed for symmetric quantization; for asymmetric quantization, auto-round must be installed from source.

#### CPU/HPU/CUDA

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from auto_round import AutoRoundConfig

backend = "auto"  # "cpu", "hpu", "cuda"
quantization_config = AutoRoundConfig(backend=backend)
quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(
    quantized_model_path,
    device_map=backend.split(':')[0],
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```

<br>
<details>
<summary>Evaluation</summary>

```bash
auto-round --model saved_quantized_model \
    --eval \
    --task lambada_openai \
    --eval_bs 1
```

</details>

### AutoGPTQ/AutoAWQ format

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_model_path = "./tmp_autoround"
model = AutoModelForCausalLM.from_pretrained(quantized_model_path, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(quantized_model_path)
text = "There is a girl who likes adventure,"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=50)[0]))
```

## Support List

AutoRound supports essentially all major large language models.

Please note that an asterisk (*) indicates third-party quantized models, which may lack accuracy data and use a different recipe. We greatly appreciate their efforts and encourage more users to share their models, as we cannot release most of the models ourselves.

| Model | Supported |
|-------|-----------|
| nvidia/Llama-3.1-Nemotron-70B-Instruct-HF | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Llama-3.1-Nemotron-70B-Instruct-HF-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Llama-3.1-Nemotron-70B-Instruct-HF-int4-sym-inc), |
| meta-llama/Llama-3.2-90B-Vision-Instruct | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Llama-3.2-90B-Vision-Instruct-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Llama-3.2-90B-Vision-Instruct-int4-sym-inc) |
| Qwen/QwQ-32B-Preview | [model-opea-int4-sym-autoround-mixed](https://huggingface.co/OPEA/QwQ-32B-Preview-int4-sym-mixed-inc),[model-opea-int4-sym-autoawq-mixed](https://huggingface.co/OPEA/QwQ-32B-Preview-int4-sym-mixed-awq-inc) |
| THUDM/cogvlm2-llama3-chat-19B | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/cogvlm2-llama3-chat-19B-int4-sym-inc) |
| Qwen/Qwen2-VL-Instruct | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Qwen2-VL-7B-Instruct-int4-sym-inc),[model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Qwen2-VL-7B-Instruct-int4-sym-inc) |
| meta-llama/Llama-3.2-11B-Vision | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Llama-3.2-11B-Vision-Instruct-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Llama-3.2-11B-Vision-Instruct-int4-sym-inc) |
| microsoft/Phi-3.5-vision-instruct | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Phi-3.5-vision-instruct-int4-sym-inc), [model-opea-int4-sym-gptq](https://huggingface.co/OPEA/Phi-3.5-vision-instruct-int4-sym-inc) |
| liuhaotian/llava-v1.5-7b | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/llava-v1.5-7b-int4-sym-inc),[model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/llava-v1.5-7b-int4-sym-inc) |
| Qwen/Qwen2.5-7B-Instruct | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Qwen2.5-7B-Instruct-int4-sym-inc),[model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Qwen2.5-7B-Instruct-int4-sym-inc) [model-kaitchup-autogptq-int4*](https://huggingface.co/kaitchup/Qwen2.5-7B-Instruct-AutoRound-GPTQ-asym-4bit), [recipe](./docs/Qwen2.5-7B-Instruct-sym.md) |
| Qwen/Qwen2.5-14B-Instruct | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Qwen2.5-14B-Instruct-int4-sym-inc),[model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Qwen2.5-14B-Instruct-int4-sym-inc) |
| Qwen/Qwen2.5-32B-Instruct | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Qwen2.5-32B-Instruct-int4-sym-inc) |
| Qwen/Qwen2.5-Coder-32B-Instruct | [model-kaitchup-autogptq-int4*](https://huggingface.co/kaitchup/Qwen2.5-Coder-32B-Instruct-AutoRound-GPTQ-4bit) |
| Qwen/Qwen2.5-72B-Instruct | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Qwen2.5-72B-Instruct-int4-sym-inc),[model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Qwen2.5-72B-Instruct-int4-sym-inc), [model-kaitchup-autogptq-int4*](https://huggingface.co/kaitchup/Qwen2.5-72B-Instruct-AutoRound-GPTQ-4bit), [model-kaitchup-autogptq-int2*](https://huggingface.co/kaitchup/Qwen2.5-72B-Instruct-AutoRound-GPTQ-2bit), [recipe](./docs/Qwen2.5-72B-Instruct-sym.md) |
| meta-llama/Meta-Llama-3.1-70B-Instruct | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Meta-Llama-3.1-70B-Instruct-int4-sym-inc), [model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Meta-Llama-3.1-70B-Instruct-int4-sym-inc),[model-opea-int4-asym-autoround](https://huggingface.co/OPEA/Meta-Llama-3.1-70B-Instruct-int4-asym-inc) |
| meta-llama/Meta-Llama-3.1-8B-Instruct | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/Meta-Llama-3.1-8B-Instruct-int4-sym-inc),[model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/Meta-Llama-3.1-8B-Instruct-int4-sym-inc),[model-kaitchup-autogptq-int4*](https://huggingface.co/kaitchup/Meta-Llama-3.1-8B-Instruct-autoround-gptq-4bit-asym), [model-kaitchup-autogptq-sym-int4*](https://huggingface.co/kaitchup/Meta-Llama-3.1-8B-Instruct-autoround-gptq-4bit-sym), [recipe](https://huggingface.co/Intel/Meta-Llama-3.1-8B-Instruct-int4-inc) |
| meta-llama/Meta-Llama-3.1-8B | [model-kaitchup-autogptq-sym-int4*](https://huggingface.co/kaitchup/Meta-Llama-3.1-8B-autoround-gptq-4bit-sym) |
| Qwen/Qwen2-7B | [model-autoround-sym-int4](https://huggingface.co/Intel/Qwen2-7B-int4-inc), [model-autogptq-sym-int4](https://huggingface.co/Intel/Qwen2-7B-int4-inc) |
| THUDM/glm-4-9b-chat | [model-opea-int4-sym-autoround](https://huggingface.co/OPEA/glm-4-9b-chat-int4-sym-inc),[model-opea-int4-sym-autogptq](https://huggingface.co/OPEA/glm-4-9b-chat-int4-sym-inc) |
| Qwen/Qwen2-57B-A14B-Instruct | [model-autoround-sym-int4](https://huggingface.co/Intel/Qwen2-57B-A14B-Instruct-int4-inc),[model-autogptq-sym-int4](https://huggingface.co/Intel/Qwen2-57B-A14B-Instruct-int4-inc) |
| 01-ai/Yi-1.5-9B | [model-LnL-AI-autogptq-int4*](https://huggingface.co/LnL-AI/Yi-1.5-9B-4bit-gptq-autoround) |
| 01-ai/Yi-1.5-9B-Chat | [model-LnL-AI-autogptq-int4*](https://huggingface.co/LnL-AI/Yi-1.5-9B-Chat-4bit-gptq-autoround) |
| Intel/neural-chat-7b-v3-3 | [model-autogptq-int4](https://huggingface.co/Intel/neural-chat-7b-v3-3-int4-inc) |
| Intel/neural-chat-7b-v3-1 | [model-autogptq-int4](https://huggingface.co/Intel/neural-chat-7b-v3-1-int4-inc) |
| TinyLlama-1.1B-intermediate | [model-LnL-AI-autogptq-int4*](https://huggingface.co/LnL-AI/TinyLlama-1.1B-intermediate-step-1341k-3T-autoround-lm_head-symFalse) |
| mistralai/Mistral-7B-v0.1 | [model-autogptq-lmhead-int4](https://huggingface.co/Intel/Mistral-7B-v0.1-int4-inc-lmhead), [model-autogptq-int4](https://huggingface.co/Intel/Mistral-7B-v0.1-int4-inc) |
| google/gemma-2b | [model-autogptq-int4](https://huggingface.co/Intel/gemma-2b-int4-inc) |
| tiiuae/falcon-7b | [model-autogptq-int4-G64](https://huggingface.co/Intel/falcon-7b-int4-inc) |
| sapienzanlp/modello-italia-9b | [model-fbaldassarri-autogptq-int4*](https://huggingface.co/fbaldassarri/modello-italia-9b-autoround-w4g128-cpu) |
| microsoft/phi-2 | [model-autoround-sym-int4](https://huggingface.co/Intel/phi-2-int4-inc) [model-autogptq-sym-int4](https://huggingface.co/Intel/phi-2-int4-inc) |
| microsoft/Phi-3.5-mini-instruct | [model-kaitchup-autogptq-sym-int4*](https://huggingface.co/kaitchup/Phi-3.5-Mini-instruct-AutoRound-4bit) |
| mistralai/Mistral-7B-Instruct-v0.2 | [outdated-recipe](./docs/Mistral-7B-Instruct-v0.2-asym-recipe.md) |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | [outdated-recipe](./docs/Mixtral-8x7B-Instruct-v0.1-asym-recipe.md) |
| mistralai/Mixtral-8x7B-v0.1 | [outdated-recipe](./docs/Mixtral-8x7B-v0.1-asym-acc.md) |
| meta-llama/Meta-Llama-3-8B-Instruct | [outdated-recipe](./docs/Meta-Llama-3-8B-Instruct-asym-recipe.md) |
| google/gemma-7b | [outdated-recipe](./docs/gemma-7b-asym-recipe.md) |
| meta-llama/Llama-2-7b-chat-hf | [outdated-recipe](./docs/Llama-2-7b-chat-hf-asym-recipe.md) |
| baichuan-inc/Baichuan2-7B-Chat | [outdated-recipe](./docs/baichuan2-7b-cha-asym-recipe.md) |
| 01-ai/Yi-6B-Chat | [outdated-recipe](./docs/Yi-6B-Chat-asym-recipe.md) |
| facebook/opt-2.7b | [outdated-recipe](./docs/opt-2.7b-asym-recipe.md) |
| bigscience/bloom-3b | [outdated-recipe](./docs/bloom-3B-asym-recipe.md) |
| EleutherAI/gpt-j-6b | [outdated-recipe](./docs/gpt-j-6B-asym-recipe.md) |

## Integration

AutoRound has been integrated into multiple repositories:

- [Intel Neural Compressor](https://github.com/intel/neural-compressor)
- [ModelCloud/GPTQModel](https://github.com/ModelCloud/GPTQModel)
- [pytorch/ao](https://github.com/pytorch/ao)

## Reference

If you find AutoRound useful for your research, please cite our paper:

```bibtex
@article{cheng2023optimize,
  title={Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs},
  author={Cheng, Wenhua and Zhang, Weiwei and Shen, Haihao and Cai, Yiyang and He, Xin and Lv, Kaokao and Liu, Yi},
  journal={arXiv preprint arXiv:2309.05516},
  year={2023}
}
```