QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs
Abstract
We introduce QuaRot, a new Quantization scheme based on Rotations that quantizes LLMs end-to-end, with all weights, activations, and the KV cache held in 4 bits. QuaRot rotates LLMs in a way that removes outliers from the hidden state without changing the output, making quantization easier. This computational invariance is applied to the hidden state (residual) of the LLM, as well as to the activations of the feed-forward components, parts of the attention mechanism, and the KV cache. The result is a quantized model in which all matrix multiplications are performed in 4 bits, with no channels kept in higher precision. Our quantized LLaMa2-70B model loses at most 0.29 WikiText-2 perplexity and retains 99% of its zero-shot performance. Code is available at: https://github.com/spcl/QuaRot.
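The core idea, that multiplying activations by an orthogonal matrix and weights by its transpose leaves the matrix product unchanged while spreading outlier channels across all dimensions, can be illustrated with a minimal sketch. The Sylvester-style Hadamard construction and the per-tensor symmetric 4-bit quantizer below are illustrative assumptions, not the QuaRot implementation.

```python
import torch

def hadamard(n: int) -> torch.Tensor:
    """Sylvester construction of an n x n Hadamard matrix (n must be a power of two)."""
    H = torch.ones(1, 1)
    while H.shape[0] < n:
        H = torch.cat([torch.cat([H, H], dim=1),
                       torch.cat([H, -H], dim=1)], dim=0)
    return H / (n ** 0.5)  # orthonormal: H @ H.T = I

def quantize_int4(t: torch.Tensor) -> torch.Tensor:
    """Per-tensor symmetric 4-bit fake quantization (round-trip back to float)."""
    scale = t.abs().max() / 7.0
    return torch.clamp(torch.round(t / scale), -8, 7) * scale

torch.manual_seed(0)
d = 64
x = torch.randn(8, d)
x[:, 3] *= 50.0                     # inject an outlier channel into the activations
W = torch.randn(d, d) / d ** 0.5

Q = hadamard(d)                     # orthogonal rotation, Q @ Q.T = I
x_rot, W_rot = x @ Q, Q.T @ W       # computational invariance: (xQ)(Q^T W) = xW

exact = x @ W
err_plain = (quantize_int4(x) @ quantize_int4(W) - exact).abs().mean().item()
err_rot = (quantize_int4(x_rot) @ quantize_int4(W_rot) - exact).abs().mean().item()
print(f"4-bit error without rotation: {err_plain:.4f}")
print(f"4-bit error with rotation:    {err_rot:.4f}")  # typically much smaller
```

Because the rotation spreads the outlier's energy over all channels, the per-tensor quantization scale is no longer dominated by a single large channel, which is why the rotated 4-bit matmul tracks the full-precision result more closely in this toy setting.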