Abstract
A unified scaling law for quantization-aware training (QAT) identifies model size, training data volume, and quantization group size as the key factors driving quantization error, and motivates mixed-precision quantization to address the resulting bottlenecks.
Large language models (LLMs) demand substantial computational and memory resources, creating deployment challenges. Quantization-aware training (QAT) addresses these challenges by reducing model precision while maintaining performance. However, the scaling behavior of QAT, especially at 4-bit precision (W4A4), is not well understood. Existing QAT scaling laws often ignore key factors such as the number of training tokens and quantization granularity, which limits their applicability. This paper proposes a unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size. Through 268 QAT experiments, we show that quantization error decreases as model size increases, but rises with more training tokens and coarser quantization granularity. To identify the sources of W4A4 quantization error, we decompose it into weight and activation components. Both components follow the overall trend of W4A4 quantization error, but with different sensitivities. Specifically, weight quantization error increases more rapidly with more training tokens. Further analysis shows that the activation quantization error in the FC2 layer, caused by outliers, is the primary bottleneck of W4A4 QAT quantization error. By applying mixed-precision quantization to address this bottleneck, we demonstrate that weight and activation quantization errors can converge to similar levels. Additionally, with more training data, weight quantization error eventually exceeds activation quantization error, suggesting that reducing weight quantization error is also important in such scenarios. These findings offer key insights for improving QAT research and development.
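As a rough illustration of the kind of law described above, the reported trends can be captured by a quantization error δ that shrinks with model size N, grows with the number of training tokens D, and grows with quantization group size G. The form below is an assumption for illustration only; the paper's actual fitted law and exponents are not given in this abstract.

```latex
% Illustrative parameterization consistent with the trends stated in the abstract
% (error falls with N, rises with D and G); the paper's fitted form may differ.
\delta(N, D, G) \;=\; k \cdot \frac{D^{\beta}\,\bigl(\log_2 G\bigr)^{\gamma}}{N^{\alpha}},
\qquad \alpha,\ \beta,\ \gamma,\ k > 0
```

The group size G refers to how many consecutive values share one quantization scale. Below is a minimal sketch of group-wise symmetric fake quantization, the operation a W4A4 QAT forward pass simulates; the function and parameter names are illustrative and not the paper's implementation.

```python
import torch

def fake_quantize_groupwise(x: torch.Tensor, n_bits: int = 4, group_size: int = 128) -> torch.Tensor:
    """Simulate group-wise symmetric quantization (quantize, then dequantize).

    Illustrative sketch only: QAT frameworks differ in rounding, clipping,
    and gradient handling (e.g., straight-through estimators).
    """
    orig_shape = x.shape
    x = x.reshape(-1, group_size)                      # each row is one quantization group
    qmax = 2 ** (n_bits - 1) - 1                       # e.g., 7 for 4-bit symmetric quantization
    scale = x.abs().amax(dim=1, keepdim=True) / qmax   # one scale per group
    scale = scale.clamp(min=1e-8)                      # avoid division by zero
    x_q = (x / scale).round().clamp(-qmax - 1, qmax)   # snap to the low-bit grid
    return (x_q * scale).reshape(orig_shape)           # dequantize back to float

# Coarser granularity (larger group_size) means fewer scales and larger quantization
# error, matching the trend reported in the abstract.
w = torch.randn(4096, 4096)
err_fine   = (w - fake_quantize_groupwise(w, group_size=32)).pow(2).mean()
err_coarse = (w - fake_quantize_groupwise(w, group_size=256)).pow(2).mean()
```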
Community
A unified scaling law for QAT that models quantization error as a function of model size, training data volume, and quantization group size.
Librarian Bot (automated): the following similar papers were recommended by the Semantic Scholar API.
- Achieving binary weight and activation for LLMs using Post-Training Quantization (2025)
- Enhancing Ultra-Low-Bit Quantization of Large Language Models Through Saliency-Aware Partial Retraining (2025)
- Quartet: Native FP4 Training Can Be Optimal for Large Language Models (2025)
- QUAD: Quantization and Parameter-Efficient Tuning of LLM with Activation Decomposition (2025)
- Precision Where It Matters: A Novel Spike Aware Mixed-Precision Quantization Strategy for LLaMA-based Language Models (2025)
- GranQ: Granular Zero-Shot Quantization with Channel-Wise Activation Scaling in QAT (2025)
- Resource-Efficient Language Models: Quantization for Fast and Accessible Inference (2025)