Quartet: Native FP4 Training Can Be Optimal for Large Language Models
Abstract
Quartet, a hardware-supported FP4 training approach for large language models, achieves state-of-the-art accuracy while significantly reducing computational costs compared to standard-precision or FP8 training.
The rapid advancement of large language models (LLMs) has been paralleled by unprecedented increases in computational demands, with training costs for state-of-the-art models doubling every few months. Training models directly in low-precision arithmetic offers a solution, by improving both computational throughput and energy efficiency. In particular, NVIDIA's recent Blackwell architecture facilitates extremely low-precision operations, specifically FP4 variants, promising substantial efficiency gains. Yet, current algorithms for training LLMs in FP4 precision face significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we systematically investigate hardware-supported FP4 training and introduce Quartet, a new approach enabling accurate, end-to-end FP4 training with all the major computations (e.g., in linear layers) performed in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across varying bit-widths and allows us to identify Quartet as a "near-optimal" low-precision training technique in terms of accuracy versus computation. We implement Quartet using optimized CUDA kernels tailored for NVIDIA Blackwell GPUs, and show that it achieves state-of-the-art accuracy for FP4 precision, successfully training billion-scale models. Our method demonstrates that fully FP4-based training is a competitive alternative to standard-precision and FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.
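To make the setting concrete, the sketch below shows round-to-nearest FP4 (E2M1) fake-quantization of a weight matrix with per-group absmax scales, the kind of number format that hardware-supported FP4 matrix multiplications consume. This is only an illustration under stated assumptions, not the Quartet algorithm or its CUDA kernels; the function name, the group size of 32, and the quantize-dequantize structure are choices made for this example.

```python
# Illustrative sketch of per-group FP4 (E2M1) round-to-nearest fake-quantization.
# NOT the Quartet algorithm itself: group size, scale format, and function names
# are assumptions for illustration only.
import torch

# The eight non-negative magnitudes representable in the FP4 E2M1 format.
E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_quant_dequant(x: torch.Tensor, group_size: int = 32) -> torch.Tensor:
    """Fake-quantize `x` to FP4 (E2M1) values with per-group absmax scales."""
    orig_shape = x.shape
    g = x.reshape(-1, group_size)                      # [num_groups, group_size]
    scale = g.abs().amax(dim=1, keepdim=True) / 6.0    # map group absmax to 6 (FP4 max)
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    mag = (g / scale).abs()                            # magnitudes now lie in [0, 6]
    # Round each magnitude to the nearest representable E2M1 value.
    idx = (mag.unsqueeze(-1) - E2M1_GRID).abs().argmin(dim=-1)
    q = E2M1_GRID[idx] * g.sign() * scale              # dequantized FP4 values
    return q.reshape(orig_shape)

# Example: fake-quantized weights used in a linear layer's forward pass.
w = torch.randn(128, 128)
x = torch.randn(4, 128)
y = x @ fp4_quant_dequant(w).T
```

In an actual low-precision training pipeline, quantization of this kind would be applied to the inputs of the forward and backward matrix multiplications so that the GEMMs themselves run on FP4 tensor cores; the fake-quantization above only emulates the numerics in full precision.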
Community
The RTX5090 kernels will be published next week.
The B200 kernels are still in the early stages of development.
Please refer to the repo for updates.
Related papers recommended by the Semantic Scholar API (via Librarian Bot):
- Low-Precision Training of Large Language Models: Methods, Challenges, and Opportunities (2025)
- Scaling Law for Quantization-Aware Training (2025)
- An Extra RMSNorm is All You Need for Fine Tuning to 1.58 Bits (2025)
- Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference (2025)
- Is (Selective) Round-To-Nearest Quantization All You Need? (2025)
- Ecco: Improving Memory Bandwidth and Capacity for LLMs via Entropy-aware Cache Compression (2025)
- Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures (2025)