SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training
Abstract
Efficiency enhancements for attention mechanisms, including leveraging FP4 Tensor Cores and developing an 8-bit attention method, improve inference and training performance.
The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions. First, we leverage the new FP4 Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves 1038 TOPS on the RTX5090, a 5x speedup over the fastest FlashAttention on the same GPU. Experiments show that our FP4 attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention for training tasks. Existing low-bit attention works, such as FlashAttention3 and SageAttention, focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training, we design an accurate and efficient 8-bit attention for both forward and backward propagation. Experiments indicate that 8-bit attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code will be available at https://github.com/thu-ml/SageAttention.
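The abstract does not spell out the quantization scheme, but the core idea of microscaling FP4 can be illustrated with a short simulation. The sketch below is an assumption-laden PyTorch illustration, not the paper's Blackwell kernel: it fake-quantizes Q and K in micro-blocks of 16 values that share one scale, rounds each element to the FP4 (E2M1) value grid, and then runs standard attention. The block size, the scale handling, and the choice to keep V and the softmax in high precision are illustrative assumptions.

```python
# Minimal simulation of microscaled ("micro-block") FP4 quantization applied to
# attention inputs. This is a sketch under stated assumptions, not SageAttention3's
# CUDA implementation: real FP4 Tensor Core kernels store packed 4-bit values and
# hardware-format shared scales; here we only emulate the rounding error.
import torch
import torch.nn.functional as F

# Representable non-negative magnitudes of FP4 (E2M1).
FP4_E2M1_VALUES = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4(x: torch.Tensor, block: int = 16) -> torch.Tensor:
    """Fake-quantize the last dim of x in micro-blocks of `block` values sharing one scale."""
    *lead, d = x.shape
    assert d % block == 0, "head dim must be divisible by the micro-block size"
    xb = x.reshape(*lead, d // block, block)
    # One shared scale per micro-block, chosen so the block's max magnitude maps to 6.0
    # (the FP4 maximum). Real microscaling formats constrain the scale's own format;
    # that detail is elided here.
    scale = xb.abs().amax(dim=-1, keepdim=True).clamp(min=1e-12) / 6.0
    # Round each scaled element to the nearest representable FP4 magnitude, keep the sign.
    grid = FP4_E2M1_VALUES.to(x.device, x.dtype)
    idx = ((xb / scale).abs().unsqueeze(-1) - grid).abs().argmin(dim=-1)
    q = grid[idx] * xb.sign()
    # Return the dequantized tensor so it can feed a standard attention call.
    return (q * scale).reshape(*lead, d)

def fp4_sim_attention(q, k, v, is_causal=False):
    """Attention with Q/K fake-quantized to microscaled FP4; V left in high precision (an assumption)."""
    return F.scaled_dot_product_attention(
        quantize_mxfp4(q), quantize_mxfp4(k), v, is_causal=is_causal)

if __name__ == "__main__":
    q, k, v = (torch.randn(1, 8, 128, 64) for _ in range(3))
    out_ref = F.scaled_dot_product_attention(q, k, v)
    out_fp4 = fp4_sim_attention(q, k, v)
    print("max abs error vs. full precision:", (out_ref - out_fp4).abs().max().item())
```

For actual use, the released kernels are described as plug-and-play; earlier SageAttention releases expose a `sageattn` function as a drop-in replacement for `torch.nn.functional.scaled_dot_product_attention`, and the SageAttention3 release presumably follows a similar interface (see the linked repository for the authoritative API).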
Community
SageAttention3: Microscaling FP4 Attention for inference with a 5x speedup and an 8-bit Attention for Training.
The code will be available at https://github.com/thu-ml/SageAttention.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- BitNet v2: Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs (2025)
- FlashBias: Fast Computation of Attention with Bias (2025)
- SpecOffload: Unlocking Latent GPU Capacity for LLM Inference on Resource-Constrained Devices (2025)
- Distillation-Supervised Convolutional Low-Rank Adaptation for Efficient Image Super-Resolution (2025)
- Towards Practical Second-Order Optimizers in Deep Learning: Insights from Fisher Information Analysis (2025)
- Enhancing Ultra-Low-Bit Quantization of Large Language Models Through Saliency-Aware Partial Retraining (2025)