arxiv:2505.15909

Is (Selective) Round-To-Nearest Quantization All You Need?

Published on May 21, 2025
AI-generated summary

Round-to-Nearest (RTN) quantization, when implemented with strategic precision adjustments, offers a cost-effective and high-throughput alternative for quantizing Large Language Models.

Abstract

Quantization has become a necessary tool for serving ever-larger Large Language Models (LLMs). RTN (Round-to-Nearest) is perhaps the simplest quantization technique, and it has been around since well before LLMs surged to the forefront of machine learning (ML) research. Yet it has been largely dismissed by recent, more advanced quantization methods that claim superiority over RTN in nearly every aspect of performance. This work aims to dispel that established view, showing that RTN is not only much cheaper to apply, but also that its token-generation throughput can exceed, and its accuracy can match, that of more advanced alternatives. In particular, we discuss our implementation of RTN based on the recent Marlin kernels and demonstrate how the accuracy of RTN can be gradually improved by selectively increasing the data-precision format of certain model layers and modules. Based on our results, we argue that RTN presents a viable and practical choice for quantizing LLMs.
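The basic RTN operation and the selective-precision idea can be illustrated with a minimal PyTorch sketch. This is not the paper's implementation (which builds on Marlin GPU kernels); the function names, bit-widths, and the choice of "sensitive" layer names below are illustrative assumptions.

```python
import torch

def rtn_quantize(weight: torch.Tensor, n_bits: int = 4):
    """Symmetric per-output-channel round-to-nearest quantization.

    Returns an integer tensor q and per-channel scales such that
    weight is approximately q * scale.
    """
    # Largest representable magnitude of a symmetric signed grid,
    # e.g. 7 for 4-bit.
    q_max = 2 ** (n_bits - 1) - 1

    # One scale per output channel (row), set by the max magnitude.
    max_abs = weight.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    scale = max_abs / q_max

    # The core of RTN: rescale and round to the nearest integer.
    q = torch.round(weight / scale).clamp(-q_max, q_max).to(torch.int8)
    return q, scale

def quantize_model_selectively(model, n_bits=4, high_precision_bits=8,
                               sensitive_keywords=("lm_head", "down_proj")):
    """Apply RTN to all Linear layers, keeping a higher-precision format
    for layers whose names match `sensitive_keywords` (hypothetical list)."""
    for name, module in model.named_modules():
        if not isinstance(module, torch.nn.Linear):
            continue
        bits = (high_precision_bits
                if any(k in name for k in sensitive_keywords) else n_bits)
        q, scale = rtn_quantize(module.weight.data, bits)
        # For simplicity, store the dequantized weights in place; a real
        # deployment would keep q and scale and run a fused low-precision
        # GEMM kernel (e.g. Marlin) instead.
        module.weight.data = (q.float() * scale).to(module.weight.dtype)
    return model
```

Because RTN needs no calibration data or weight optimization, a pass like this is a single cheap sweep over the weights; the selective list of higher-precision layers is the knob the paper uses to trade a little memory for recovered accuracy.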
