We look forward to a perfect AWQ or GPTQ quantized version.

#3
by su400 - opened

We look forward to a high-quality AWQ or GPTQ quantized version. Given the improved programming and mathematical capabilities of the new R1 release, traditional quantization methods may need adjustment so that the specialized math and coding knowledge is preserved as much as possible rather than lost to compression. GPTQ checkpoints released by other organizations show a noticeably higher error rate than the official weights on longer programming tasks, and the degradation is significant. A slightly larger memory footprint is acceptable if it keeps programming and mathematical capability intact. Taking a single H20 node with 768 GB of VRAM as the baseline, sustaining a 65,535-token context length under that hardware budget would be ideal.
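
For reference, serving a 4-bit quantized checkpoint under roughly that hardware budget might look like the vLLM sketch below. The model id is a placeholder (no such official release is implied), and the parallelism setting just assumes one 8-GPU H20 node:

```python
from vllm import LLM, SamplingParams

# Placeholder model id for a hypothetical 4-bit (AWQ/GPTQ) checkpoint.
llm = LLM(
    model="deepseek-ai/DeepSeek-R1-0528-AWQ",  # hypothetical path, not an official release
    tensor_parallel_size=8,   # assumption: one 8x H20 node (~768 GB total VRAM)
    max_model_len=65535,      # target context length from this request
    quantization="awq",       # or "gptq", depending on the checkpoint format
)

out = llm.generate(
    ["Write a Python function that merges two sorted lists."],
    SamplingParams(max_tokens=512, temperature=0.6),
)
print(out[0].outputs[0].text)
```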

IST Austria Distributed Algorithms and Systems Lab org

@su400 Evaluations are currently in progress. GPTQ/AWQ 4-bit weight-only quantization typically achieves close to full recovery (98–99%) on a selection of reasoning tasks (AIME, GPQA, MATH500). Perfect recovery would require either more sophisticated quantization (such as vector quantization), which results in slower inference, or expensive quantization-aware training (QAT).
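
For context, a plain 4-bit weight-only GPTQ pass with the AutoGPTQ library follows the pattern sketched below; the model id and calibration prompts are placeholders for illustration, not our actual recipe:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-125m"  # placeholder; substitute the target model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# A couple of calibration prompts; real runs use a few hundred longer samples,
# ideally drawn from math and coding data to protect those capabilities.
texts = [
    "Prove that the sum of two even integers is even.",
    "Implement binary search in Python and explain its complexity.",
]
examples = [
    {"input_ids": enc["input_ids"], "attention_mask": enc["attention_mask"]}
    for enc in (tokenizer(t, return_tensors="pt") for t in texts)
]

quantize_config = BaseQuantizeConfig(
    bits=4,          # 4-bit weight-only
    group_size=128,  # common group size; smaller groups trade memory for accuracy
    desc_act=True,   # activation-order quantization, usually helps recovery
)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)
model.quantize(examples)
model.save_quantized("opt-125m-gptq-4bit")
```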
