squeeze-ai-lab/dbrx-base-a4-s1

KVQuant is a methodology for efficient KV cache quantization that incorporates several innovations to achieve accurate low-precision quantization, thereby enabling efficient long-context-length inference.

TLDR: KVQuant addresses the memory bottleneck in long-context-length inference by quantizing the KV cache to low precision. It achieves high accuracy at low precision by exploiting several patterns that appear consistently in cached KV values across different LLMs. The methods it develops to exploit these patterns include:

Per-channel, Pre-RoPE Key quantization to better match the outlier channels in Keys
Non-Uniform Quantization (NUQ) to better represent the non-uniform activations
Dense-and-Sparse Quantization to mitigate the impacts of numerical outliers on quantization difficulty
Q-Norm to mitigate distribution shift at ultra-low precisions (e.g., 2-bit)
Attention-Sink Aware Quantization to keep the first token, which is disproportionately sensitive to quantization error, from being degraded by quantization
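
To make two of the ideas above concrete, here is a minimal NumPy sketch of per-channel quantization combined with dense-and-sparse outlier handling. It is an illustration under stated assumptions, not the KVQuant implementation: it uses plain uniform quantization in place of NUQ, it picks outliers with per-channel percentile thresholds, and the function name `dense_and_sparse_quantize` and its parameters are hypothetical.

```python
import numpy as np

def dense_and_sparse_quantize(keys, bits=4, outlier_frac=0.01):
    """Quantize a (tokens, channels) array per channel, keeping outliers exact.

    Illustrative only: uniform quantization stands in for NUQ, and the
    percentile-based outlier thresholds are an assumption of this sketch.
    Returns the dequantized reconstruction and a boolean outlier mask.
    """
    keys = np.asarray(keys, dtype=np.float64)
    # Per-channel thresholds enclosing the central (1 - outlier_frac) of values.
    lo = np.quantile(keys, outlier_frac / 2, axis=0, keepdims=True)
    hi = np.quantile(keys, 1 - outlier_frac / 2, axis=0, keepdims=True)
    outliers = (keys < lo) | (keys > hi)
    # Dense part: uniform grid with 2**bits levels over [lo, hi], per channel.
    levels = 2 ** bits - 1
    scale = np.where(hi > lo, (hi - lo) / levels, 1.0)
    codes = np.clip(np.round((keys - lo) / scale), 0, levels)
    dense = codes * scale + lo
    # Sparse part: outliers bypass the grid and are stored in full precision.
    recon = np.where(outliers, keys, dense)
    return recon, outliers

def mean_abs_error(keys, **kwargs):
    recon, _ = dense_and_sparse_quantize(keys, **kwargs)
    return float(np.mean(np.abs(recon - keys)))

# Synthetic data: roughly Gaussian values plus one large outlier per tail.
rng = np.random.default_rng(0)
keys = rng.normal(size=(1000, 4))
keys[0, :] = 100.0
keys[1, :] = -100.0

err_without = mean_abs_error(keys, bits=4, outlier_frac=0.0)
err_with = mean_abs_error(keys, bits=4, outlier_frac=0.01)
```

On this synthetic input, removing 1% of values as outliers shrinks the range that the 4-bit grid must cover, so `err_with` comes out well below `err_without`. Real key and value distributions differ, so treat the comparison as qualitative.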

For more details, please check out our paper.

Model description

Quantizer file for running DBRX with a 4-bit KV cache using KVQuant.

Base Model: DBRX
Bitwidth: 4-bit
Sparsity Level: 1%
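
As a rough guide to what these settings mean for memory, the sketch below estimates KV cache size from the bitwidth and sparsity level. It rests on assumptions: each outlier is taken to cost 32 extra bits (a 16-bit value plus a 16-bit index), and the layer count, KV head count, and head dimension in the usage lines are illustrative placeholders, not values read from the DBRX configuration. The function names are hypothetical.

```python
def effective_bits(bits=4, sparsity=0.01, outlier_bits=32):
    """Average bits per cached element under dense-and-sparse storage.

    Assumes every element keeps its dense code and each outlier adds
    `outlier_bits` for its value and index (an assumption of this sketch).
    """
    return bits + sparsity * outlier_bits

def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bits=16.0):
    """Bytes needed to cache Keys and Values for one sequence."""
    # Factor of 2: one Key tensor and one Value tensor per layer.
    n_elements = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return n_elements * bits / 8

# Illustrative placeholder shape; check the model's config for real values.
shape = dict(seq_len=32_768, n_layers=40, n_kv_heads=8, head_dim=128)
fp16_gib = kv_cache_bytes(bits=16, **shape) / 2**30
quant_gib = kv_cache_bytes(bits=effective_bits(4, 0.01), **shape) / 2**30
print(f"fp16: {fp16_gib:.2f} GiB, 4-bit with 1% outliers: {quant_gib:.2f} GiB")
```

Under these assumptions, 4-bit storage with 1% outliers averages about 4.32 bits per element, roughly a 3.7x reduction relative to fp16. The estimate ignores quantizer metadata such as per-channel scaling factors.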

license: mit