AWQ 4Bit / GPTQ with full precision gates and head? Please

#4 opened by chriswritescode

I'm super impressed with the Air model. Unfortunately, that's the only model I can run at FP8.

Looking for vLLM- or SGLang-supported quants, please.

Working on the following:

  • AWQ 4-bit, FP16 gates & lm_head (first sketch after this list)
    – The model card explicitly lists skip_layers: ["lm_head", "router"], so the router logits and final head remain unquantized.

  • GPTQ 4-bit, FP16 gates & lm_head (second sketch below)
    – quantize_config.json shows "true_sequential": true, "lm_head": false, "router": false, keeping those layers in FP16.

  • INT4 W4A16 version and an INT8 W8A8 version with 2:4 sparsity, targeting CUDA compute capability 8.6 (Ampere); third sketch below.
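For the AWQ build, here's a minimal sketch with AutoAWQ. The model path and output directory are hypothetical placeholders; `modules_to_not_convert` is AutoAWQ's mechanism for leaving layers in full precision, and the exact module name for the router may differ in the actual model graph:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Hypothetical paths; substitute the actual model repo and output dir.
model_path = "zai-org/GLM-4.5-Air"
quant_path = "GLM-4.5-Air-AWQ"

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
    # Leave the MoE router gates and the final head in full precision,
    # mirroring skip_layers: ["lm_head", "router"] from the model card.
    "modules_to_not_convert": ["router", "lm_head"],
}

model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibrates on AutoAWQ's default dataset; pass calib_data= to override.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

Keeping the router in FP16 matters for MoE models because expert-selection logits are small tensors that are cheap to keep exact but expensive to get wrong.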
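For the GPTQ build, a sketch using AutoGPTQ, again with placeholder paths and a one-line calibration set that stands in for a real few-hundred-sample corpus. `true_sequential=True` matches the quantize_config.json above; AutoGPTQ leaves lm_head unquantized by default, and whether the router is skipped depends on the model's layer mapping:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_path = "zai-org/GLM-4.5-Air"   # hypothetical
quant_path = "GLM-4.5-Air-GPTQ"      # hypothetical

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True,          # activation-order quantization, helps accuracy
    true_sequential=True,   # quantize transformer layers one at a time, in order
)

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_pretrained(
    model_path, quantize_config, trust_remote_code=True
)

# Real calibration should use a few hundred representative samples.
examples = [tokenizer("The capital of France is Paris.", return_tensors="pt")]

model.quantize(examples)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```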
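For the 2:4-sparse INT8 build, one route is vLLM's llm-compressor, sketched below under assumed defaults (the dataset name, sample count, and paths are illustrative). SparseGPT applies the 2:4 mask first, then GPTQ quantizes the surviving weights to W8A8 while ignoring lm_head; the semi-structured sparse tensor cores on compute capability 8.6 (Ampere) can then accelerate the result:

```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.modifiers.quantization import GPTQModifier

model_path = "zai-org/GLM-4.5-Air"     # hypothetical
output_dir = "GLM-4.5-Air-2of4-W8A8"   # hypothetical

recipe = [
    # Prune every Linear weight to 2:4 semi-structured sparsity
    # (at most 2 nonzeros in each contiguous group of 4).
    SparseGPTModifier(
        sparsity=0.5,
        mask_structure="2:4",
        targets=["Linear"],
        ignore=["lm_head"],
    ),
    # Then quantize the remaining weights and activations to INT8.
    GPTQModifier(scheme="W8A8", targets=["Linear"], ignore=["lm_head"]),
]

oneshot(
    model=model_path,
    dataset="open_platypus",           # illustrative calibration set
    recipe=recipe,
    output_dir=output_dir,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

The W4A16 variant would swap the GPTQModifier scheme to "W4A16"; everything else in the recipe stays the same.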

Will update once finished.

A GPTQ quant would be amazing. I need it for my 4x V100s (Volta architecture).
