AWQ 4-bit / GPTQ with full-precision gates and head? Please
#4 · by chriswritescode · opened
I'm super impressed with the Air model. Unfortunately, it's the only model I can run at FP8.
Looking for vLLM- or SGLang-supported quants, please.
Working on the following:
- AWQ 4-bit, FP16 gates & lm_head: the model card explicitly lists `skip_layers: ["lm_head", "router"]`, so the router logits and the final head remain un-quantized.
- GPTQ 4-bit, FP16 gates & lm_head: `quantize_config.json` shows `"true_sequential": true`, `"lm_head": false`, `"router": false`, keeping those layers in FP16.
- INT4 (w4a16) and INT8 (w8a8) versions with 2:4 sparsity, targeting CUDA compute capability 8.6 (Ampere). A rough sketch of the layer-exclusion setup follows this list.
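For the w4a16/w8a8 variants, something like the following is one way to keep the router and `lm_head` in full precision. This is a minimal sketch assuming llm-compressor's `GPTQModifier` API; the model id, output directory, calibration dataset, and the router regex patterns are placeholders, not values confirmed in this thread.

```python
# Minimal sketch, assuming llm-compressor's GPTQModifier API.
# MODEL_ID, OUTPUT_DIR, and the ignore regexes are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

MODEL_ID = "org/air-model"       # placeholder model id
OUTPUT_DIR = "air-model-w4a16"   # placeholder output path

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Quantize all Linear layers to 4-bit weights / 16-bit activations,
# but leave lm_head and the MoE router/gate modules in full precision.
# The "re:" prefix is llm-compressor's regex-ignore convention; the
# exact patterns depend on the model's module names.
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head", "re:.*router.*", "re:.*mlp\\.gate$"],
)

oneshot(
    model=model,
    dataset="open_platypus",     # small calibration set, any text set works
    recipe=recipe,
    max_seq_length=2048,
    num_calibration_samples=512,
)

model.save_pretrained(OUTPUT_DIR, save_compressed=True)
tokenizer.save_pretrained(OUTPUT_DIR)
```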
Will update once finished.
A GPTQ quant would be amazing. I need it for my 4x V100s (Volta architecture).
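For reference, a hypothetical loading sketch for such a GPTQ checkpoint with vLLM on 4x V100s. The model path is a placeholder; note that Volta (SM 7.0) has no bfloat16 support, and the faster Marlin GPTQ kernels require Ampere or newer, so it's worth verifying that your vLLM build still supports GPTQ on Volta at all.

```python
# Hypothetical usage sketch: serving a 4-bit GPTQ checkpoint with
# vLLM across four V100s. The model path is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="air-model-gptq-4bit",  # placeholder checkpoint path
    quantization="gptq",
    dtype="float16",              # V100s lack bf16 support
    tensor_parallel_size=4,       # shard across the four V100s
)

out = llm.generate(["Hello, world"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)
```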