AWQ 4Bit / GPTQ with full precision gates and head? Please

#4 opened by chriswritescode

I'm super impressed with the Air model. Unfortunately, that's the only model I can run at FP8.

Looking for vLLM- or SGLang-supported quants, please.

Working on the following:

  • AWQ 4-bit, FP16 gates & lm_head (first sketch after this list)
    – The model card explicitly lists skip_layers: ["lm_head", "router"], so the router logits and final head remain unquantized.

  • GPTQ 4-bit, FP16 gates & lm_head (second sketch below)
    – quantize_config.json shows "true_sequential": true, "lm_head": false, "router": false, keeping those layers in FP16.

  • INT4 W4A16 version and an INT8 W8A8 version with 2:4 sparsity, targeting CUDA compute capability 8.6 (Ampere); third sketch below.
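For the AWQ build, here's a minimal sketch with AutoAWQ. The model path and output directory are hypothetical placeholders; `modules_to_not_convert` is AutoAWQ's mechanism for leaving layers in full precision, and the exact module name for the router may differ in the actual model graph:

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Hypothetical paths; substitute the actual model repo and output dir.
model_path = "zai-org/GLM-4.5-Air"
quant_path = "GLM-4.5-Air-AWQ"

quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
    # Leave the MoE router gates and the final head in full precision,
    # mirroring skip_layers: ["lm_head", "router"] from the model card.
    "modules_to_not_convert": ["router", "lm_head"],
}

model = AutoAWQForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Calibrates on AutoAWQ's default dataset; pass calib_data= to override.
model.quantize(tokenizer, quant_config=quant_config)

model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

Keeping the router in FP16 matters for MoE models because expert-selection logits are small tensors that are cheap to keep exact but expensive to get wrong.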
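For the GPTQ build, a sketch using AutoGPTQ, again with placeholder paths and a one-line calibration set that stands in for a real few-hundred-sample corpus. `true_sequential=True` matches the quantize_config.json above; AutoGPTQ leaves lm_head unquantized by default, and whether the router is skipped depends on the model's layer mapping:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_path = "zai-org/GLM-4.5-Air"   # hypothetical
quant_path = "GLM-4.5-Air-GPTQ"      # hypothetical

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True,          # activation-order quantization, helps accuracy
    true_sequential=True,   # quantize transformer layers one at a time, in order
)

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoGPTQForCausalLM.from_pretrained(
    model_path, quantize_config, trust_remote_code=True
)

# Real calibration should use a few hundred representative samples.
examples = [tokenizer("The capital of France is Paris.", return_tensors="pt")]

model.quantize(examples)
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```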
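For the 2:4-sparse INT8 build, one route is vLLM's llm-compressor, sketched below under assumed defaults (the dataset name, sample count, and paths are illustrative). SparseGPT applies the 2:4 mask first, then GPTQ quantizes the surviving weights to W8A8 while ignoring lm_head; the semi-structured sparse tensor cores on compute capability 8.6 (Ampere) can then accelerate the result:

```python
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.obcq import SparseGPTModifier
from llmcompressor.modifiers.quantization import GPTQModifier

model_path = "zai-org/GLM-4.5-Air"     # hypothetical
output_dir = "GLM-4.5-Air-2of4-W8A8"   # hypothetical

recipe = [
    # Prune every Linear weight to 2:4 semi-structured sparsity
    # (at most 2 nonzeros in each contiguous group of 4).
    SparseGPTModifier(
        sparsity=0.5,
        mask_structure="2:4",
        targets=["Linear"],
        ignore=["lm_head"],
    ),
    # Then quantize the remaining weights and activations to INT8.
    GPTQModifier(scheme="W8A8", targets=["Linear"], ignore=["lm_head"]),
]

oneshot(
    model=model_path,
    dataset="open_platypus",           # illustrative calibration set
    recipe=recipe,
    output_dir=output_dir,
    max_seq_length=2048,
    num_calibration_samples=512,
)
```

The W4A16 variant would swap the GPTQModifier scheme to "W4A16"; everything else in the recipe stays the same.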

Will update once finished.

A GPTQ quant would be amazing. I need it for my 4x V100s (Volta architecture).
