---
quantized_by: ubergarm
pipeline_tag: text-generation
base_model: zai-org/GLM-4.5-Air
license: mit
base_model_relation: quantized
tags:
- imatrix
- conversational
- ik_llama.cpp
---
NOTE: The ik_llama.cpp PR adding support for this model is still in progress and not yet merged into the main branch. Until it lands, follow the instructions here and keep an eye on the PR: https://github.com/ikawrakow/ik_llama.cpp/pull/668
# ik_llama.cpp imatrix Quantizations of zai-org/GLM-4.5-Air
This quant collection REQUIRES the ik_llama.cpp fork to support ik's latest SOTA quants and optimizations! Do not download these big files expecting them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!
NOTE: ik_llama.cpp can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc. if you want to try it out before downloading my quants.
Some of ik's new quants are also supported by the Nexesenex/croco.cpp fork of KoboldCpp.
These quants provide best in class perplexity for the given memory footprint.
## Big Thanks
Shout out to Wendell and the Level1Techs crew, the community forums, and the YouTube channel! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!
Also thanks to all the folks in the quanting and inferencing community on the BeaverAI Club Discord and on r/LocalLLaMA for sharing the tips and tricks that help everyone run, test, and benchmark all the fun new models!
## Quant Collection
Perplexity computed against wiki.test.raw.
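For reference, a perplexity run with ik_llama.cpp looks roughly like the sketch below. The model path, context size, and thread count here are illustrative placeholders, not the exact values used for the numbers in this table:

```bash
# Hedged sketch: llama-perplexity streams wiki.test.raw through the
# model in fixed-size chunks and prints a "Final estimate: PPL = ..." line.
./build/bin/llama-perplexity \
    --model GLM-4.5-Air-IQ4_KSS-00001-of-00002.gguf \
    -f wiki.test.raw \
    -fa -fmoe \
    --ctx-size 512 \
    --threads 8
```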
These first two are just test quants for baseline perplexity comparison:

- BF16 205.811 GiB (16.004 BPW) - Final estimate: PPL = TODO
- Q8_0 109.381 GiB (8.505 BPW) - Final estimate: PPL = TODO

IQ4_KSS 54.801 GiB (4.261 BPW)
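BPW (bits per weight) is just total model bits divided by parameter count, so the listed sizes imply the weight count. A quick sanity check with awk, using the BF16 row (the ~110 B figure is derived here, not stated in the source):

```bash
# Back out the parameter count from the BF16 row:
# bytes * 8 / BPW = weights. 205.811 GiB at 16.004 BPW
# comes to roughly 110 billion weights.
awk 'BEGIN { printf "%.0f\n", 205.811 * 1073741824 * 8 / 16.004 / 1e9 }'
# -> 110
```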
👈 Secret Recipe
#!/usr/bin/env bash
custom="
# 47 Repeating Layers [0-46]
# Note: The ffn_down.* row sizes are not divisible by 256, so those tensors have limited quantization options.
# Attention
blk\.(0|1)\.attn_q.*=q8_0
blk\.(0|1)\.attn_k.*=q8_0
blk\.(0|1)\.attn_v.*=q8_0
blk\.(0|1)\.attn_output.*=q8_0
blk\..*\.attn_q.*=iq5_ks
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_v.*=iq5_ks
blk\..*\.attn_output.*=iq5_ks
# First Dense Layer [0]
blk\..*\.ffn_down\.weight=q6_0
blk\..*\.ffn_(gate|up)\.weight=iq5_ks
# Shared Expert Layers [1-46]
blk\..*\.ffn_down_shexp\.weight=q6_0
blk\..*\.ffn_(gate|up)_shexp\.weight=iq5_ks
# Routed Experts Layers [1-46]
blk\..*\.ffn_down_exps\.weight=iq4_nl
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
numactl -N 0 -m 0 \
./build/bin/llama-quantize \
--custom-q "$custom" \
--imatrix /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/imatrix-GLM-4.5-Air-BF16.dat \
/mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-128x9.4B-BF16-00001-of-00005.gguf \
/mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-IQ4_KSS.gguf \
IQ4_KSS \
192
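The grep/sed step above turns the human-readable rule list into the single comma-separated string that `--custom-q` expects: grep drops the comment lines, and `sed -z` (treating the whole input as one record) joins the remaining lines with commas and trims stray leading/trailing commas. A standalone demo of just that transformation:

```bash
#!/usr/bin/env bash
# Demo: convert a newline-separated rule list (with comments) into
# the comma-separated form expected by llama-quantize --custom-q.
custom="
# comment line is dropped
token_embd\.weight=iq4_k
output\.weight=iq6_k
"
custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
echo "$custom"
# -> token_embd\.weight=iq4_k,output\.weight=iq6_k
```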
## Quick Start
# Clone and checkout experimental PR
$ git clone https://github.com/ikawrakow/ik_llama.cpp
$ cd ik_llama.cpp
$ git remote add Thireus https://github.com/Thireus/ik_llama.cpp.git
$ git fetch Thireus
$ git checkout glm-4.5-testing
# If glm-4.5-clean is ready, use it instead of -testing
# $ git checkout glm-4.5-clean
# Build for hybrid CPU+CUDA
$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
$ cmake --build build --config Release -j $(nproc)
# Test Experimental GGUF
$ ./build/bin/llama-server \
--model GLM-4.5-Air-IQ4_KSS-00001-of-00002.gguf \
--alias ubergarm/GLM-4.5-Air-IQ4_KSS \
--ctx-size 32768 \
-fa -fmoe \
-ctk q8_0 -ctv q8_0 \
-ub 4096 -b 4096 \
-ngl 99 \
-ot exps=CPU \
--parallel 1 \
--threads 8 \
--host 127.0.0.1 \
--port 8080 \
--no-mmap
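Once the server is up, you can smoke-test it with a chat completion request. This assumes the OpenAI-compatible endpoint path that llama.cpp's server exposes; adjust the host/port if you changed them above:

```bash
# Hedged sketch: send one chat request to the locally running llama-server.
curl -s http://127.0.0.1:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "ubergarm/GLM-4.5-Air-IQ4_KSS",
          "messages": [{"role": "user", "content": "Hello!"}],
          "max_tokens": 32
        }'
```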