GLM-4.5-Air-GGUF / README.md

fixup readme

e0bb40e 29 days ago

13.8 kB

	---
	quantized_by: ubergarm
	pipeline_tag: text-generation
	base_model: zai-org/GLM-4.5-Air
	license: mit
	base_model_relation: quantized
	tags:
	- imatrix
	- conversational
	- ik_llama.cpp
	---

	## `ik_llama.cpp` imatrix Quantizations of zai-org/GLM-4.5-Air
	This quant collection REQUIRES [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support the ik's latest SOTA quants and optimizations! Do not download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!

	NOTE `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc if you want to try it out before downloading my quants.

	Some of ik's new quants are supported with [Nexesenex/croco.cpp](https://github.com/Nexesenex/croco.cpp) fork of KoboldCPP with Windows builds for CUDA 12.9. Also check for [Windows builds by Thireus here.](https://github.com/Thireus/ik_llama.cpp/releases) which have been CUDA 12.8.

	These quants provide best in class perplexity for the given memory footprint.

	## Big Thanks
	Shout out to Wendell and the Level1Techs crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), [YouTube Channel](https://www.youtube.com/@Level1Techs)! BIG thanks for providing BIG hardware expertise and access to run these experiments and make these great quants available to the community!!!

	Also thanks to all the folks in the quanting and inferencing community on [BeaverAI Club Discord](https://huggingface.co/BeaverAI) and on [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/) for tips and tricks helping each other run, test, and benchmark all the fun new models!

	## Quant Collection
	Perplexity computed against wiki.test.raw.

	![Perplexity Chart](images/perplexity.png "Chart showing Perplexity improving as BPW increases.")

	These first two are just test quants for baseline perplexity comparison:
	* `BF16` 205.811 GiB (16.004 BPW)
	- Final estimate: PPL = 4.5704 +/- 0.02796

	* `Q8_0` 109.381 GiB (8.505 BPW)
	- Final estimate: PPL = 4.5798 +/- 0.02804

	## IQ5_K 77.704 GiB (6.042 BPW)
	Final estimate: PPL = 4.5867 +/- 0.02806

	<details>

	<summary>👈 Secret Recipe</summary>

	```bash
	#!/usr/bin/env bash

	custom="
	# 47 Repeating Layers [0-46]
	# Note: All ffn_down.* layers are not divisible by 256 so have limited quantization options.

	# Attention
	blk\..\.attn_q.=q8_0
	blk\..\.attn_k.=q8_0
	blk\..\.attn_v.=q8_0
	blk\..\.attn_output.=q8_0

	# First 1 Dense Layers [0]
	blk\..*\.ffn_down\.weight=q8_0
	blk\..*\.ffn_(gate\|up)\.weight=q8_0

	# Shared Expert Layers [1-46]
	blk\..*\.ffn_down_shexp\.weight=q8_0
	blk\..*\.ffn_(gate\|up)_shexp\.weight=q8_0

	# Routed Experts Layers [1-46]
	blk\.(1)\.ffn_down_exps\.weight=q8_0
	blk\.(1)\.ffn_(gate\|up)_exps\.weight=q8_0

	blk\..*\.ffn_down_exps\.weight=q6_0
	blk\..*\.ffn_(gate\|up)_exps\.weight=iq5_k

	# NextN MTP Layer [46]
	blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
	blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
	blk\..*\.nextn\.eh_proj\.weight=q8_0

	# Non-Repeating Layers
	token_embd\.weight=iq6_k
	output\.weight=iq6_k
	"

	custom=$(
	echo "$custom" \| grep -v '^#' \| \
	sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
	)

	numactl -N 0 -m 0 \
	./build/bin/llama-quantize \
	--custom-q "$custom" \
	--imatrix /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/imatrix-GLM-4.5-Air-BF16.dat \
	/mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-128x9.4B-BF16-00001-of-00005.gguf \
	/mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-IQ5_K.gguf \
	IQ5_K \
	192
	```

	</details>

	## IQ5_KS 72.855 GiB (5.665 BPW)
	Final estimate: PPL = 4.5948 +/- 0.02815

	<details>

	<summary>👈 Secret Recipe</summary>

	```bash
	#!/usr/bin/env bash

	custom="
	# 47 Repeating Layers [0-46]
	# Note: All ffn_down.* layers are not divisible by 256 so have limited quantization options.

	# Attention
	blk\..\.attn_q.=iq5_ks
	blk\..\.attn_k.=q8_0
	blk\..\.attn_v.=q8_0
	blk\..\.attn_output.=iq5_ks

	# First 1 Dense Layers [0]
	blk\..*\.ffn_down\.weight=q6_0
	blk\..*\.ffn_(gate\|up)\.weight=iq5_ks

	# Shared Expert Layers [1-46]
	blk\..*\.ffn_down_shexp\.weight=q6_0
	blk\..*\.ffn_(gate\|up)_shexp\.weight=iq5_ks

	# Routed Experts Layers [1-46]
	blk\..*\.ffn_down_exps\.weight=q6_0
	blk\..*\.ffn_(gate\|up)_exps\.weight=iq5_ks

	# NextN MTP Layer [46]
	blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
	blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
	blk\..*\.nextn\.eh_proj\.weight=q8_0

	# Non-Repeating Layers
	token_embd\.weight=iq4_k
	output\.weight=iq6_k
	"

	custom=$(
	echo "$custom" \| grep -v '^#' \| \
	sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
	)

	numactl -N 0 -m 0 \
	./build/bin/llama-quantize \
	--custom-q "$custom" \
	--imatrix /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/imatrix-GLM-4.5-Air-BF16.dat \
	/mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-128x9.4B-BF16-00001-of-00005.gguf \
	/mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-IQ5_KS.gguf \
	IQ5_KS \
	192
	```

	</details>

	## IQ4_K 62.910 GiB (4.892 BPW)
	Final estimate: PPL = 4.6273 +/- 0.02839

	<details>

	<summary>👈 Secret Recipe</summary>

	```bash
	#!/usr/bin/env bash

	custom="
	# 47 Repeating Layers [0-46]
	# Note: All ffn_down.* layers are not divisible by 256 so have limited quantization options.

	# Attention
	blk\..\.attn_q.=iq5_ks
	blk\..\.attn_k.=q8_0
	blk\..\.attn_v.=q8_0
	blk\..\.attn_output.=iq5_ks

	# First 1 Dense Layers [0]
	blk\..*\.ffn_down\.weight=q6_0
	blk\..*\.ffn_(gate\|up)\.weight=iq5_ks

	# Shared Expert Layers [1-46]
	blk\..*\.ffn_down_shexp\.weight=q6_0
	blk\..*\.ffn_(gate\|up)_shexp\.weight=iq5_ks

	# Routed Experts Layers [1-46]
	blk\..*\.ffn_down_exps\.weight=q5_0
	blk\..*\.ffn_(gate\|up)_exps\.weight=iq4_k

	# NextN MTP Layer [46]
	blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
	blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
	blk\..*\.nextn\.eh_proj\.weight=q8_0

	# Non-Repeating Layers
	token_embd\.weight=iq4_k
	output\.weight=iq6_k
	"

	custom=$(
	echo "$custom" \| grep -v '^#' \| \
	sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
	)

	numactl -N 1 -m 1 \
	./build/bin/llama-quantize \
	--custom-q "$custom" \
	--imatrix /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/imatrix-GLM-4.5-Air-BF16.dat \
	/mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-128x9.4B-BF16-00001-of-00005.gguf \
	/mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-IQ4_K.gguf \
	IQ4_K \
	192
	```

	</details>

	## IQ4_KSS 54.801 GiB (4.261 BPW)
	Final estimate: PPL = 4.7056 +/- 0.02909

	<details>

	<summary>👈 Secret Recipe</summary>

	```bash
	#!/usr/bin/env bash

	custom="
	# 47 Repeating Layers [0-46]
	# Note: All ffn_down.* layers are not divisible by 256 so have limited quantization options.

	# Attention
	blk\.(0\|1)\.attn_q.*=q8_0
	blk\.(0\|1)\.attn_k.*=q8_0
	blk\.(0\|1)\.attn_v.*=q8_0
	blk\.(0\|1)\.attn_output.*=q8_0

	blk\..\.attn_q.=iq5_ks
	blk\..\.attn_k.=iq5_ks
	blk\..\.attn_v.=iq5_ks
	blk\..\.attn_output.=iq5_ks

	# First 1 Dense Layers [0]
	blk\..*\.ffn_down\.weight=q6_0
	blk\..*\.ffn_(gate\|up)\.weight=iq5_ks

	# Shared Expert Layers [1-46]
	blk\..*\.ffn_down_shexp\.weight=q6_0
	blk\..*\.ffn_(gate\|up)_shexp\.weight=iq5_ks

	# Routed Experts Layers [1-46]
	#blk\.(1\|46)\.ffn_down_exps\.weight=q8_0
	#blk\.(1\|46)\.ffn_(gate\|up)_exps\.weight=q8_0

	blk\..*\.ffn_down_exps\.weight=iq4_nl
	blk\..*\.ffn_(gate\|up)_exps\.weight=iq4_kss

	# NextN MTP Layer [46]
	blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
	blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
	blk\..*\.nextn\.eh_proj\.weight=q8_0

	# Non-Repeating Layers
	token_embd\.weight=iq4_k
	output\.weight=iq6_k
	"

	custom=$(
	echo "$custom" \| grep -v '^#' \| \
	sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
	)

	numactl -N 0 -m 0 \
	./build/bin/llama-quantize \
	--custom-q "$custom" \
	--imatrix /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/imatrix-GLM-4.5-Air-BF16.dat \
	/mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-128x9.4B-BF16-00001-of-00005.gguf \
	/mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-IQ4_KSS.gguf \
	IQ4_KSS \
	192
	```

	</details>

	## IQ3_KS 49.072 GiB (3.816 BPW)
	Final estimate: PPL = 4.7975 +/- 0.02972

	<details>

	<summary>👈 Secret Recipe</summary>

	```bash
	#!/usr/bin/env bash

	custom="
	# 47 Repeating Layers [0-46]
	# Note: All ffn_down.* layers are not divisible by 256 so have limited quantization options.

	# Attention
	blk\.(0\|1)\.attn_q.*=q8_0
	blk\.(0\|1)\.attn_k.*=q8_0
	blk\.(0\|1)\.attn_v.*=q8_0
	blk\.(0\|1)\.attn_output.*=q8_0

	blk\..\.attn_q.=iq5_ks
	blk\..\.attn_k.=iq5_ks
	blk\..\.attn_v.=iq5_ks
	blk\..\.attn_output.=iq5_ks

	# First 1 Dense Layers [0]
	blk\..*\.ffn_down\.weight=q6_0
	blk\..*\.ffn_(gate\|up)\.weight=iq5_ks

	# Shared Expert Layers [1-46]
	blk\..*\.ffn_down_shexp\.weight=q6_0
	blk\..*\.ffn_(gate\|up)_shexp\.weight=iq5_ks

	# Routed Experts Layers [1-46]
	blk\.(1)\.ffn_down_exps\.weight=q6_0
	blk\.(1)\.ffn_(gate\|up)_exps\.weight=iq5_ks

	blk\..*\.ffn_down_exps\.weight=iq4_nl
	blk\..*\.ffn_(gate\|up)_exps\.weight=iq3_ks

	# Non-Repeating Layers
	token_embd\.weight=iq4_k
	output\.weight=iq6_k

	# NextN MTP Layer [46]
	blk\..*\.nextn\.embed_tokens\.weight=iq5_ks
	blk\..*\.nextn\.shared_head_head\.weight=iq5_ks
	blk\..*\.nextn\.eh_proj\.weight=q8_0
	"

	custom=$(
	echo "$custom" \| grep -v '^#' \| \
	sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
	)

	numactl -N 0 -m 0 \
	./build/bin/llama-quantize \
	--custom-q "$custom" \
	--imatrix /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/imatrix-GLM-4.5-Air-BF16.dat \
	/mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-128x9.4B-BF16-00001-of-00005.gguf \
	/mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-PR624-IQ3_KS.gguf \
	IQ3_KS \
	192
	```

	</details>


	## IQ2_KL 43.870 GiB (3.411 BPW)
	Final estimate: PPL = 5.0697 +/- 0.03166

	<details>

	<summary>👈 Secret Recipe</summary>

	```bash
	#!/usr/bin/env bash

	custom="
	# 47 Repeating Layers [0-46]
	# Note: All ffn_down.* layers are not divisible by 256 so have limited quantization options.

	# Attention
	blk\..\.attn_q.=iq4_ks
	blk\..\.attn_k.=iq5_ks
	blk\..\.attn_v.=iq5_ks
	blk\..\.attn_output.=iq4_ks

	# First 1 Dense Layers [0]
	blk\..*\.ffn_down\.weight=iq4_nl
	blk\..*\.ffn_(gate\|up)\.weight=iq4_kss

	# Shared Expert Layers [1-46]
	blk\..*\.ffn_down_shexp\.weight=iq4_nl
	blk\..*\.ffn_(gate\|up)_shexp\.weight=iq4_kss

	# Routed Experts Layers [1-46]
	blk\.(1)\.ffn_down_exps\.weight=iq4_nl
	blk\.(1)\.ffn_(gate\|up)_exps\.weight=iq4_kss

	blk\..*\.ffn_down_exps\.weight=iq4_nl
	blk\..*\.ffn_(gate\|up)_exps\.weight=iq2_kl

	# NextN MTP Layer [46]
	blk\..*\.nextn\.embed_tokens\.weight=iq4_ks
	blk\..*\.nextn\.shared_head_head\.weight=iq4_ks
	blk\..*\.nextn\.eh_proj\.weight=q6_0

	# Non-Repeating Layers
	token_embd\.weight=iq4_k
	output\.weight=iq6_k
	"

	custom=$(
	echo "$custom" \| grep -v '^#' \| \
	sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
	)

	numactl -N 0 -m 0 \
	./build/bin/llama-quantize \
	--custom-q "$custom" \
	--imatrix /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/imatrix-GLM-4.5-Air-BF16.dat \
	/mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-128x9.4B-BF16-00001-of-00005.gguf \
	/mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-IQ2_KL.gguf \
	IQ2_KL \
	192
	```

	</details>

	## IQ1_KT 36.039 GiB (2.802 BPW)
	Final estimate: PPL = 5.8214 +/- 0.03767

	<details>

	<summary>👈 Secret Recipe</summary>

	```bash
	#!/usr/bin/env bash

	custom="
	# 47 Repeating Layers [0-46]
	# Note: All ffn_down.* layers are not divisible by 256 so have limited quantization options.

	# Attention
	blk\..\.attn_q.=iq4_kt
	blk\..\.attn_k.=iq4_kt
	blk\..\.attn_v.=iq4_kt
	blk\..\.attn_output.=iq4_kt

	# First 1 Dense Layers [0]
	blk\..*\.ffn_down\.weight=iq4_nl
	blk\..*\.ffn_(gate\|up)\.weight=iq4_kt

	# Shared Expert Layers [1-46]
	blk\..*\.ffn_down_shexp\.weight=iq4_nl
	blk\..*\.ffn_(gate\|up)_shexp\.weight=iq4_kt

	# Routed Experts Layers [1-46]
	blk\..*\.ffn_down_exps\.weight=iq4_nl
	blk\..*\.ffn_(gate\|up)_exps\.weight=iq1_kt

	# NextN MTP Layer [46]
	blk\..*\.nextn\.embed_tokens\.weight=iq4_kt
	blk\..*\.nextn\.shared_head_head\.weight=iq4_kt
	blk\..*\.nextn\.eh_proj\.weight=q8_0

	# Non-Repeating Layers
	token_embd\.weight=iq4_k
	output\.weight=iq6_k
	"

	custom=$(
	echo "$custom" \| grep -v '^#' \| \
	sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
	)

	numactl -N 1 -m 1 \
	./build/bin/llama-quantize \
	--custom-q "$custom" \
	--imatrix /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/imatrix-GLM-4.5-Air-BF16.dat \
	/mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-128x9.4B-BF16-00001-of-00005.gguf \
	/mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-IQ1_KT.gguf \
	IQ1_KT \
	192
	```

	</details>


	## Quick Start
	If you want to disable thinking, add `/nothink` (correct, no underscore) at the end of your prompt.

	```bash
	# Clone and checkout
	$ git clone https://github.com/ikawrakow/ik_llama.cpp
	$ cd ik_llama.cpp

	# Build for hybrid CPU+CUDA
	$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
	$ cmake --build build --config Release -j $(nproc)

	# Run API server
	$ ./build/bin/llama-server \
	--model GLM-4.5-Air-IQ4_KSS-00001-of-00002.gguf \
	--alias ubergarm/GLM-4.5-Air-IQ4_KSS \
	--chat-template chatglm4 \
	--ctx-size 32768 \
	-fa -fmoe \
	-ctk q8_0 -ctv q8_0 \
	-ub 4096 -b 4096 \
	-ngl 99 \
	-ot exps=CPU \
	--parallel 1 \
	--threads 8 \
	--host 127.0.0.1 \
	--port 8080 \
	--no-mmap
	```

	## References
	* [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp)
	* [Getting Started Guide (already out of date lol)](https://github.com/ikawrakow/ik_llama.cpp/discussions/258)
	* [ubergarm-imatrix-calibration-corpus-v02.txt](https://gist.github.com/ubergarm/edfeb3ff9c6ec8b49e88cdf627b0711a?permalink_comment_id=5682584#gistcomment-5682584)
	* [Mainline llama.cpp Draft PR14939](https://github.com/ggml-org/llama.cpp/pull/14939)
	* [ik_llama.cpp GLM-4.5 MoE PR668](https://github.com/ikawrakow/ik_llama.cpp/pull/668)