Updating imatrix and IQ4_KSS
Summary of the `README.md` changes in this commit:

- The old intro notes ("Check the References below for the github discussion as folks are working on adding support for this model." and "Keep an eye out for the new PR and follow along; once this is tested and considered working correctly I hope to release some quants for both this smaller Air model and the larger one too.") are replaced by a single note pointing at the ik_llama.cpp support PR.
- The `BF16`, `Q8_0`, and `IQ4_KSS` entries now list file sizes in GiB and bits per weight (BPW).
- The long per-tensor listing of the source GGUF (lines like `# 620756992 | 4096, 151552, 1, 1 | Q8_0 | token_embd.weight`, covering token_embd, the blk.0/blk.1 attention, dense FFN, shared-expert and routed-expert tensors, and output.weight) is removed from the recipe script, along with the commented-out `#blk\.(3|92)\.ffn_down_exps\.weight=q8_0` and `#blk\.(3|92)\.ffn_(gate|up)_exps\.weight=q8_0` overrides. If you ever need that listing again, see the sketch after this list for one way to regenerate it.
- The recipe gains explicit `q8_0` overrides for the blk 0/1 attention tensors, a note that the `ffn_down.*` row sizes are not divisible by 256, and completed section comments for the shared and routed expert layers ([1-46]).
- The `numactl` invocation, the input BF16 GGUF path, the `git checkout` branch, and the llama-server `--model` argument are updated, and the `--chat-template chatglm4` and `--override-kv tokenizer.ggml.eot_token_id=int:151336` flags are dropped from the llama-server example.
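The removed tensor listing is easy to regenerate if you want it back. A minimal sketch, assuming the `gguf` Python package is installed (`pip install gguf`) and reusing the BF16 source path from the quantize command further down; neither the tool choice nor the path is prescribed by this commit:

```bash
# Assumption: gguf-dump comes from the `gguf` Python package (pip install gguf).
# Dumps GGUF metadata plus the per-tensor table (name, shape, quant type);
# redirect to a file if you want to keep it in your notes.
gguf-dump /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-128x9.4B-BF16-00001-of-00005.gguf \
  > GLM-4.5-Air-BF16-tensors.txt
```

The updated sections of the model card follow.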
tags:
- ik_llama.cpp
---

*Note*: GLM-4.5-Air support in `ik_llama.cpp` is still a work in progress and not yet in the main branch. Until the PR lands, follow the instructions here and keep an eye on it: https://github.com/ikawrakow/ik_llama.cpp/pull/668

## `ik_llama.cpp` imatrix Quantizations of zai-org/GLM-4.5-Air
This quant collection **REQUIRES** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support ik's latest SOTA quants and optimizations! Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!
Perplexity computed against *wiki.test.raw*.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6404ec15...6946.png)

These first two are just test quants for baseline perplexity comparison:

* `BF16` 205.811 GiB (16.004 BPW)
  - Final estimate: PPL = TODO
* `Q8_0` 109.381 GiB (8.505 BPW)
  - Final estimate: PPL = TODO
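The PPL entries above are still TODO. For readers who want to reproduce them, here is a minimal sketch of a run with the fork's `llama-perplexity` tool; the model path (reusing the BF16 source from the quantize command below) and the local `wiki.test.raw` path are assumptions, not part of this commit:

```bash
# Hedged example: score a quant against wiki.test.raw with ik_llama.cpp's llama-perplexity.
# Swap in the Q8_0 or IQ4_KSS file to fill out the comparison above.
./build/bin/llama-perplexity \
    --model /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-128x9.4B-BF16-00001-of-00005.gguf \
    -f wiki.test.raw \
    --ctx-size 512 \
    -fa \
    -ngl 99 \
    -ot exps=CPU
```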
## IQ4_KSS 54.801 GiB (4.261 BPW)

<details>

```bash
#!/usr/bin/env bash

custom="
# 47 Repeating Layers [0-46]
# Note: all ffn_down.* tensors have row sizes that are not divisible by 256
# (10944 dense, 1408 shexp/exps), so they have limited quantization options.

# Attention
blk\.(0|1)\.attn_q.*=q8_0
blk\.(0|1)\.attn_k.*=q8_0
blk\.(0|1)\.attn_v.*=q8_0
blk\.(0|1)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_ks
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_output.*=iq5_ks

# Dense FFN Layers [0]
blk\..*\.ffn_down\.weight=q6_0
blk\..*\.ffn_(gate|up)\.weight=iq5_ks

# Shared Expert Layers [1-46]
blk\..*\.ffn_down_shexp\.weight=q6_0
blk\..*\.ffn_(gate|up)_shexp\.weight=iq5_ks

# Routed Experts Layers [1-46]
blk\..*\.ffn_down_exps\.weight=iq4_nl
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
"

# Strip the comment lines and join the remaining rules with commas for --custom-q.
custom=$(
  echo "$custom" | grep -v '^#' |
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/imatrix-GLM-4.5-Air-BF16.dat \
    /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-128x9.4B-BF16-00001-of-00005.gguf \
    /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-IQ4_KSS.gguf \
    IQ4_KSS \
    192
```

</details>
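Not part of the card itself, but if you want to sanity-check the recipe before kicking off a long quantization run, a small hedged addition after the `custom=$(...)` assignment above will show exactly what gets passed to `--custom-q`:

```bash
# Preview the collapsed rule string llama-quantize will receive, e.g.
#   blk\.(0|1)\.attn_q.*=q8_0,blk\.(0|1)\.attn_k.*=q8_0,...,blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
echo "--custom-q => $custom"

# Count how many rules survived the comment stripping.
echo "$custom" | tr ',' '\n' | wc -l
```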
```bash
# Clone ik_llama.cpp and check out the experimental GLM-4.5 branch
$ git clone https://github.com/ikawrakow/ik_llama.cpp
$ cd ik_llama.cpp
$ git remote add Thireus https://github.com/Thireus/ik_llama.cpp.git
$ git fetch Thireus
$ git checkout glm-4.5-testing
# If glm-4.5-clean is ready, use it instead of -testing
# $ git checkout glm-4.5-clean

# Build for hybrid CPU+CUDA
$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
$ cmake --build build --config Release -j $(nproc)

# Test Experimental GGUF
$ ./build/bin/llama-server \
    --model GLM-4.5-Air-IQ4_KSS-00001-of-00002.gguf \
    --alias ubergarm/GLM-4.5-Air-IQ4_KSS \
    --ctx-size 32768 \
    -fa -fmoe \
    -ctk q8_0 -ctv q8_0 \
    -ub 4096 -b 4096 \
    -ngl 99 \
    -ot exps=CPU \
```
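Once the server is up, a quick way to confirm the model is answering. This is not from the card; it is a minimal sketch assuming llama-server's default 127.0.0.1:8080 bind (the `--host`/`--port` flags are cut off in this diff) and its OpenAI-compatible chat completions route:

```bash
# Smoke test against llama-server's OpenAI-compatible API (default host/port assumed).
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ubergarm/GLM-4.5-Air-IQ4_KSS",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
      }'
```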