Updating imatrix and IQ4_KSS
Summary of the `README.md` changes in this commit:

- The old intro notes ("Check the References below for the github discussion as folks are working on adding support for this model." and "Keep an eye out for the new PR and follow along; once this is tested and considered working correctly I hope to release some quants for both this smaller Air model and the larger one too.") are replaced by a single note pointing at the ik_llama.cpp support PR.
- The `BF16`, `Q8_0`, and `IQ4_KSS` entries now list file sizes in GiB and bits per weight (BPW).
- The long per-tensor listing of the source GGUF (lines like `# 620756992 | 4096, 151552, 1, 1 | Q8_0 | token_embd.weight`, covering token_embd, the blk.0/blk.1 attention, dense FFN, shared-expert and routed-expert tensors, and output.weight) is removed from the recipe script, along with the commented-out `#blk\.(3|92)\.ffn_down_exps\.weight=q8_0` and `#blk\.(3|92)\.ffn_(gate|up)_exps\.weight=q8_0` overrides. If you ever need that listing again, see the sketch after this list for one way to regenerate it.
- The recipe gains explicit `q8_0` overrides for the blk 0/1 attention tensors, a note that the `ffn_down.*` row sizes are not divisible by 256, and completed section comments for the shared and routed expert layers ([1-46]).
- The `numactl` invocation, the input BF16 GGUF path, the `git checkout` branch, and the llama-server `--model` argument are updated, and the `--chat-template chatglm4` and `--override-kv tokenizer.ggml.eot_token_id=int:151336` flags are dropped from the llama-server example.
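The removed tensor listing is easy to regenerate if you want it back. A minimal sketch, assuming the `gguf` Python package is installed (`pip install gguf`) and reusing the BF16 source path from the quantize command further down; neither the tool choice nor the path is prescribed by this commit:

```bash
# Assumption: gguf-dump comes from the `gguf` Python package (pip install gguf).
# Dumps GGUF metadata plus the per-tensor table (name, shape, quant type);
# redirect to a file if you want to keep it in your notes.
gguf-dump /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-128x9.4B-BF16-00001-of-00005.gguf \
  > GLM-4.5-Air-BF16-tensors.txt
```

The updated sections of the model card follow.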
tags:
- ik_llama.cpp
---

*Note*: GLM-4.5-Air support in `ik_llama.cpp` is still a work in progress and not yet in the main branch. Until the PR lands, follow the instructions here and keep an eye on it: https://github.com/ikawrakow/ik_llama.cpp/pull/668

## `ik_llama.cpp` imatrix Quantizations of zai-org/GLM-4.5-Air
This quant collection **REQUIRES** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support ik's latest SOTA quants and optimizations! Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!
Perplexity computed against *wiki.test.raw*.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6404ec15...6946.png)

These first two are just test quants for baseline perplexity comparison:

* `BF16` 205.811 GiB (16.004 BPW)
  - Final estimate: PPL = TODO
* `Q8_0` 109.381 GiB (8.505 BPW)
  - Final estimate: PPL = TODO
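The PPL entries above are still TODO. For readers who want to reproduce them, here is a minimal sketch of a run with the fork's `llama-perplexity` tool; the model path (reusing the BF16 source from the quantize command below) and the local `wiki.test.raw` path are assumptions, not part of this commit:

```bash
# Hedged example: score a quant against wiki.test.raw with ik_llama.cpp's llama-perplexity.
# Swap in the Q8_0 or IQ4_KSS file to fill out the comparison above.
./build/bin/llama-perplexity \
    --model /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-128x9.4B-BF16-00001-of-00005.gguf \
    -f wiki.test.raw \
    --ctx-size 512 \
    -fa \
    -ngl 99 \
    -ot exps=CPU
```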
## IQ4_KSS 54.801 GiB (4.261 BPW)

<details>

```bash
#!/usr/bin/env bash

custom="
# 47 Repeating Layers [0-46]
# Note: all ffn_down.* tensors have row sizes that are not divisible by 256
# (10944 dense, 1408 shexp/exps), so they have limited quantization options.

# Attention
blk\.(0|1)\.attn_q.*=q8_0
blk\.(0|1)\.attn_k.*=q8_0
blk\.(0|1)\.attn_v.*=q8_0
blk\.(0|1)\.attn_output.*=q8_0

blk\..*\.attn_q.*=iq5_ks
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_output.*=iq5_ks

# Dense FFN Layers [0]
blk\..*\.ffn_down\.weight=q6_0
blk\..*\.ffn_(gate|up)\.weight=iq5_ks

# Shared Expert Layers [1-46]
blk\..*\.ffn_down_shexp\.weight=q6_0
blk\..*\.ffn_(gate|up)_shexp\.weight=iq5_ks

# Routed Experts Layers [1-46]
blk\..*\.ffn_down_exps\.weight=iq4_nl
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
"

# Strip the comment lines and join the remaining rules with commas for --custom-q.
custom=$(
  echo "$custom" | grep -v '^#' |
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/imatrix-GLM-4.5-Air-BF16.dat \
    /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-128x9.4B-BF16-00001-of-00005.gguf \
    /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-IQ4_KSS.gguf \
    IQ4_KSS \
    192
```

</details>
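Not part of the card itself, but if you want to sanity-check the recipe before kicking off a long quantization run, a small hedged addition after the `custom=$(...)` assignment above will show exactly what gets passed to `--custom-q`:

```bash
# Preview the collapsed rule string llama-quantize will receive, e.g.
#   blk\.(0|1)\.attn_q.*=q8_0,blk\.(0|1)\.attn_k.*=q8_0,...,blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
echo "--custom-q => $custom"

# Count how many rules survived the comment stripping.
echo "$custom" | tr ',' '\n' | wc -l
```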
```bash
# Clone ik_llama.cpp and check out the experimental GLM-4.5 branch
$ git clone https://github.com/ikawrakow/ik_llama.cpp
$ cd ik_llama.cpp
$ git remote add Thireus https://github.com/Thireus/ik_llama.cpp.git
$ git fetch Thireus
$ git checkout glm-4.5-testing
# If glm-4.5-clean is ready, use it instead of -testing
# $ git checkout glm-4.5-clean

# Build for hybrid CPU+CUDA
$ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
$ cmake --build build --config Release -j $(nproc)

# Test Experimental GGUF
$ ./build/bin/llama-server \
    --model GLM-4.5-Air-IQ4_KSS-00001-of-00002.gguf \
    --alias ubergarm/GLM-4.5-Air-IQ4_KSS \
    --ctx-size 32768 \
    -fa -fmoe \
    -ctk q8_0 -ctv q8_0 \
    -ub 4096 -b 4096 \
    -ngl 99 \
    -ot exps=CPU \
```
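Once the server is up, a quick way to confirm the model is answering. This is not from the card; it is a minimal sketch assuming llama-server's default 127.0.0.1:8080 bind (the `--host`/`--port` flags are cut off in this diff) and its OpenAI-compatible chat completions route:

```bash
# Smoke test against llama-server's OpenAI-compatible API (default host/port assumed).
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "ubergarm/GLM-4.5-Air-IQ4_KSS",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
      }'
```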