ubergarm committed
Commit 874fab4 · 1 Parent(s): c98da53

Updating imatrix and IQ4_KSS

Files changed (1)
  1. README.md +17 -58
README.md CHANGED
@@ -10,11 +10,7 @@ tags:
   - ik_llama.cpp
 ---
 
- This is an experimental place-holder with an imatrix not for general purpose use just yet. I'm not releasing any quants for this just yet until the various PRs are in place and tested better.
- 
- Check the References below for the the github discussion as folks are working on adding support for this model.
- 
- Keep an eye out for new PR and follow along, once this thing is tested and considered working correctly I hope to release some quants for both this smaller Air model and the larger one too..
+ *Note*: The ik_llama.cpp PR adding support for this model to the main branch is still in progress. Until then, follow the instructions here and keep an eye on the PR: https://github.com/ikawrakow/ik_llama.cpp/pull/668
 
 ## `ik_llama.cpp` imatrix Quantizations of zai-org/GLM-4.5-Air
 This quant collection **REQUIRES** [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support ik's latest SOTA quants and optimizations! Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!
@@ -36,12 +32,12 @@ Perplexity computed against *wiki.test.raw*.
 ![Perplexity Chart](images/perplexity.png "Chart showing Perplexity improving as BPW increases.")
 
 These first two are just test quants for baseline perplexity comparison:
- * `BF16` 203.436 GiB (16.004 BPW)
+ * `BF16` 205.811 GiB (16.004 BPW)
   - Final estimate: PPL = TODO
- * `Q8_0` 108.119 GiB (8.505 BPW)
+ * `Q8_0` 109.381 GiB (8.505 BPW)
   - Final estimate: PPL = TODO
 
- ## IQ4_KSS 54.124 GiB (4.258 BPW)
+ ## IQ4_KSS 54.801 GiB (4.261 BPW)
 
 <details>
 
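For context on how the `PPL = TODO` figures above would typically be filled in, this is the usual shape of a *wiki.test.raw* perplexity run with the fork's `llama-perplexity` tool; the flags and offload choices here are illustrative assumptions, not taken from this commit:

```bash
# Hypothetical perplexity run (illustrative only; not part of this commit).
./build/bin/llama-perplexity \
    --model /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-IQ4_KSS.gguf \
    -f wiki.test.raw \
    -fa \
    -ngl 99 \
    -ot exps=CPU
```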
@@ -50,49 +46,15 @@ These first two are just test quants for baseline perplexity comparison:
 ```bash
 #!/usr/bin/env bash
 
- # 620756992 | 4096, 151552, 1, 1 | Q8_0 | token_embd.weight
- #
- # 44826624 | 10944, 4096, 1, 1 | Q8_0 | blk.0.ffn_down.weight
- # 44826624 | 4096, 10944, 1, 1 | Q8_0 | blk.0.ffn_gate.weight
- # 44826624 | 4096, 10944, 1, 1 | Q8_0 | blk.0.ffn_up.weight
- # 4096 | 4096, 1, 1, 1 | F32 | blk.0.attn_norm.weight
- # 4096 | 4096, 1, 1, 1 | F32 | blk.0.ffn_norm.weight
- # 1024 | 1024, 1, 1, 1 | F32 | blk.0.attn_k.bias
- # 4194304 | 4096, 1024, 1, 1 | Q8_0 | blk.0.attn_k.weight
- # 50331648 | 12288, 4096, 1, 1 | Q8_0 | blk.0.attn_output.weight
- # 4194304 | 4096, 1024, 1, 1 | Q8_0 | blk.0.attn_v.weight
- # 50331648 | 4096, 12288, 1, 1 | Q8_0 | blk.0.attn_q.weight
- # 12288 | 12288, 1, 1, 1 | F32 | blk.0.attn_q.bias
- # 1024 | 1024, 1, 1, 1 | F32 | blk.0.attn_v.bias
- #
- # 738197504 | 1408, 4096, 128, 1 | Q8_0 | blk.1.ffn_down_exps.weight
- # 738197504 | 4096, 1408, 128, 1 | Q8_0 | blk.1.ffn_gate_exps.weight
- # 738197504 | 4096, 1408, 128, 1 | Q8_0 | blk.1.ffn_up_exps.weight
- # 4096 | 4096, 1, 1, 1 | F32 | blk.1.attn_norm.weight
- # 128 | 128, 1, 1, 1 | F32 | blk.1.ffn_gate_inp.bias
- # 524288 | 4096, 128, 1, 1 | F32 | blk.1.ffn_gate_inp.weight
- # 5767168 | 1408, 4096, 1, 1 | Q8_0 | blk.1.ffn_down_shexp.weight
- # 5767168 | 4096, 1408, 1, 1 | Q8_0 | blk.1.ffn_gate_shexp.weight
- # 5767168 | 4096, 1408, 1, 1 | Q8_0 | blk.1.ffn_up_shexp.weight
- # 4194304 | 4096, 1024, 1, 1 | Q8_0 | blk.1.attn_k.weight
- # 50331648 | 12288, 4096, 1, 1 | Q8_0 | blk.1.attn_output.weight
- # 50331648 | 4096, 12288, 1, 1 | Q8_0 | blk.1.attn_q.weight
- # 4194304 | 4096, 1024, 1, 1 | Q8_0 | blk.1.attn_v.weight
- # 4096 | 4096, 1, 1, 1 | F32 | blk.1.ffn_norm.weight
- # 1024 | 1024, 1, 1, 1 | F32 | blk.1.attn_k.bias
- # 12288 | 12288, 1, 1, 1 | F32 | blk.1.attn_q.bias
- # 1024 | 1024, 1, 1, 1 | F32 | blk.1.attn_v.bias
- 
- # 620756992 | 4096, 151552, 1, 1 | Q8_0 | output.weight
- 
 custom="
 # 47 Repeating Layers [0-46]
+ # Note: All ffn_down.* layers are not divisible by 256 so have limited quantization options.
 
 # Attention
- #blk\.(0)\.attn_q.*=q8_0
- #blk\.(0)\.attn_k.*=q8_0
- #blk\.(0)\.attn_v.*=q8_0
- #blk\.(0)\.attn_output.*=q8_0
+ blk\.(0|1)\.attn_q.*=q8_0
+ blk\.(0|1)\.attn_k.*=q8_0
+ blk\.(0|1)\.attn_v.*=q8_0
+ blk\.(0|1)\.attn_output.*=q8_0
 
 blk\..*\.attn_q.*=iq5_ks
 blk\..*\.attn_k.*=iq5_ks
@@ -103,14 +65,11 @@ blk\..*\.attn_output.*=iq5_ks
 blk\..*\.ffn_down\.weight=q6_0
 blk\..*\.ffn_(gate|up)\.weight=iq5_ks
 
- # Shared Expert Layers [2-46]
+ # Shared Expert Layers [1-46]
 blk\..*\.ffn_down_shexp\.weight=q6_0
 blk\..*\.ffn_(gate|up)_shexp\.weight=iq5_ks
 
- # Routed Experts Layers [2-46]
- #blk\.(3|92)\.ffn_down_exps\.weight=q8_0
- #blk\.(3|92)\.ffn_(gate|up)_exps\.weight=q8_0
- 
+ # Routed Experts Layers [1-46]
 blk\..*\.ffn_down_exps\.weight=iq4_nl
 blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kss
 
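To make the divisibility note above concrete: the tensor shapes listed in the removed comments give ffn_down row sizes of 10944 (dense layer) and 1408 (shared and routed experts), neither of which is a multiple of 256, so quant types built on 256-element super-blocks cannot be applied there, while 32-element-block types such as q6_0 and iq4_nl still fit. A quick illustrative check (not part of the recipe):

```bash
# Illustrative arithmetic only: ffn_down row sizes vs. quantization block sizes.
echo $(( 10944 % 256 )) $(( 1408 % 256 ))   # -> 192 128 (not multiples of 256)
echo $(( 10944 % 32 ))  $(( 1408 % 32 ))    # -> 0 0 (fits 32-element blocks like q6_0 / iq4_nl)
```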
@@ -124,11 +83,11 @@ custom=$(
     sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
 )
 
- numactl -N 1 -m 1 \
+ numactl -N 0 -m 0 \
 ./build/bin/llama-quantize \
     --custom-q "$custom" \
     --imatrix /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/imatrix-GLM-4.5-Air-BF16.dat \
-     /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-128x8.1B-BF16-00001-of-00005.gguf \
+     /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-128x9.4B-BF16-00001-of-00005.gguf \
     /mnt/raid/models/ubergarm/GLM-4.5-Air-GGUF/GLM-4.5-Air-IQ4_KSS.gguf \
     IQ4_KSS \
     192
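For orientation, the positional arguments after `llama-quantize`'s options are roughly the input GGUF, the output GGUF, the target quant type, and an optional thread count, so the trailing `192` above is the number of quantization threads (consult `--help` in your build for the exact option list):

```bash
# Rough usage shape, for reference only:
#   llama-quantize [--custom-q ...] [--imatrix file.dat] <input.gguf> <output.gguf> <TYPE> [nthreads]
```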
@@ -143,7 +102,9 @@ $ git clone https://github.com/ikawrakow/ik_llama.cpp
 $ cd ik_llama.cpp
 $ git remote add Thireus https://github.com/Thireus/ik_llama.cpp.git
 $ git fetch Thireus
- $ git checkout glm-4.5-clean
+ $ git checkout glm-4.5-testing
+ # If glm-4.5-clean is ready, use it instead of -testing
+ # $ git checkout glm-4.5-clean
 
 # Build for hybrid CPU+CUDA
 $ cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
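An optional sanity check (not part of the original instructions) can confirm the checkout above actually landed on the Thireus GLM-4.5 branch before building:

```bash
# Optional sanity check; not from the original README.
$ git status -sb        # should report glm-4.5-testing (or glm-4.5-clean)
$ git log --oneline -1  # tip commit of the checked-out support branch
```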
@@ -151,13 +112,11 @@ $ cmake --build build --config Release -j $(nproc)
 
 # Test Experimental GGUF
 $ ./build/bin/llama-server \
-     --model WARNING-EXPERIMENTAL-IKLLAMACPP-ONLY-GLM-4.5-Air-IQ4_KSS-00001-of-00002.gguf \
+     --model GLM-4.5-Air-IQ4_KSS-00001-of-00002.gguf \
     --alias ubergarm/GLM-4.5-Air-IQ4_KSS \
     --ctx-size 32768 \
     -fa -fmoe \
     -ctk q8_0 -ctv q8_0 \
-     --chat-template chatglm4 \
-     --override-kv tokenizer.ggml.eot_token_id=int:151336 \
     -ub 4096 -b 4096 \
     -ngl 99 \
     -ot exps=CPU \