ubergarm committed on
Commit 7cc02fa · Parent: 1283e26

Add IQ2_KL and graph

Files changed (3):
  1. .gitattributes +1 -0
  2. README.md +47 -0
  3. images/perplexity.png +3 -0
.gitattributes CHANGED
@@ -35,4 +35,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 imatrix-*.dat filter=lfs diff=lfs merge=lfs -text
 *.gguf filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
 imatrix-Qwen3-235B-A22B-Instruct-2507-BF16.dat filter=lfs diff=lfs merge=lfs -text
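
For reference, the added `*.png` rule is exactly the attribute line that `git lfs track` writes. A minimal sketch of the equivalent workflow (assuming Git LFS is installed and initialized for this repo; this is not a record of how the commit was actually made):

```bash
# Appends "*.png filter=lfs diff=lfs merge=lfs -text" to .gitattributes,
# matching the line added in this commit
git lfs track "*.png"

# Stage the updated attributes along with the new image
git add .gitattributes images/perplexity.png
```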
README.md CHANGED
@@ -28,6 +28,8 @@ Also thanks to all the folks in the quanting and inferencing community on [Beave
 ## Quant Collection
 Perplexity computed against *wiki.test.raw*. These first two are just test quants for baseline perplexity comparison:
 
+![Perplexity Chart](images/perplexity.png "Chart showing Perplexity improving as BPW increases.")
+
 * `bf16` 437.989 GiB (16.003 BPW)
   - Final estimate: PPL = 4.3079 +/- 0.02544
 * `Q8_0` 232.769 GiB (8.505 BPW)
@@ -169,6 +171,51 @@ numactl -N 1 -m 1 \
 
 </details>
 
+## `IQ2_KL` 81.866 GiB (2.991 BPW)
+Final estimate: PPL = 4.7912 +/- 0.02910
+
+<details>
+
+<summary>👈 Secret Recipe</summary>
+
+```bash
+#!/usr/bin/env bash
+
+# Repeating Layers [0-93]
+
+custom="
+# Attention
+blk\..*\.attn_q.*=iq6_k
+blk\..*\.attn_k.*=q8_0
+blk\..*\.attn_v.*=q8_0
+blk\..*\.attn_output.*=iq6_k
+
+# Routed Experts
+blk\..*\.ffn_down_exps\.weight=iq3_ks
+blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
+
+# Token Embedding
+token_embd\.weight=iq4_k
+output\.weight=iq6_k
+"
+
+custom=$(
+  echo "$custom" | grep -v '^#' | \
+  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+)
+
+numactl -N 0 -m 0 \
+./build/bin/llama-quantize \
+    --custom-q "$custom" \
+    --imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/imatrix-Qwen3-235B-A22B-Instruct-2507-BF16.dat \
+    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-BF16-00001-of-00010.gguf \
+    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-IQ2_KL.gguf \
+    IQ2_KL \
+    192
+```
+
+</details>
+
 ## Quick Start
 This example is for a single CUDA GPU hybrid inferencing with CPU/RAM. Check ik_llama.cpp discussions or my other quants for more examples, multi-GPU setups, etc.
 
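A note on the recipe in the diff above: `--custom-q` takes a single comma-separated list of `regex=quant` pairs, so the `grep -v '^#' | sed -Ez ...` pipeline strips the comment lines and joins the remaining rules with commas. A minimal sketch of that same pipeline, handy for previewing the collapsed string before committing to a long quantization run (the two example rules are copied from the recipe):

```bash
#!/usr/bin/env bash

# Two rules copied from the recipe above; the comment line gets stripped
custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
"

# Same transform as the recipe: drop comment lines, replace newline runs
# with commas, then trim the trailing and leading comma
echo "$custom" | grep -v '^#' | sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
# prints: blk\..*\.attn_q.*=iq6_k,blk\..*\.attn_k.*=q8_0
```

As a sanity check on the section header, 2.991 BPW × ~235B weights ÷ 8 ≈ 87.9 GB ≈ 81.9 GiB, consistent with the stated 81.866 GiB.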
images/perplexity.png ADDED

Git LFS Details

  • SHA256: 41ecc117d176373aaa9aee060bbbb9383cbec9361385dbf71934596bed6ed537
  • Pointer size: 131 Bytes
  • Size of remote file: 134 kB
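
For completeness, the `Final estimate: PPL = ...` lines quoted in the README come from a perplexity run against *wiki.test.raw*, as charted in the new image. A hedged sketch of such a run (assuming the `llama-perplexity` binary from the same ik_llama.cpp build tree and a local copy of wikitext-2's test split; flags and paths here are illustrative, not the author's exact invocation):

```bash
#!/usr/bin/env bash

# Compute perplexity of the IQ2_KL quant over wiki.test.raw;
# the run ends with a line like "Final estimate: PPL = 4.7912 +/- 0.02910"
./build/bin/llama-perplexity \
    -m /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-IQ2_KL.gguf \
    -f wiki.test.raw \
    --threads 24
```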