Add IQ2_KL and graph
- .gitattributes +1 -0
- README.md +47 -0
- images/perplexity.png +3 -0
.gitattributes
CHANGED
```diff
@@ -35,4 +35,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
 imatrix-*.dat filter=lfs diff=lfs merge=lfs -text
 *.gguf filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
 imatrix-Qwen3-235B-A22B-Instruct-2507-BF16.dat filter=lfs diff=lfs merge=lfs -text
```
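The new `*.png` rule routes the added chart through Git LFS just like the model and imatrix files. For reference, hand-editing `.gitattributes` as above is equivalent to letting the Git LFS CLI append the same line (a minimal sketch; assumes `git-lfs` is installed and initialized in the repo):

```bash
# Appends "*.png filter=lfs diff=lfs merge=lfs -text" to .gitattributes
git lfs track "*.png"

# Stage the updated attributes file along with the new LFS-tracked image
git add .gitattributes images/perplexity.png
```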
README.md
CHANGED
```diff
@@ -28,6 +28,8 @@ Also thanks to all the folks in the quanting and inferencing community on [Beave
 ## Quant Collection
 Perplexity computed against *wiki.test.raw*. These first two are just test quants for baseline perplexity comparison:
 
+![](images/perplexity.png)
+
 * `bf16` 437.989 GiB (16.003 BPW)
   - Final estimate: PPL = 4.3079 +/- 0.02544
 * `Q8_0` 232.769 GiB (8.505 BPW)
@@ -169,6 +171,51 @@ numactl -N 1 -m 1 \
 
 </details>
 
+## `IQ2_KL` 81.866 GiB (2.991 BPW)
+Final estimate: PPL = 4.7912 +/- 0.02910
+
+<details>
+
+<summary>👈 Secret Recipe</summary>
+
+```bash
+#!/usr/bin/env bash
+
+# Repeating Layers [0-93]
+
+custom="
+# Attention
+blk\..*\.attn_q.*=iq6_k
+blk\..*\.attn_k.*=q8_0
+blk\..*\.attn_v.*=q8_0
+blk\..*\.attn_output.*=iq6_k
+
+# Routed Experts
+blk\..*\.ffn_down_exps\.weight=iq3_ks
+blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl
+
+# Token Embedding
+token_embd\.weight=iq4_k
+output\.weight=iq6_k
+"
+
+custom=$(
+  echo "$custom" | grep -v '^#' | \
+  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+)
+
+numactl -N 0 -m 0 \
+./build/bin/llama-quantize \
+    --custom-q "$custom" \
+    --imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/imatrix-Qwen3-235B-A22B-Instruct-2507-BF16.dat \
+    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-BF16-00001-of-00010.gguf \
+    /mnt/raid/models/ubergarm/Qwen3-235B-A22B-Instruct-2507-GGUF/Qwen3-235B-A22B-Instruct-2507-IQ2_KL.gguf \
+    IQ2_KL \
+    192
+```
+
+</details>
+
 ## Quick Start
 This example is for a single CUDA GPU hybrid inferencing with CPU/RAM. Check ik_llama.cpp discussions or my other quants for more examples for multi-GPU etc.
 
```
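A note on the `--custom-q` plumbing in the recipe above: the overrides are written as one `regex=type` rule per line for readability, but the flag consumes a single comma-separated string, so the `grep`/`sed` pipeline strips the `#` comment lines and joins what remains with commas. A minimal sketch of that transform, with two rules excerpted from the recipe (`sed -Ez` is GNU-specific):

```bash
#!/usr/bin/env bash
# Two rules excerpted from the recipe above
custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
"

# grep drops the comment lines; sed -Ez reads the stream as one record,
# collapses runs of newlines into commas, then trims the stray
# trailing/leading commas left at the string boundaries
echo "$custom" | grep -v '^#' | sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
# prints: blk\..*\.attn_q.*=iq6_k,blk\..*\.attn_k.*=q8_0
```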
images/perplexity.png
ADDED
(binary image stored via Git LFS: the perplexity chart embedded at the top of the README's Quant Collection section)
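For context on the trade-off the chart summarizes, the numbers added in this commit work out as follows: `IQ2_KL` is 437.989 / 81.866 ≈ 5.35× smaller than the `bf16` baseline, while perplexity rises from 4.3079 to 4.7912, an increase of (4.7912 − 4.3079) / 4.3079 ≈ 11.2%.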