upload perplexity graph
- README.md +11 -115
- images/perplexity.png +3 -0
README.md
CHANGED
@@ -33,14 +33,14 @@ Perplexity computed against *wiki.test.raw*.
 
 These first three are just test quants for baseline perplexity comparison:
 * `bf16` 56.894 GiB (16.007 BPW)
-  - Final estimate: PPL =
+  - Final estimate: PPL = 9.5334 +/- 0.07560
 * `Q8_0` 30.247 GiB (8.510 BPW)
-  - Final estimate: PPL =
+  - Final estimate: PPL = 9.5317 +/- 0.07551 (*NOTE* lower than BF16 but didn't use it for "baseline"...)
 * `Q4_0` 16.111 GiB (4.533 BPW)
-  - Final estimate: PPL =
+  - Final estimate: PPL = 9.7225 +/- 0.07712
 
 ## `IQ5_K` 21.324 GiB (5.999 BPW)
-Final estimate: PPL =
+Final estimate: PPL = 9.5930 +/- 0.07614
 
 <details>
 
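The `Final estimate: PPL = ...` figures above are the closing line that `llama-perplexity` prints. A minimal sketch of such a run against *wiki.test.raw* (the model path here is a placeholder; the exact command behind these numbers is not shown in this commit):

```bash
# Hypothetical run of the perplexity tool; the last line of its output is the
# "Final estimate: PPL = ..." value quoted in the README.
./build/bin/llama-perplexity \
    --model Qwen3-Coder-30B-A3B-Instruct-IQ5_K.gguf \
    --file wiki.test.raw
```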
@@ -92,7 +92,7 @@ custom=$(
 </details>
 
 ## `IQ4_K` 17.878 GiB (5.030 BPW)
-Final estimate: PPL =
+Final estimate: PPL = 9.6023 +/- 0.07613
 
 <details>
 
@@ -144,7 +144,7 @@ custom=$(
 </details>
 
 ## `IQ4_KSS` 15.531 GiB (4.370 BPW)
-Final estimate: PPL =
+Final estimate: PPL = 9.6441 +/- 0.07648
 
 <details>
 
@@ -195,106 +195,8 @@ custom=$(
 
 </details>
 
-## `IQ4_KT` 14.438 GiB (4.062 BPW)
-Final estimate: PPL = TODO
-
-Mostly pure IQ4_KT meant for full GPU offload, similar to [turboderp-org/exllamav3](https://github.com/turboderp-org/exllamav3); [check out ArtusDev's HuggingFace page](https://huggingface.co/ArtusDev) for some excellent EXL3 quants!
-
-<details>
-
-<summary>👈 Secret Recipe</summary>
-
-```bash
-#!/usr/bin/env bash
-
-custom="
-# 48 Repeating Layers [0-47]
-
-# Attention
-blk\..*\.attn_q.*=iq4_kt
-blk\..*\.attn_k.*=iq4_kt
-blk\..*\.attn_v.*=iq4_kt
-blk\..*\.attn_output.*=iq4_kt
-
-# Routed Experts
-blk\..*\.ffn_down_exps\.weight=iq4_kt
-blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kt
-
-# Non-Repeating Layers
-token_embd\.weight=iq4_kt
-output\.weight=iq6_k
-"
-
-custom=$(
-  echo "$custom" | grep -v '^#' | \
-  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
-)
-
-./build/bin/llama-quantize \
-    --custom-q "$custom" \
-    --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/imatrix-Qwen3-Coder-30B-A3B-Instruct-BF16.dat \
-    /mnt/raid/models/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
-    /mnt/raid/models/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-IQ4_KT.gguf \
-    IQ4_KT \
-    192
-```
-
-</details>
-
-## `IQ3_K` 14.509 GiB (4.082 BPW)
-Final estimate: PPL = TODO
-
-<details>
-
-<summary>👈 Secret Recipe</summary>
-
-```bash
-#!/usr/bin/env bash
-
-custom="
-# 48 Repeating Layers [0-47]
-
-# Attention
-blk\.(0)\.attn_q.*=q8_0
-blk\.(0)\.attn_k.*=q8_0
-blk\.(0)\.attn_v.*=q8_0
-blk\.(0)\.attn_output.*=q8_0
-
-blk\..*\.attn_q.*=iq5_k
-blk\..*\.attn_k.*=iq6_k
-blk\..*\.attn_v.*=iq6_k
-blk\..*\.attn_output.*=iq5_k
-
-# Routed Experts
-blk\.(0|47)\.ffn_down_exps\.weight=q8_0
-blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
-
-blk\..*\.ffn_down_exps\.weight=iq4_k
-blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k
-
-# Non-Repeating Layers
-token_embd\.weight=iq4_k
-output\.weight=iq6_k
-"
-
-custom=$(
-  echo "$custom" | grep -v '^#' | \
-  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
-)
-
-./build/bin/llama-quantize \
-    --custom-q "$custom" \
-    --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/imatrix-Qwen3-Coder-30B-A3B-Instruct-BF16.dat \
-    /mnt/raid/models/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
-    /mnt/raid/models/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-IQ3_K.gguf \
-    IQ3_K \
-    192
-```
-
-</details>
-
 ## `IQ3_KS` 13.633 GiB (3.836 BPW)
-Final estimate: PPL =
+Final estimate: PPL = 9.7940 +/- 0.07795
 
 <details>
 
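Within the removed recipes above, the `custom=$( ... )` step simply collapses the multi-line regex-to-quant map into the single comma-separated string that `--custom-q` expects: `grep -v '^#'` drops the comment lines, and the `sed` call turns runs of newlines into commas and trims the ends. A tiny standalone sketch of that transformation, reusing two rules from the recipes (the surrounding script is illustrative, not part of the README):

```bash
#!/usr/bin/env bash
# Illustrative only: show what the grep/sed pipeline in the recipes produces.
custom="
# comment lines are stripped
blk\..*\.attn_q.*=iq4_kt
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

echo "$custom"
# prints: blk\..*\.attn_q.*=iq4_kt,output\.weight=iq6_k
```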
@@ -346,7 +248,7 @@ custom=$(
 </details>
 
 ## `IQ2_KL` 11.516 GiB (3.240 BPW)
-Final estimate: PPL =
+Final estimate: PPL = 10.0475 +/- 0.08016
 
 <details>
 
@@ -398,7 +300,7 @@ custom=$(
 </details>
 
 ## `IQ2_KT` 9.469 GiB (2.664 BPW)
-Final estimate: PPL =
+Final estimate: PPL = 10.1352 +/- 0.08007
 
 <details>
 
@@ -449,7 +351,7 @@ custom=$(
 </summary>
 
 ## `IQ1_KT` 7.583 GiB (2.133 BPW)
-Final estimate: PPL =
+Final estimate: PPL = 11.0592 +/- 0.08760
 
 <details>
 
@@ -500,18 +402,12 @@ custom=$(
 </details>
 
 ## Quick Start
-#### Full GPU Offload with CUDA
+#### Full GPU Offload with CUDA
 ```bash
 # Compile CUDA backend
 cmake -B ./build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_F16=ON
 cmake --build ./build --config Release -j $(nproc)
 
-# Compile Vulkan backend
-# Experimental doesn't work with all quant types, need to test some more
-# https://github.com/ikawrakow/ik_llama.cpp/discussions/590
-cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_HIPBLAS=0 -DGGML_VULKAN=1
-cmake --build build --config Release -j $(nproc)
-
 # Run Server
 ./build/bin/llama-server \
     --model Qwen3-Coder-30B-A3B-Instruct-IQ3_KS.gguf \
images/perplexity.png
ADDED
(perplexity comparison graph; binary image tracked with Git LFS)