upload perplexity graph
- README.md +11 -115
- images/perplexity.png +3 -0
README.md
CHANGED
@@ -33,14 +33,14 @@ Perplexity computed against *wiki.test.raw*.
 
 These first three are just test quants for baseline perplexity comparison:
 * `bf16` 56.894 GiB (16.007 BPW)
-  - Final estimate: PPL =
+  - Final estimate: PPL = 9.5334 +/- 0.07560
 * `Q8_0` 30.247 GiB (8.510 BPW)
-  - Final estimate: PPL =
+  - Final estimate: PPL = 9.5317 +/- 0.07551 (*NOTE* lower than BF16 but didn't use it for "baseline"...)
 * `Q4_0` 16.111 GiB (4.533 BPW)
-  - Final estimate: PPL =
+  - Final estimate: PPL = 9.7225 +/- 0.07712
 
 ## `IQ5_K` 21.324 GiB (5.999 BPW)
-Final estimate: PPL =
+Final estimate: PPL = 9.5930 +/- 0.07614
 
 <details>
 
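The `Final estimate: PPL = ...` figures above are the closing line that `llama-perplexity` prints. A minimal sketch of such a run against *wiki.test.raw* (the model path here is a placeholder; the exact command behind these numbers is not shown in this commit):

```bash
# Hypothetical run of the perplexity tool; the last line of its output is the
# "Final estimate: PPL = ..." value quoted in the README.
./build/bin/llama-perplexity \
    --model Qwen3-Coder-30B-A3B-Instruct-IQ5_K.gguf \
    --file wiki.test.raw
```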
@@ -92,7 +92,7 @@ custom=$(
 </details>
 
 ## `IQ4_K` 17.878 GiB (5.030 BPW)
-Final estimate: PPL =
+Final estimate: PPL = 9.6023 +/- 0.07613
 
 <details>
 
@@ -144,7 +144,7 @@ custom=$(
 </details>
 
 ## `IQ4_KSS` 15.531 GiB (4.370 BPW)
-Final estimate: PPL =
+Final estimate: PPL = 9.6441 +/- 0.07648
 
 <details>
 
@@ -195,106 +195,8 @@ custom=$(
 
 </details>
 
-## `IQ4_KT` 14.438 GiB (4.062 BPW)
-Final estimate: PPL = TODO
-
-Mostly pure IQ4_KT meant for full GPU offload, similar to [turboderp-org/exllamav3](https://github.com/turboderp-org/exllamav3); [check out ArtusDev's HuggingFace page](https://huggingface.co/ArtusDev) for some excellent EXL3 quants!
-
-<details>
-
-<summary>👈 Secret Recipe</summary>
-
-```bash
-#!/usr/bin/env bash
-
-custom="
-# 48 Repeating Layers [0-47]
-
-# Attention
-blk\..*\.attn_q.*=iq4_kt
-blk\..*\.attn_k.*=iq4_kt
-blk\..*\.attn_v.*=iq4_kt
-blk\..*\.attn_output.*=iq4_kt
-
-# Routed Experts
-blk\..*\.ffn_down_exps\.weight=iq4_kt
-blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kt
-
-# Non-Repeating Layers
-token_embd\.weight=iq4_kt
-output\.weight=iq6_k
-"
-
-custom=$(
-  echo "$custom" | grep -v '^#' | \
-  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
-)
-
-./build/bin/llama-quantize \
-    --custom-q "$custom" \
-    --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/imatrix-Qwen3-Coder-30B-A3B-Instruct-BF16.dat \
-    /mnt/raid/models/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
-    /mnt/raid/models/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-IQ4_KT.gguf \
-    IQ4_KT \
-    192
-```
-
-</details>
-
-## `IQ3_K` 14.509 GiB (4.082 BPW)
-Final estimate: PPL = TODO
-
-<details>
-
-<summary>👈 Secret Recipe</summary>
-
-```bash
-#!/usr/bin/env bash
-
-custom="
-# 48 Repeating Layers [0-47]
-
-# Attention
-blk\.(0)\.attn_q.*=q8_0
-blk\.(0)\.attn_k.*=q8_0
-blk\.(0)\.attn_v.*=q8_0
-blk\.(0)\.attn_output.*=q8_0
-
-blk\..*\.attn_q.*=iq5_k
-blk\..*\.attn_k.*=iq6_k
-blk\..*\.attn_v.*=iq6_k
-blk\..*\.attn_output.*=iq5_k
-
-# Routed Experts
-blk\.(0|47)\.ffn_down_exps\.weight=q8_0
-blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
-
-blk\..*\.ffn_down_exps\.weight=iq4_k
-blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k
-
-# Non-Repeating Layers
-token_embd\.weight=iq4_k
-output\.weight=iq6_k
-"
-
-custom=$(
-  echo "$custom" | grep -v '^#' | \
-  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
-)
-
-./build/bin/llama-quantize \
-    --custom-q "$custom" \
-    --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/imatrix-Qwen3-Coder-30B-A3B-Instruct-BF16.dat \
-    /mnt/raid/models/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
-    /mnt/raid/models/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-IQ3_K.gguf \
-    IQ3_K \
-    192
-```
-
-</details>
-
 ## `IQ3_KS` 13.633 GiB (3.836 BPW)
-Final estimate: PPL =
+Final estimate: PPL = 9.7940 +/- 0.07795
 
 <details>
 
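Within the removed recipes above, the `custom=$( ... )` step simply collapses the multi-line regex-to-quant map into the single comma-separated string that `--custom-q` expects: `grep -v '^#'` drops the comment lines, and the `sed` call turns runs of newlines into commas and trims the ends. A tiny standalone sketch of that transformation, reusing two rules from the recipes (the surrounding script is illustrative, not part of the README):

```bash
#!/usr/bin/env bash
# Illustrative only: show what the grep/sed pipeline in the recipes produces.
custom="
# comment lines are stripped
blk\..*\.attn_q.*=iq4_kt
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

echo "$custom"
# prints: blk\..*\.attn_q.*=iq4_kt,output\.weight=iq6_k
```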
@@ -346,7 +248,7 @@ custom=$(
 </details>
 
 ## `IQ2_KL` 11.516 GiB (3.240 BPW)
-Final estimate: PPL =
+Final estimate: PPL = 10.0475 +/- 0.08016
 
 <details>
 
@@ -398,7 +300,7 @@ custom=$(
 </details>
 
 ## `IQ2_KT` 9.469 GiB (2.664 BPW)
-Final estimate: PPL =
+Final estimate: PPL = 10.1352 +/- 0.08007
 
 <details>
 
@@ -449,7 +351,7 @@ custom=$(
 </summary>
 
 ## `IQ1_KT` 7.583 GiB (2.133 BPW)
-Final estimate: PPL =
+Final estimate: PPL = 11.0592 +/- 0.08760
 
 <details>
 
@@ -500,18 +402,12 @@ custom=$(
 </details>
 
 ## Quick Start
-#### Full GPU Offload with CUDA
+#### Full GPU Offload with CUDA
 ```bash
 # Compile CUDA backend
 cmake -B ./build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_F16=ON
 cmake --build ./build --config Release -j $(nproc)
 
-# Compile Vulkan backend
-# Experimental doesn't work with all quant types, need to test some more
-# https://github.com/ikawrakow/ik_llama.cpp/discussions/590
-cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_HIPBLAS=0 -DGGML_VULKAN=1
-cmake --build build --config Release -j $(nproc)
-
 # Run Server
 ./build/bin/llama-server \
     --model Qwen3-Coder-30B-A3B-Instruct-IQ3_KS.gguf \
images/perplexity.png
ADDED
(perplexity comparison graph; binary image tracked with Git LFS)