ubergarm committed
Commit 6604417 · 1 Parent(s): 1994430

upload perplexity graph

Files changed (2):
  1. README.md +11 -115
  2. images/perplexity.png +3 -0
README.md CHANGED
@@ -33,14 +33,14 @@ Perplexity computed against *wiki.test.raw*.
 
 These first three are just test quants for baseline perplexity comparison:
 * `bf16` 56.894 GiB (16.007 BPW)
- - Final estimate: PPL = TODO
+ - Final estimate: PPL = 9.5334 +/- 0.07560
 * `Q8_0` 30.247 GiB (8.510 BPW)
- - Final estimate: PPL = TODO
+ - Final estimate: PPL = 9.5317 +/- 0.07551 (*NOTE* lower than BF16, but BF16 is still used as the "baseline"...)
 * `Q4_0` 16.111 GiB (4.533 BPW)
- - Final estimate: PPL = TODO
+ - Final estimate: PPL = 9.7225 +/- 0.07712
 
 ## `IQ5_K` 21.324 GiB (5.999 BPW)
- Final estimate: PPL = TODO
+ Final estimate: PPL = 9.5930 +/- 0.07614
 
 <details>
 
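These figures are the `Final estimate` line that `llama-perplexity` prints at the end of a run. As a minimal sketch of the kind of invocation that produces them (the exact flags used for this repo's measurements are not recorded in this diff, and the model filename here is just one of the quants above):

```bash
# Hedged sketch, not the repo's recorded command: assumes an ik_llama.cpp
# build in ./build and wiki.test.raw in the working directory.
./build/bin/llama-perplexity \
  --model Qwen3-Coder-30B-A3B-Instruct-IQ5_K.gguf \
  -f wiki.test.raw
# ...prints per-chunk perplexities and ends with a line like:
# Final estimate: PPL = 9.5930 +/- 0.07614
```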
@@ -92,7 +92,7 @@ custom=$(
 </details>
 
 ## `IQ4_K` 17.878 GiB (5.030 BPW)
- Final estimate: PPL = TODO
+ Final estimate: PPL = 9.6023 +/- 0.07613
 
 <details>
 
@@ -144,7 +144,7 @@ custom=$(
 </details>
 
 ## `IQ4_KSS` 15.531 GiB (4.370 BPW)
- Final estimate: PPL = TODO
+ Final estimate: PPL = 9.6441 +/- 0.07648
 
 <details>
 
@@ -195,106 +195,8 @@ custom=$(
 
 </details>
 
- ## `IQ4_KT` 14.438 GiB (4.062 BPW)
- Final estimate: PPL = TODO
-
- Mostly pure IQ4_KT meant for full GPU offload, similar to [turboderp-org/exllamav3](https://github.com/turboderp-org/exllamav3); [check out ArtusDev's HuggingFace Page](https://huggingface.co/ArtusDev) for some excellent EXL3 quants!
-
- <details>
-
- <summary>👈 Secret Recipe</summary>
-
- ```bash
- #!/usr/bin/env bash
-
- custom="
- # 48 Repeating Layers [0-47]
-
- # Attention
- blk\..*\.attn_q.*=iq4_kt
- blk\..*\.attn_k.*=iq4_kt
- blk\..*\.attn_v.*=iq4_kt
- blk\..*\.attn_output.*=iq4_kt
-
- # Routed Experts
- blk\..*\.ffn_down_exps\.weight=iq4_kt
- blk\..*\.ffn_(gate|up)_exps\.weight=iq4_kt
-
- # Non-Repeating Layers
- token_embd\.weight=iq4_kt
- output\.weight=iq6_k
- "
-
- custom=$(
- echo "$custom" | grep -v '^#' | \
- sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
- )
-
- ./build/bin/llama-quantize \
- --custom-q "$custom" \
- --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/imatrix-Qwen3-Coder-30B-A3B-Instruct-BF16.dat \
- /mnt/raid/models/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
- /mnt/raid/models/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-IQ4_KT.gguf \
- IQ4_KT \
- 192
- ```
-
- </details>
-
- ## `IQ3_K` 14.509 GiB (4.082 BPW)
- Final estimate: PPL = TODO
-
- <details>
-
- <summary>👈 Secret Recipe</summary>
-
- ```bash
- #!/usr/bin/env bash
-
- custom="
- # 48 Repeating Layers [0-47]
-
- # Attention
- blk\.(0)\.attn_q.*=q8_0
- blk\.(0)\.attn_k.*=q8_0
- blk\.(0)\.attn_v.*=q8_0
- blk\.(0)\.attn_output.*=q8_0
-
- blk\..*\.attn_q.*=iq5_k
- blk\..*\.attn_k.*=iq6_k
- blk\..*\.attn_v.*=iq6_k
- blk\..*\.attn_output.*=iq5_k
-
- # Routed Experts
- blk\.(0|47)\.ffn_down_exps\.weight=q8_0
- blk\.(0|47)\.ffn_(gate|up)_exps\.weight=q8_0
-
- blk\..*\.ffn_down_exps\.weight=iq4_k
- blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k
-
- # Non-Repeating Layers
- token_embd\.weight=iq4_k
- output\.weight=iq6_k
- "
-
- custom=$(
- echo "$custom" | grep -v '^#' | \
- sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
- )
-
- ./build/bin/llama-quantize \
- --custom-q "$custom" \
- --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/imatrix-Qwen3-Coder-30B-A3B-Instruct-BF16.dat \
- /mnt/raid/models/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-BF16-00001-of-00002.gguf \
- /mnt/raid/models/ubergarm/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-IQ3_K.gguf \
- IQ3_K \
- 192
- ```
-
- </details>
-
 ## `IQ3_KS` 13.633 GiB (3.836 BPW)
- Final estimate: PPL = TODO
+ Final estimate: PPL = 9.7940 +/- 0.07795
 
 <details>
 
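Although this commit drops the `IQ4_KT` and `IQ3_K` recipes from the README, they show the transform all of these recipes rely on: `grep`/`sed` collapse the commented, multi-line `custom` spec into the single comma-separated list that `llama-quantize --custom-q` expects, and the trailing `192` is the optional thread-count argument. A minimal sketch of just that transform, reusing two rules from the recipes:

```bash
# Two rules lifted from the recipes above, plus a comment line.
custom="
# comments like this are stripped by grep -v '^#'
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

# sed -z reads the whole input as one string, so runs of newlines can be
# squashed into commas; the last two substitutions trim the leading and
# trailing commas left by the surrounding blank lines.
echo "$custom" | grep -v '^#' | sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
# prints: token_embd\.weight=iq4_k,output\.weight=iq6_k
```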
@@ -346,7 +248,7 @@ custom=$(
 </details>
 
 ## `IQ2_KL` 11.516 GiB (3.240 BPW)
- Final estimate: PPL = TODO
+ Final estimate: PPL = 10.0475 +/- 0.08016
 
 <details>
 
@@ -398,7 +300,7 @@ custom=$(
 </details>
 
 ## `IQ2_KT` 9.469 GiB (2.664 BPW)
- Final estimate: PPL = TODO
+ Final estimate: PPL = 10.1352 +/- 0.08007
 
 <details>
 
@@ -449,7 +351,7 @@ custom=$(
 </summary>
 
 ## `IQ1_KT` 7.583 GiB (2.133 BPW)
- Final estimate: PPL = TODO
+ Final estimate: PPL = 11.0592 +/- 0.08760
 
 <details>
 
@@ -500,18 +402,12 @@ custom=$(
 
 </details>
 
 ## Quick Start
- #### Full GPU Offload with CUDA or Vulkan (for AMD GPUs)
+ #### Full GPU Offload with CUDA
 ```bash
 # Compile CUDA backend
 cmake -B ./build -DCMAKE_BUILD_TYPE=Release -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_F16=ON
 cmake --build ./build --config Release -j $(nproc)
 
- # Compile Vulkan backend
- # Experimental: doesn't work with all quant types yet; needs more testing
- # https://github.com/ikawrakow/ik_llama.cpp/discussions/590
- cmake -B build -DCMAKE_BUILD_TYPE=Release -DGGML_HIPBLAS=0 -DGGML_VULKAN=1
- cmake --build build --config Release -j $(nproc)
-
 # Run Server
 ./build/bin/llama-server \
   --model Qwen3-Coder-30B-A3B-Instruct-IQ3_KS.gguf \
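The `llama-server` invocation is truncated by the diff context here. Once the server is up, it exposes an OpenAI-compatible HTTP API; a minimal smoke test, assuming the default bind of `127.0.0.1:8080` (the actual `--host`/`--port` flags are not shown in this diff):

```bash
# Assumes llama-server's default port 8080; adjust if the full command
# in the README binds elsewhere.
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Write hello world in Python."}]}'
```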
images/perplexity.png ADDED

Git LFS Details

  • SHA256: 0fe4f8a30de9cf560a50a3e8a130444c7de90857d2d035d24cfbaa959be4d512
  • Pointer size: 131 Bytes
  • Size of remote file: 148 kB