ubergarm committed
Commit 0737389 · Parent(s): 0e387d8

Add benchmarks discussion to README

Files changed (1):
  1. README.md +12 -6
README.md CHANGED
@@ -28,7 +28,7 @@ Excited to share and learn together. Thanks!
  So far these are my best recipes, offering great quality at good memory-footprint breakpoints.
 
  #### ubergarm/Qwen3-235B-A22B-mix-IQ3_K.gguf
- This quant is designed to run at max speed on combinations of just under ~110GiB total (V)RAM, e.g. 24GB VRAM + 96GB RAM (perfect for AM5 or LGA 1700 gamer rigs with 2x48GiB DDR5 DIMMs for max performance). This allows `-rtr` run-time repacking for maximum CPU throughput. You can still omit `-rtr` and use the default `mmap()` behavior to run in less RAM at a penalty to speed. Or you can "offline repack" to fit your exact setup and get the best of both worlds: quicker startup with `mmap()` and max CPU throughput.
+ This quant is designed to run at max speed on combinations of just under ~110GiB total (V)RAM, e.g. 24GB VRAM + 96GB RAM (perfect for AM5 or LGA 1700 gamer rigs with 2x48GiB DDR5 DIMMs for max performance). This allows `-rtr` run-time repacking for maximum CPU throughput. You can still omit `-rtr` and use the default `mmap()` behavior to run in less RAM at a penalty to speed. Or you can "offline repack" to fit your exact setup and get the best of both worlds: quicker startup with `mmap()` and max CPU throughput. However, you may still need `--no-mmap` depending on how Transparent Hugepages (THP) are configured and how they affect performance on your rig.
  ```
  106.830 GiB (3.903 BPW)
 
@@ -38,13 +38,15 @@ iq3_k: 188 tensors
  iq4_k: 94 tensors
  iq6_k: 376 tensors
 
- Final estimate: PPL = 5.4403 +/- 0.03421 (wiki.test.raw, compare to Q8_0 at 5.3141 +/- 0.03321) (TODO: more benchmarking)
+ Final estimate: PPL = 5.4403 +/- 0.03421 (wiki.test.raw, compare to Q8_0 at 5.3141 +/- 0.03321) (*TODO*: more benchmarking)
  ```
 
  ## Quick Start
  #### `ik_llama.cpp` API server for hybrid GPU+CPU inferencing
  ```bash
  # This example is for 24GB VRAM + 96GB RAM + 16 physical core CPU
+ # Offload the first ffn layers 0-11 to GPU VRAM.
+ # Offload the remaining ffn layers 12-93 to CPU RAM.
  ./build/bin/llama-server \
  --model ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
  --alias ubergarm/Qwen3-235B-A22B-mix-IQ3_K \
@@ -54,9 +56,9 @@ Final estimate: PPL = 5.4403 +/- 0.03421 (wiki.test.raw, compare to Q8_0 at 5.31
  -fmoe \
  -amb 512 \
  -rtr \
- -ot blk\.[0-9]\.ffn.*=CUDA0 \
- -ot blk\.1[0-2]\.ffn.*=CUDA0 \
- -ot exps=CPU \
+ -ot blk\.1[2-9]\.ffn.*=CPU \
+ -ot blk\.[2-8][0-9]\.ffn.*=CPU \
+ -ot blk\.9[0-3]\.ffn.*=CPU \
  -ngl 99 \
  --threads 16 \
  --host 127.0.0.1 \
@@ -137,9 +139,13 @@ custom=$(
  </details>
 
  ## Discussion
- TODO: Discuss comparing these quants with others, e.g. bartowski, unsloth, and mradermacher, in terms of "quality" and "speed".
+ *TODO*: Discuss comparing these quants with others, e.g. bartowski, unsloth, and mradermacher, in terms of "quality" and "speed".
+
+ ## Benchmarks
+ In first tests with `llama-sweep-bench` I'm getting up to 140 tok/sec PP and 10 tok/sec TG on my 3090TI FE 24GB VRAM + AMD 9950X 2x48GB DDR5-6400 96GB RAM with overclocked Infinity Fabric. It does slow down, of course, as it gets deeper into the full 32k context. Check the linked Benchmarks Discussion for updates, as this is all pretty fresh right now. Pretty amazing performance for a high-quality LLM on a high-end gaming rig though!
 
  ## References
  * [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/)
  * [ik_llama.cpp Getting Started Guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258)
+ * [ik_llama.cpp Benchmarks Discussion](https://github.com/ikawrakow/ik_llama.cpp/discussions/357)
  * [imatrix calibration_data_v5_rc.txt](https://gist.github.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c#file-calibration_data_v5_rc-txt)
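
A note on the `mmap()` vs `--no-mmap` caveat added above: whether `mmap()` streaming performs well can depend on the kernel's Transparent Hugepage mode. A minimal sketch for checking it, using the standard Linux sysfs paths (adjust for your distro):

```bash
# Show how Transparent Hugepages are configured on this rig.
# The bracketed value is the active mode (always / madvise / never).
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

# If mmap() throughput looks poor under the current THP mode, try loading
# the model fully into RAM instead by adding --no-mmap to the server command.
```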
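
As a rough sanity check on the `106.830 GiB (3.903 BPW)` figure in the diff, bits-per-weight times parameter count should land near the file size. A quick sketch, assuming ~235B total parameters as the model name suggests:

```bash
# size_bytes ≈ params * bpw / 8, converted to GiB (1024^3 bytes)
awk 'BEGIN { params = 235e9; bpw = 3.903;
             printf "%.1f GiB\n", params * bpw / 8 / (1024 ^ 3) }'
# prints ~106.8 GiB, consistent with the reported size
```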
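
For context on the perplexity line, the gap to the Q8_0 reference works out to a couple of percent. A quick calculation from the two values quoted in the README:

```bash
# relative PPL increase of the IQ3_K mix vs the Q8_0 reference
awk 'BEGIN { mix = 5.4403; q8 = 5.3141;
             printf "+%.2f%% PPL vs Q8_0\n", (mix - q8) / q8 * 100 }'
# prints roughly +2.37%
```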
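
The `-ot` overrides added in the diff are regular expressions matched against tensor names, so it is easy to check that the three `=CPU` patterns cover exactly ffn blocks 12-93 while blocks 0-11 stay on the GPU. A small sketch with illustrative tensor names (the `blk.N.ffn_*` naming follows the usual GGUF convention, not verified against this exact file):

```bash
# Route synthetic ffn tensor names through the three CPU-override regexes;
# blocks 0-11 should fall through to the GPU, blocks 12-93 should hit the CPU.
for i in $(seq 0 93); do
  name="blk.${i}.ffn_down_exps.weight"
  if echo "$name" | grep -Eq 'blk\.1[2-9]\.ffn|blk\.[2-8][0-9]\.ffn|blk\.9[0-3]\.ffn'; then
    echo "$name -> CPU"
  else
    echo "$name -> GPU"
  fi
done
```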