ubergarm committed
Commit 5af7b05 · 0 Parent(s)

initial commit

Files changed (2)
  1. .gitattributes +38 -0
  2. README.md +142 -0
.gitattributes ADDED
@@ -0,0 +1,38 @@
+ *.7z filter=lfs diff=lfs merge=lfs -text
+ *.arrow filter=lfs diff=lfs merge=lfs -text
+ *.bin filter=lfs diff=lfs merge=lfs -text
+ *.bz2 filter=lfs diff=lfs merge=lfs -text
+ *.ckpt filter=lfs diff=lfs merge=lfs -text
+ *.ftz filter=lfs diff=lfs merge=lfs -text
+ *.gz filter=lfs diff=lfs merge=lfs -text
+ *.h5 filter=lfs diff=lfs merge=lfs -text
+ *.joblib filter=lfs diff=lfs merge=lfs -text
+ *.lfs.* filter=lfs diff=lfs merge=lfs -text
+ *.mlmodel filter=lfs diff=lfs merge=lfs -text
+ *.model filter=lfs diff=lfs merge=lfs -text
+ *.msgpack filter=lfs diff=lfs merge=lfs -text
+ *.npy filter=lfs diff=lfs merge=lfs -text
+ *.npz filter=lfs diff=lfs merge=lfs -text
+ *.onnx filter=lfs diff=lfs merge=lfs -text
+ *.ot filter=lfs diff=lfs merge=lfs -text
+ *.parquet filter=lfs diff=lfs merge=lfs -text
+ *.pb filter=lfs diff=lfs merge=lfs -text
+ *.pickle filter=lfs diff=lfs merge=lfs -text
+ *.pkl filter=lfs diff=lfs merge=lfs -text
+ *.pt filter=lfs diff=lfs merge=lfs -text
+ *.pth filter=lfs diff=lfs merge=lfs -text
+ *.rar filter=lfs diff=lfs merge=lfs -text
+ *.safetensors filter=lfs diff=lfs merge=lfs -text
+ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
+ *.tar.* filter=lfs diff=lfs merge=lfs -text
+ *.tar filter=lfs diff=lfs merge=lfs -text
+ *.tflite filter=lfs diff=lfs merge=lfs -text
+ *.tgz filter=lfs diff=lfs merge=lfs -text
+ *.wasm filter=lfs diff=lfs merge=lfs -text
+ *.xz filter=lfs diff=lfs merge=lfs -text
+ *.zip filter=lfs diff=lfs merge=lfs -text
+ *.zst filter=lfs diff=lfs merge=lfs -text
+ *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.dat filter=lfs diff=lfs merge=lfs -text
+ *.gguf filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,142 @@
+ ---
+ quantized_by: ubergarm
+ pipeline_tag: text-generation
+ base_model: Qwen/Qwen3-235B-A22B
+ license: mit
+ base_model_relation: quantized
+ tags:
+ - imatrix
+ - qwen3_moe
+ - conversational
+ - ik_llama.cpp
+ ---
+
+ ## `ik_llama.cpp` imatrix Quantizations of Qwen/Qwen3-235B-A22B
+
+ This quant collection **REQUIRES** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support the advanced non-linear SotA quants. Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc!
+
+ These quants provide best-in-class quality for the given memory footprint.
+
+ ## Big Thanks
+ Shout out to Wendell and the **Level1Techs** crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), and [YouTube Channel](https://www.youtube.com/@Level1Techs)! **BIG thanks** for providing **BIG hardware** expertise and access to run these experiments and make these great quants available to the community!!!
+
+ Also thanks to all the folks in the quanting and inferencing community here and on `r/LocalLLaMA` for tips and tricks helping each other run all the fun new models!
+
+ Excited to share and learn together. Thanks!
+
+ ## Quant Collection
+ So far these are my best recipes, offering great quality at useful memory-footprint breakpoints.
+
+ #### ubergarm/Qwen3-235B-A22B-mix-IQ3_K.gguf
+ This quant is designed to run at max speed in just under ~110GiB of combined (V)RAM, e.g. 24GB VRAM + 96GB RAM (perfect for AM5 or LGA 1700 gamer rigs with 2x48GiB DDR5 DIMMs). Fitting fully in RAM leaves room for `-rtr` run-time repacking for maximum CPU throughput. You can still omit `-rtr` and use the default `mmap()` behavior to run in less RAM at a penalty to speed. Or you can "offline repack" the GGUF to fit your exact setup and get the best of both worlds: quicker startup with `mmap()` plus max CPU throughput.
+ ```
+ 106.830 GiB (3.903 BPW)
+
+ f32: 471 tensors
+ q8_0: 2 tensors
+ iq3_k: 188 tensors
+ iq4_k: 94 tensors
+ iq6_k: 376 tensors
+
+ Final estimate: PPL = 5.4403 +/- 0.03421 (compare to Q8_0 at 5.3141 +/- 0.03321) (TODO: more benchmarking)
+ ```
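+
+ The perplexity figure above can be re-checked with the fork's `llama-perplexity` binary. A minimal sketch, assuming a standard `wiki.test.raw`-style text file (the exact corpus behind the number above is not stated here):
+
+ ```bash
+ # Hypothetical re-run of the perplexity measurement; adjust paths and threads to your rig.
+ ./build/bin/llama-perplexity \
+     --model Qwen3-235B-A22B-mix-IQ3_K.gguf \
+     -f wiki.test.raw \
+     -fa \
+     --threads 16
+ ```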
+
+ ## Quick Start
+ #### `ik_llama.cpp` API server for GPU inferencing
+ ```bash
+ # This example is for 24GB VRAM + 96GB RAM + a 16 physical-core CPU.
+ # The -ot overrides keep all ffn tensors of blocks 0-12 on CUDA0 and send the
+ # remaining routed expert tensors to CPU; -ngl 99 offloads everything else to GPU.
+ ./build/bin/llama-server \
+     --model ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K.gguf \
+     -fa \
+     -ctk q8_0 -ctv q8_0 \
+     -c 32768 \
+     -fmoe \
+     -amb 512 \
+     -rtr \
+     -ot blk\.[0-9]\.ffn.*=CUDA0 \
+     -ot blk\.1[0-2]\.ffn.*=CUDA0 \
+     -ot exps=CPU \
+     -ngl 99 \
+     --threads 16 \
+     --host 127.0.0.1 \
+     --port 8080
+ ```
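+
+ Once the server is up you can smoke-test it with a quick request. This assumes the fork's `llama-server` exposes the same OpenAI-compatible chat endpoint as mainline llama.cpp:
+
+ ```bash
+ # Simple chat completion against the server started above.
+ curl http://127.0.0.1:8080/v1/chat/completions \
+     -H "Content-Type: application/json" \
+     -d '{"messages": [{"role": "user", "content": "Hello, who are you?"}], "temperature": 0.6}'
+ ```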
+
+ If you want more context and/or less VRAM usage, you can try:
+ * Smaller KV Cache quantization `-ctk q4_0 -ctv q4_0` (see the snippet below)
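+
+ For example, swapping just these two lines in the command above (the 64k context here is only an illustration):
+
+ ```bash
+ # q4_0 roughly halves KV cache memory vs q8_0, freeing room for more context.
+ -ctk q4_0 -ctv q4_0 \
+ -c 65536 \
+ ```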
+
+ ## Model Architecture
+
+ There are 94 repeating layers/blocks; the unquantized `bf16` version is `448501.04 MB` total. The table lists the embedding/output tensors plus one representative block (`blk.1`).
+
+ | Tensor | Dimension | Data Type | Size |
+ | --- | --- | --- | --- |
+ |token_embd.weight | [ 4096, 151936, 1, 1] | bf16 | 1187.00 MiB |
+ | | | | |
+ |blk.1.attn_k_norm.weight | [ 128, 1, 1, 1] | f32 | 0.000 MiB |
+ |blk.1.attn_q_norm.weight | [ 128, 1, 1, 1] | f32 | 0.000 MiB |
+ |blk.1.attn_norm.weight | [ 4096, 1, 1, 1] | f32 | 0.016 MiB |
+ |blk.1.ffn_gate_inp.weight | [ 4096, 128, 1, 1] | f32 | 2.000 MiB |
+ |blk.1.ffn_norm.weight | [ 4096, 1, 1, 1] | f32 | 0.016 MiB |
+ | | | | |
+ |blk.1.attn_k.weight | [ 4096, 512, 1, 1] | bf16 | 4.00 MiB |
+ |blk.1.attn_q.weight | [ 4096, 8192, 1, 1] | bf16 | 64.00 MiB |
+ |blk.1.attn_v.weight | [ 4096, 512, 1, 1] | bf16 | 4.00 MiB |
+ |blk.1.attn_output.weight | [ 8192, 4096, 1, 1] | bf16 | 64.00 MiB |
+ | | | | |
+ |blk.1.ffn_down_exps.weight | [ 1536, 4096, 128, 1] | bf16 | 1536.00 MiB |
+ |blk.1.ffn_gate_exps.weight | [ 4096, 1536, 128, 1] | bf16 | 1536.00 MiB |
+ |blk.1.ffn_up_exps.weight | [ 4096, 1536, 128, 1] | bf16 | 1536.00 MiB |
+ | | | | |
+ |output.weight | [ 4096, 151936, 1, 1] | bf16 | 1187.00 MiB |
+ |output_norm.weight | [ 4096, 1, 1, 1] | f32 | 0.016 MiB |
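+
+ As a quick sanity check on the table, the listed sizes follow directly from the tensor dimensions at 2 bytes per `bf16` element:
+
+ ```bash
+ # blk.1.ffn_down_exps.weight: 1536 x 4096 x 128 elements, 2 bytes each
+ echo $(( 1536 * 4096 * 128 * 2 / 1024 / 1024 ))   # -> 1536 (MiB)
+ # token_embd.weight: 4096 x 151936 elements, 2 bytes each
+ echo $(( 4096 * 151936 * 2 / 1024 / 1024 ))       # -> 1187 (MiB)
+ ```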
+
+ ## Quantization
+ <details>
+
+ <summary>👈Secret Recipe</summary>
+
+ ```bash
+ #!/usr/bin/env bash
+
+ custom="
+ # Attention
+ blk\..*\.attn_k.*=iq6_k
+ blk\..*\.attn_q.*=iq6_k
+ blk\..*\.attn_v.*=iq6_k
+ blk\..*\.attn_output.*=iq6_k
+
+ # Token Embedding (put these second so attn_output regex doesn't become q8_0)
+ token_embd\.weight=q8_0
+ output\.weight=q8_0
+
+ # Experts
+ blk\..*\.ffn_down_exps\.weight=iq4_k
+ blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k
+ "
+
+ custom=$(
+   echo "$custom" | grep -v '^#' | \
+   sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
+ )
+
+ #--token-embedding-type q8_0 \
+ #--output-tensor-type q8_0 \
+ ./build/bin/llama-quantize \
+     --custom-q "$custom" \
+     --imatrix /mnt/raid/models/ubergarm/Qwen3-235B-A22B-GGUF/imatrix-Qwen3-235B-A22B.dat \
+     /mnt/raid/models/Qwen/Qwen3-235B-A22B/Qwen3-235B-A22B-BF16-00001-of-00011.gguf \
+     /mnt/raid/models/ubergarm/Qwen3-235B-A22B-GGUF/Qwen3-235B-A22B-mix-IQ3_K.gguf \
+     IQ3_K \
+     24
+ ```
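+
+ For reference, after the `grep`/`sed` cleanup above, the `--custom-q` argument collapses into a single comma-separated list of rules:
+
+ ```
+ blk\..*\.attn_k.*=iq6_k,blk\..*\.attn_q.*=iq6_k,blk\..*\.attn_v.*=iq6_k,blk\..*\.attn_output.*=iq6_k,token_embd\.weight=q8_0,output\.weight=q8_0,blk\..*\.ffn_down_exps\.weight=iq4_k,blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k
+ ```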
+
+ </details>
+
+ ## Discussion
+ TODO: Discuss comparisons with other quants, e.g. bartowski, unsloth, and mradermacher, covering both "quality" and "speed".
+
+ ## References
+ * [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/)
+ * [ik_llama.cpp Getting Started Guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258)
+ * [imatrix calibration_data_v5_rc.txt](https://gist.github.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c#file-calibration_data_v5_rc-txt)