ubergarm committed
Commit 855506d · 1 Parent(s): 701ea4f

initial commit

Files changed (2):
  1. .gitattributes +3 -0
  2. README.md +366 -0
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ imatrix-*.dat filter=lfs diff=lfs merge=lfs -text
+ *.gguf filter=lfs diff=lfs merge=lfs -text
+ *.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
---
quantized_by: ubergarm
pipeline_tag: text-generation
base_model: Qwen/Qwen3-Coder-480B-A35B-Instruct
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct/blob/main/LICENSE
base_model_relation: quantized
tags:
- imatrix
- qwen3_moe
- conversational
- ik_llama.cpp
---
*WIP* Still cooking and will upload ASAP.

## `ik_llama.cpp` imatrix Quantizations of Qwen/Qwen3-Coder-480B-A35B-Instruct
This quant collection **REQUIRES** the [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/) fork to support ik's latest SOTA quants and optimizations! Do **not** download these big files and expect them to run on mainline vanilla llama.cpp, ollama, LM Studio, KoboldCpp, etc.!

*NOTE*: `ik_llama.cpp` can also run your existing GGUFs from bartowski, unsloth, mradermacher, etc., if you want to try it out before downloading my quants.

Some of ik's new quants are also supported by the [Nexesenex/croco.cpp](https://github.com/Nexesenex/croco.cpp) fork of KoboldCPP.

These quants provide best-in-class perplexity for the given memory footprint.

## Big Thanks
Shout out to Wendell and the **Level1Techs** crew, the community [Forums](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826), and the [YouTube Channel](https://www.youtube.com/@Level1Techs)! **BIG thanks** for providing **BIG hardware** expertise and access to run these experiments and make these great quants available to the community!!!

Also thanks to all the folks in the quanting and inferencing community on the [BeaverAI Club Discord](https://huggingface.co/BeaverAI) and on [r/LocalLLaMA](https://www.reddit.com/r/LocalLLaMA/) for tips and tricks helping each other run, test, and benchmark all the fun new models!

## Quant Collection
Perplexity computed against *wiki.test.raw*. These first two are just test quants for baseline perplexity comparison:

* `bf16` TODO
  - Final estimate: PPL = TODO
* `Q8_0` 475.297 GiB (8.503 BPW)
  - Final estimate: PPL = TODO
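As a sanity check on the sizes above, bits-per-weight follows directly from file size and total parameter count. A rough sketch, assuming roughly 480B total parameters (approximate; the exact count differs slightly, which is why the listed 8.503 BPW is a touch lower):

```bash
# BPW = (file size in bytes * 8 bits/byte) / total parameter count
# 475.297 GiB * 2^30 * 8 / ~480e9 params
awk 'BEGIN { printf "%.1f\n", 475.297 * 1073741824 * 8 / 480e9 }'
# → 8.5
```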
## `IQ5_K` TODO
Final estimate: TODO

<details>

<summary>👈 Secret Recipe</summary>

```bash
#!/usr/bin/env bash

# Repeating Layers [0-61]

custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq6_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k

# Non-Repeating Layers
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat \
    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf \
    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ5_K.gguf \
    IQ5_K \
    192
```

</details>
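The `grep`/`sed` pipeline used in these recipes just strips comment lines and blank lines, then joins the remaining tensor-pattern rules with commas, since `--custom-q` takes a single comma-separated list. A standalone sketch of that transform on a shortened two-rule recipe (GNU `sed` assumed, for `-z` null-data mode):

```bash
# Same transform as in the recipes: drop comment lines, then collapse
# newlines into commas and trim any leading/trailing commas.
custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
"
custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
echo "$custom"
# → blk\..*\.attn_q.*=iq6_k,blk\..*\.attn_k.*=q8_0
```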
## `IQ4_K` TODO
Final estimate: TODO

<details>

<summary>👈 Secret Recipe</summary>

```bash
#!/usr/bin/env bash

# Repeating Layers [0-61]

custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq5_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq4_k

# Non-Repeating Layers
token_embd\.weight=iq6_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat \
    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf \
    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ4_K.gguf \
    IQ4_K \
    192
```

</details>
## `IQ3_K` TODO
Final estimate: TODO

<details>

<summary>👈 Secret Recipe</summary>

```bash
#!/usr/bin/env bash

# Repeating Layers [0-61]

custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq4_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq3_k

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat \
    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf \
    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ3_K.gguf \
    IQ3_K \
    192
```

</details>
## `IQ2_KL` TODO
Final estimate: TODO

<details>

<summary>👈 Secret Recipe</summary>

```bash
#!/usr/bin/env bash

# Repeating Layers [0-61]

custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq3_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kl

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat \
    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf \
    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_KL.gguf \
    IQ2_KL \
    192
```

</details>
## `IQ2_K` TODO
Final estimate: TODO

<details>

<summary>👈 Secret Recipe</summary>

```bash
#!/usr/bin/env bash

# Repeating Layers [0-61]

custom="
# Attention
blk\..*\.attn_q.*=iq6_k
blk\..*\.attn_k.*=q8_0
blk\..*\.attn_v.*=q8_0
blk\..*\.attn_output.*=iq6_k

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq2_kl
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_k

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat \
    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf \
    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_K.gguf \
    IQ2_K \
    192
```

</details>
## `IQ2_KS` TODO
Final estimate: TODO

<details>

<summary>👈 Secret Recipe</summary>

```bash
#!/usr/bin/env bash

# Repeating Layers [0-61]

custom="
# Attention
blk\..*\.attn_q.*=iq4_ks
blk\..*\.attn_k.*=iq5_ks
blk\..*\.attn_v.*=iq5_ks
blk\..*\.attn_output.*=iq4_ks

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq3_ks
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_ks

# Non-Repeating Layers
token_embd\.weight=iq4_k
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat \
    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf \
    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_KS.gguf \
    IQ2_KS \
    192
```

</details>
## `IQ2_KT` TODO
Final estimate: TODO

<details>

<summary>👈 Secret Recipe</summary>

```bash
#!/usr/bin/env bash

# Repeating Layers [0-61]

custom="
# Attention
blk\..*\.attn_q.*=iq4_kt
blk\..*\.attn_k.*=iq4_kt
blk\..*\.attn_v.*=iq4_kt
blk\..*\.attn_output.*=iq4_kt

# Routed Experts
blk\..*\.ffn_down_exps\.weight=iq3_kt
blk\..*\.ffn_(gate|up)_exps\.weight=iq2_kt

# Non-Repeating Layers
token_embd\.weight=iq4_kt
output\.weight=iq6_k
"

custom=$(
  echo "$custom" | grep -v '^#' | \
  sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)

numactl -N 0 -m 0 \
./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/imatrix-Qwen3-Coder-480B-A35B-Instruct-Q8_0.dat \
    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-BF16-00001-of-00021.gguf \
    /mnt/raid/models/ubergarm/Qwen3-Coder-480B-A35B-Instruct-GGUF/Qwen3-Coder-480B-A35B-Instruct-IQ2_KT.gguf \
    IQ2_KT \
    192
```

</details>
## Quick Start
This example is for a single CUDA GPU hybrid inferencing with CPU/RAM. Check the ik_llama.cpp discussions or my other quants for more examples, e.g. multi-GPU.

```bash
./build/bin/llama-server \
    --model /models/IQ2_KS/Qwen3-Coder-480B-A35B-Instruct-IQ2_KS.gguf \
    --alias ubergarm/Qwen3-Coder-480B-A35B-Instruct \
    -fa -fmoe \
    -ctk q8_0 -ctv q8_0 \
    -c 32768 \
    -ngl 99 \
    -ot "blk\.[0-9]\.ffn.*=CUDA0" \
    -ot "blk.*\.ffn.*=CPU" \
    --threads 16 \
    -ub 4096 -b 4096 \
    --host 127.0.0.1 \
    --port 8080
```
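The two `-ot` override-tensor rules are order-dependent: each tensor name is tried against the patterns in turn, so `blk\.[0-9]\.ffn.*` pins the expert/FFN tensors of single-digit layers 0-9 to CUDA0 first, and the broader `blk.*\.ffn.*` catch-all sends the remaining layers' experts to CPU. A quick way to see which rule a given tensor would hit, using a hypothetical `route` helper with `grep -E` as a stand-in for the regex matching:

```bash
# First-match-wins routing, mirroring the two -ot rules above.
route() {
  if   echo "$1" | grep -Eq 'blk\.[0-9]\.ffn.*'; then echo CUDA0
  elif echo "$1" | grep -Eq 'blk.*\.ffn.*';      then echo CPU
  else echo default; fi   # attention etc. fall through to -ngl placement
}
route "blk.3.ffn_down_exps.weight"    # → CUDA0
route "blk.42.ffn_down_exps.weight"   # → CPU
route "blk.42.attn_q.weight"          # → default
```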
## References
* [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp)
* [Getting Started Guide (already out of date lol)](https://github.com/ikawrakow/ik_llama.cpp/discussions/258)