Quant in the ~700-740GB range
Which of the ik_llama quant types is best in this range? Are the standard ones okay, or do you suggest some special quant magic?
I'm uploading the IQ4_KS (550.428 GiB, 4.604 BPW) right now; it should be finished within an hour. Its perplexity is 3.0438, which is not much above the full Q8_0 score of 2.9507 (lower is "better").
So you could go bigger to fill out 768GB RAM rigs, but you'd be sacrificing speed for possibly not a ton better quality.
But if you do want something large, check out https://huggingface.co/anikifoss/Kimi-K2-Instruct-DQ4_K. @anikifoss tends to enjoy the larger quant sizes for maximum accuracy, which, relatively speaking, don't benefit as much from the imatrix.
ik_llama.cpp itself offers some nice iq6_k and iq5_k types that often give the best quality at a slight speed hit over the _ks quants, which I tend to like as a blend of quality and speed.
So give my IQ4_KS or that DQ4_K a go until something bigger comes out if you decide you want that extra long tail of quality!
Thanks for the answer. Since quality is more important for me than speed, I'll make my own quants.
@ChuckMcSneed feel free to adapt the recipes I've provided: basically use a bunch of iq6_k for the larger tensors and iq5_k for the smaller ones, for a combined BPW of about (6.6 + 2*5.5)/3 ≈ 5.9 BPW. 5.9/8.0 * 1016 GiB ≈ 750 GiB final size or so.
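If it helps, here is a rough sketch of what that could look like with ik_llama.cpp's `--custom-q`; the regexes, imatrix filename, output name, and thread count below are just placeholders, so adapt them from my actual recipe files:

```bash
# illustrative sketch only: adapt regexes/paths from the published recipe files
custom="
# routed experts: down a notch larger than gate/up
blk\..*\.ffn_down_exps\.weight=iq6_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k
# attention and shared experts stay at full q8_0
blk\..*\.attn_.*=q8_0
blk\..*\.ffn_.*_shexp\.weight=q8_0
"
# strip comments/blank lines and join into the comma-separated list --custom-q takes
custom=$(echo "$custom" | grep -v '^#' | grep -v '^$' | tr '\n' ',' | sed 's/,$//')

./build/bin/llama-quantize \
    --custom-q "$custom" \
    --imatrix imatrix-Kimi-K2-Instruct.dat \
    Kimi-K2-384x15B-Instruct-BF16.gguf \
    Kimi-K2-Instruct-IQ6K-IQ5K.gguf \
    IQ5_K \
    64
```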
Holler if you have any questions as you go along, and you're welcome to use my imatrix or none at all depending on your preference.
Finally, if you decide to upload it to HF, use the tag ik_llama.cpp so other quality-loving folks can enjoy it!
Cheers!
Any advice on which tensors to keep as bf16 for minimal gain in size, but high gain in quality? I usually just leave embed and output, never looked into the others.
None of them! The original model is fp8 natively; the bf16 is only for getting it working on more hardware, since native fp8 e4m3 support is only available on >=sm89 (NVIDIA RTX 4060 and newer).
Check out my recipes and just increase everything a notch or two, mixing iq6_k (down) and iq5_k (gate|up), and leave all the attn and anything on GPU at full q8_0, and you'll barely be fitting in 768GB already.
I'm away from the office so just a short reply, keep me posted on how you get along!
Ahh, I just learned about the Hugging Face safetensors viewer, which works like the GGUF viewer. Cool!
So the f32's stay as f32's, pretty sure, and are not quantized.
I'd have to look at the convert_hf_to_gguf.py script to understand exactly which GGUF names some of those other bf16 tensors correspond to.
I see, yeah, the token_embd (as it's called on the GGUF side) is native bf16. You could leave it like that if you wanted, I suppose, but it will definitely slow things down. It is common practice for smaller dense models to make token_embd q4_K and the output "head" q6_K or so.
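For example, on a small dense model that common practice would look roughly like this (the model paths here are just placeholders, not something I've tested):

```bash
# hedged sketch: token_embd at q4_K and the output "head" at q6_K,
# with the rest of the model at the chosen base type
./build/bin/llama-quantize \
    --token-embedding-type q4_K \
    --output-tensor-type q6_K \
    mistral-7b-instruct-bf16.gguf \
    mistral-7b-instruct-Q5_K_M.gguf \
    Q5_K_M
```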
You can do whatever you like and see if it suits your needs on your specific rig, go for it!
I've quantized 7b mistral and tested perplexity with ik quants on my personal dataset:
| Quant type | PPL |
|---|---|
| q5_k | 10.6075 |
| iq5_k | 10.6180 |
| q6_k | 10.5934 |
| iq6_k | 10.6208 |
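(For reference, a comparison like this can be scripted roughly as follows; the model path, dataset file, and thread count are placeholders rather than my exact setup.)

```bash
# rough illustration only: paths, context size, and thread count are placeholders
for q in Q5_K IQ5_K Q6_K IQ6_K; do
    ./build/bin/llama-quantize mistral-7b-bf16.gguf mistral-7b-$q.gguf $q
    ./build/bin/llama-perplexity -m mistral-7b-$q.gguf -f my-dataset.txt \
        --ctx-size 512 --threads 32
done
```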
Are they supposed to be used with imatrix? Is that why their ppl is higher than the normal ones?
Generally you don't need an imatrix at >= 5bpw, probably. Different models are different and will likely respond better to some quantization types than others. In general ik's newer quants tend to do better than his older quants (he wrote q6_K and also iq6_K). If you want to know the details, go to the ik_llama.cpp GitHub, check out the closed PRs, search for e.g. iq6_K, go to the oldest PR, and read his comments, then decide if it is what you want.
Also feel free to open a discussion on his fork, there are more folks learning how to use the new quants and I have a bunch of measurements floating around too e.g. https://github.com/ikawrakow/ik_llama.cpp/pull/602#issuecomment-3065995863
Also, I'm not sure how exactly you're running your perplexity test; it can be very sensitive to the corpus used etc., and I try to keep my stuff consistent between runs to be able to make good comparisons.
If you're interested in more one-on-one attention or discussion, let me know; I'm open to consulting if you need that level of depth.
Looking forward to what you decide to do!
Cheers!
I've done some updated recipes and testing now. I believe I have something with the lowest known perplexity for the given size:
IQ4_KS: 2.9584 +/- 0.01473, 554.421 GiB (4.638 BPW)
This model should be a great combination of maximum accuracy while retaining good speed.
I'm not sure the best way to upload a different revision, but will look into it today.
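One option might be pushing the new files to a separate branch (revision) so the original upload stays untouched; an untested sketch with placeholder repo, branch, and file names follows, and the branch may need to be created first (e.g. via the repo web UI or huggingface_hub's create_branch):

```bash
# untested sketch: repo id, local file, path in repo, and branch are placeholders
huggingface-cli upload your-user/Kimi-K2-Instruct-GGUF \
    ./Kimi-K2-Instruct-IQ4_KS-v2.gguf \
    IQ4_KS/Kimi-K2-Instruct-IQ4_KS-v2.gguf \
    --revision iq4_ks-v2
```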
### I have no proper GPU, pp is slow on CPU
## Attention [0-60]
# Only ik's fork uses this; keep it q8_0 as it's only used for PP with -mla 3
blk\..*\.attn_kv_b\.weight=q8_0
# ideally k_b and v_b are smaller than q8_0 as they are used for TG with -mla 3 (and ik's imatrix supports it)
# blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0 or iq4_nl
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0
# set q_a and q_b to q8_0, since they are small
blk\..*\.attn_q_(a|b)\.weight=q8_0
## First Single Dense Layer [0]
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0
## Shared Expert [1-60]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
./llama-quantize --custom-q "$custom" --output-tensor-type bf16 --token-embedding-type bf16 Kimi-K2-384x15B-Instruct-BF16.gguf Q5_K_M
quant size = 699666.91 MB
(I didn't do it on purpose, I swear!)
I've decided to do it like this. Since those tensors are the smallest (and probably the most impactful), I've decided to keep them in q8_0. By the way, I've noticed that output\.weight=bf16 does not function in the same way as --output-tensor-type bf16, so keep that in mind for the future: --output-tensor-type bf16 doesn't touch attn_output, so it gets converted to q5_k. I should probably have kept attn_output in q8_0 too... but it's too late now, I'm not requanting it for 12 hours again. Will do it for the base though.
(I didn't do it on purpose, I swear!)
By the way, I've noticed that output.weight=bf16 does not function in the same way as --output-tensor-type bf16, so keep that in mind for the future.
I've never tried setting anything to bf16 before. Usually if the input GGUF is already bf16 you just omit that line and it will not be quantized and will remain bf16 (I think, but double check; I've never actually tried it, as bf16 is huge and will slow things down, and q8_0 is probably 99.9% the same quality). But thanks for sharing how to force it to bf16 if someone wants to do that!
--output-tensor-type bf16 doesn't touch attn_output, so it gets converted to q5_k. I should probably have kept attn_output in q8_0 too...
Right, the attn_output is very different than that final non-repeating output.weight tensor. In my recipe it is set along with the rest of the attn to whatever you like. I see you made your recipe more complex and missed it. Yeah, I always have to check the logs for the first few minutes to make sure it is working how I like, as yes, it is a PITA to do this again.
I hate to break it to you, but that q5_k is probably going to hurt you :sob: hahaha... In my recent testing Kimi-K2-Instruct is very sensitive in the attn tensors, and I'm going back and increasing the size to q8_0 on some quants and getting much better perplexity results.
I mentioned it elsewhere, but my new IQ4_KS is almost identical in PPL to the full Q8_0 and will run faster for you. I have been collecting data points and updating a graph here, and I hope to release some better versions by the end of the weekend.
If you want faster PP, definitely go with -ub 4096 -b 4096 when running; that usually helps, and use --no-mmap with it. If you want a little more TG, go with -rtr and use the default batch sizes.
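As a sketch (the binary, model path, and thread count are placeholders), the two setups would look something like:

```bash
# option 1: bigger batches for faster PP (pairs with --no-mmap)
./build/bin/llama-server -m Kimi-K2-Instruct-IQ4_KS.gguf \
    -fa -fmoe -mla 3 --no-mmap -ub 4096 -b 4096 --threads 64

# option 2: default batch sizes plus run-time repacking for a bit more TG
./build/bin/llama-server -m Kimi-K2-Instruct-IQ4_KS.gguf \
    -fa -fmoe -mla 3 -rtr --threads 64
```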
Keep me posted on how it works out, and welcome to the quant cooking game hahaha
I hate to break it to you, but that q5_k is probably going to hurt you :sob: hahaha... In my recent testing Kimi-K2-Instruct is very sensitive in the attn tensors, and I'm going back and increasing the size to q8_0 on some quants and getting much better perplexity results.
### I have no proper GPU, pp is slow
## Attention [0-60]
# Only ik's fork uses this; keep it q8_0 as it's only used for PP with -mla 3
blk\..*\.attn_kv_b\.weight=q8_0
# ideally k_b and v_b are smaller than q8_0 as they are used for TG with -mla 3 (and ik's imatrix supports it)
# blk.*.attn_k_b.weight is not divisible by 256 so only supports qN_0 or iq4_nl
blk\..*\.attn_k_b\.weight=q8_0
blk\..*\.attn_v_b\.weight=q8_0
# set q_a and q_b to q8_0, since they are small
blk\..*\.attn_q_(a|b)\.weight=q8_0
# attn_output too!
blk\..*\.attn_output\.weight=q8_0
# Turn all of ffn_down_exps to q6_K instead of some
blk\..*\.ffn_down_exps\.weight=q6_K
## First Single Dense Layer [0]
blk\..*\.ffn_down\.weight=q8_0
blk\..*\.ffn_(gate|up)\.weight=q8_0
## Shared Expert [1-60]
blk\..*\.ffn_down_shexp\.weight=q8_0
blk\..*\.ffn_(gate|up)_shexp\.weight=q8_0
Requanted it again, but this time with attn_output=q8_0 and all of ffn_down_exps=q6_K. Final size is 723081.91 MB (≈ 706 GiB), so it still fits in my desired range. Is there anything else I can do to squeeze the most quality out of the last ~20GB?
lol I love that meme
So now that all the repeating attn/shexp/blk.0.ffn.* layers and the non-repeating final output.weight "head" are q8_0, token_embd is bf16, and all routed exps are q6_K, that is pretty much the biggest quant possible!
I'd recommend ik's new iq6_k over q6_K, but in your own testing you didn't like it, so that is fine. Probably wouldn't be a noticeable difference in practice.
For comparison I made basically this same quant, except with smaller output/token_embd non-repeating layers. For the routed exps I went down@iq5_ks and (gate|up)@iq4_ks, and it is already almost indistinguishable from the full Q8_0.
Some PPL data is in this crazy chart over on the GitHub issue thread if you want to see the exact numbers.
Save your last 20GB of RAM for context, operating system overhead, and keeping 100 tabs open in Firefox haha....
Nice job! I'd love to hear your final perplexity and any speed benchmarks if you decided to run!
Is -rtr the same as --run-time-repack? It's not documented in the help message.
I'd love to hear your final perplexity and any speed benchmarks if you decided to run!
Is there some standardised way to do it?
Is -rtr the same as --run-time-repack? It's not documented in the help message.
Yes. -rtr was introduced a while back in this PR. It will disable mmap() similar to --no-mmap, and on startup it will identify any tensors that will run in RAM (vs GPU VRAM, for example). It will then convert them on the fly to the _R4 row-interleaved variants, which can give some performance boost at low batch sizes for PP and possibly a little for TG as well.
However, in more recent PRs optimizations were made to the non-row-interleaved quants such that, even on MoE architectures, when running with larger batch sizes like -ub 4096 -b 4096 it tends to be faster with the non-_R4 quants.
In the past I released my models already repacked into _R4 versions. But now, with the new optimizations, I no longer do that and let end users decide if they want to -rtr or not. Generally I don't use it now and try to use -ub 4096 -b 4096 personally. But you have flexibility depending on what you're trying to optimize.
Is there some standardised way to do it?
Yes, the perplexity measurement is very sensitive to any variations, and I am very careful to be consistent across all my runs so that I can compare the values meaningfully. Here is the command I use for a CPU-only test. Let me know if you use a GPU on that thing and I can update the command for that:
$ wget https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz
$ gunzip wiki.test.raw.gz
$ sha1sum wiki.test.raw
6f1fe2054a940eebfc76b284b09680763b37f5ea wiki.test.raw
./build/bin/llama-perplexity \
-m "$model" \
-f wiki.test.raw \
-ctk f16 \
--ctx-size 512 \
-fa -fmoe \
-mla 3 \
--seed 1337 \
--threads 64
You can use -ub 4096 -b 4096 for speed-ups and get the same result. It is very important that the wiki.test.raw is the exact same file and that the context is exactly 512, e.g. it will say n_ctx=512 right before starting the test. Let it run to the end and report the Final estimate: PPL = 3.3505 +/- 0.07133 that prints out at the very end after processing all the chunks.
You can adjust the threads as needed. I like to report values using the full f16 unquantized kv-cache. I've measured the difference and using q8_0 doesn't hurt PPL much; I run q8_0 for actual use, but report values at full f16 these days. Keep in mind the defaults are f16 and also ctx 512, so it is okay to just omit those too; I put them in just to make it explicit.
The seed is not important and not used, I just put it there to see if people are finding my posts and using my methodology lol... :grin:
Cheers!
Oh wow! ikawrakow's GitHub got nuked. Not the first time they've done shit like that without a reason. My account is still shadowbanned and support doesn't do shit.
Yeah, for anyone who hasn't seen: https://www.reddit.com/r/LocalLLaMA/comments/1m4vw29/ikllamacpp_repository_gone_or_it_is_only_me/