Q6_K
Hey, by any chance could you also create a Q6_K file with this format please?
I'm thinking about possibly making two more, e.g. one that is a bit heavier and one that is a bit lighter. I've been testing the -mix-IQ3_K and it seems really good running locally, on par with DeepSeek-V3-0324 in my anecdotal opinion, but of course it is a reasoning model so it takes a bit longer.
The current -mix-IQ3_K also barely fits on my rig, so I have to close my Firefox browser to free up enough RAM, and I'm running a super lean Arch Linux + X11 + dwm tiling window manager + alacritty terminal setup. So having a leaner version could be handy, as most folks will likely have to run headless or have a little swap space set up to hold their browser RAM haha...
Any specific VRAM+RAM breakpoints you're working with regarding a possible IQ6_K version? I'd probably go full Q8_0 for all attention layers as they are pretty small, then do IQ6_K/IQ5_K for gate/(up|down) or similar...
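For the curious, here's a rough sketch of how I'd express that kind of mix with ik_llama.cpp's llama-quantize using --custom-q overrides. The tensor-name regexes, file names, and imatrix below are placeholders from memory rather than a tested recipe, so double-check them against the fork before running anything:

# attention: full q8_0 (they're small); routed experts: iq6_k down, iq5_k gate/up
# (one possible reading of the split above -- adjust to taste)
custom="
blk\..*\.attn_.*=q8_0
blk\..*\.ffn_down_exps\.weight=iq6_k
blk\..*\.ffn_(gate|up)_exps\.weight=iq5_k
"
# squash to the comma-separated list --custom-q expects
custom=$(echo "$custom" | tr '\n' ',' | sed 's/^,*//; s/,*$//')

# placeholder paths for the bf16 source gguf, imatrix, and output file;
# IQ6_K is the fallback type for anything the regexes don't catch, 24 = threads
./build/bin/llama-quantize \
    --imatrix imatrix-Qwen3-235B-A22B.dat \
    --custom-q "$custom" \
    Qwen3-235B-A22B-BF16.gguf \
    Qwen3-235B-A22B-mix-IQ6_K.gguf \
    IQ6_K 24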
Is it worth moving to this from IQ4_XS? Those extra flags like -fmoe and -rtr have gained me 87.79 t/s PP and 9.97 t/s TG. In regular llama.cpp I only get half of that, and I was going to get a smaller unsloth quant but they keep updating it and breaking my downloads.
How fast do your small DeepSeek quants run compared to Qwen? I know jack about offloading compared to just doing GPU inference, so much to learn.
Is it worth moving to this from IQ4_XS?
If you can fit this -mix-IQ3_K in your rig, I can definitely recommend it over the unsloth UD-Q3_K_XL, which is of a similar size class. I don't have numbers on the unsloth IQ4_XS, but I am working on more benchmarks now and hope to do a post on r/LocalLLaMA soon :tm:.
Here is a sneak peek of what I have already:
Interestingly, bartowski's Qwen3-30B-A3B ~4bpw quants are looking very competitive; I hope to work more on that model soon :tm: too!
How fast do your small DeepSeek quants run compared to Qwen?
In my limited testing on my local rig, I'd choose my Qwen3-235B-A22B-mix-IQ3_K every time now over my DeepSeek-V3-0324-IQ2_K_R4 or unreleased DeepSeek-R1-GGUF-Q2_K_R4. Qwen3 is much faster, and in limited testing the quality also feels better than V3-0324 at least, and on par with or possibly better than smaller R1 quants, at least for coding-type tasks.
Here is the speed graph running my Qwen3-235B-A22B-mix-IQ3_K locally on a 3090TI FE 24GB VRAM + AMD 9950X 2x48GB DDR5-6400 rig.
Also, ik may be working on more improvements to the GQA flash attention (FA) implementation on his fork, which could possibly improve speed even more for Qwen3 and similar models.
Seeing that KLD, I'm glad I didn't waste time with the other quant. DeepSeek is better for creative tasks, unfortunately. I saw surprisingly decent speeds in the ik_llama discussions; my assumption would have been 2-3 t/s at best without fancy new-generation Xeons.
The IQ4 gives me outputs similar to the API's; if this does too and generates faster, it would be a win.
Sorry to chime in here, but is there any possibility of a Q4_K? I could fit it into my PC using ~20GB of RAM.
Now I'm downloading this one and it should fit fully in VRAM in my case. Would there be any issues with using only CUDA?
EDIT: Tested on full CUDA and working fine! Pretty nice results while testing the model.
@Lockout keep us posted, I'd love to hear if this meets your quality expectations! I've been impressed with it so far.
@Panchovix oh hey, you have all the GPUs, yes! Correct, I thought of you while making this model and did not repack the quants to _R4 myself, to allow a wider variety of VRAM+RAM combinations to work out of the box. If someone wants to run on RAM they can use -rtr or the offline repack tool themselves easily enough without downloading anything more.
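As a rough illustration (the paths, -ngl value, and exact -ot split are placeholders, and the flags are just the ones already discussed in this thread), a hybrid run that keeps the routed experts in RAM and lets -rtr repack them at load time might look like:

# keep attention/shared tensors on GPU, routed experts in system RAM;
# -rtr repacks the RAM-side tensors to the _R4 layout at load (and disables mmap)
./build/bin/llama-server \
    -m Qwen3-235B-A22B-mix-IQ3_K.gguf \
    -ngl 99 \
    -c 32768 \
    -fa -fmoe -rtr \
    -ctk q8_0 -ctv q8_0 \
    -ot "blk\..*\.ffn_.*_exps\.=CPU"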
If anyone is interested, I have some limited benchmarks for speed and quality on my fresh new Qwen3-30B-A3B-mix-IQ4_K.gguf; it hits over 1600 tok/sec PP and 105 tok/sec TG peak on my 3090TI FE 24GB VRAM!
Heh, it's almost done downloading even, an hour left. I did not find any benefit with ik_llama for full GPU inference; it was slower than mainline. Maybe it's different if you pass -fmoe and some of the other flags, but dense models lost t/s. Come to think of it... I can no-shit fully offload this quant too. If I recruit my 5th GPU I'll have 118GB. I could also install 1 or 2 24GB P40s or a P100... power consumption isn't worth it though.
Also getting curious about THP and whether that will help. It says to run it without mmap, so -rtr is fine... but do I then turn off -rtr to "benefit"? From the issue it says to clear caches when switching so the weights will load from HDD once again. I have a dual-socket system with 1 NUMA node per socket. Maybe Q3/Q4 of Qwen is too small to need any of that?
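For reference, the cache clearing and THP status check are just the standard Linux knobs, nothing ik_llama-specific; something like this (needs root) between runs:

# show the current transparent huge page policy ([brackets] mark the active one)
cat /sys/kernel/mm/transparent_hugepage/enabled

# flush dirty pages, then drop the page cache so the next run re-reads the
# weights from disk instead of whatever the previous mmap/-rtr run left cached
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches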
On 128GB VRAM (5090 + 4090x2 + A6000) but slow PCIe, I get lower speeds vs mainline, yes, but it's still pretty fast.
At X8/X8/X4/X4, I get 500 t/s PP and 23 t/s while generating (IQ3, ik_llama.cpp).
On mainline I get the same PP but 28 t/s while generating (UD-Q3_K_XL).
On UD-Q4_K_XL with CPU offloading (20GB of RAM or so) I get 300 t/s PP and 20 t/s while generating on mainline llama.cpp.
With this quant I see some 11.x output token speeds, so it's slightly faster. I run it like this with 32k context:
--numa distribute \
-ngl 94 \
-ctk q8_0 \
-ctv q8_0 \
-fa \
-rtr \
-fmoe \
-ub 1024 \
-amb 1024 \
-ot "(1[0-9]).ffn_.*_exps.=CUDA0" \
-ot "(2[0-9]|3[0-8]).ffn_.*_exps.=CUDA1" \
-ot "(4[0-9]|5[0-8]).ffn_.*_exps.=CUDA2" \
-ot "(6[0-9]|7[0-8]).ffn_.*_exps.=CUDA3" \
-ot "([8-9]|[1-9][0-9])\.ffn_.*_exps\.=CPU" \
The bigger ubatch increases PP speed rather than TG t/s, and makes bigger buffers on CUDA. I don't know if -amb 1024 vs 512 makes a difference. PP is now over 100 t/s.
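As an aside, here's a quick pure-bash sketch to eyeball where those -ot regexes send each layer's experts. It assumes 94 layers (per -ngl 94 above) and that the first matching pattern wins, which is my understanding of how the overrides are applied, so treat it as a sanity check only:

# print which device each layer's routed-expert tensors would land on,
# using the same regexes as the -ot flags above
for i in $(seq 0 93); do
  name="blk.$i.ffn_gate_exps.weight"
  dev=default
  if   echo "$name" | grep -qE '(1[0-9]).ffn_.*_exps.';             then dev=CUDA0
  elif echo "$name" | grep -qE '(2[0-9]|3[0-8]).ffn_.*_exps.';      then dev=CUDA1
  elif echo "$name" | grep -qE '(4[0-9]|5[0-8]).ffn_.*_exps.';      then dev=CUDA2
  elif echo "$name" | grep -qE '(6[0-9]|7[0-8]).ffn_.*_exps.';      then dev=CUDA3
  elif echo "$name" | grep -qE '([8-9]|[1-9][0-9])\.ffn_.*_exps\.'; then dev=CPU
  fi
  echo "layer $i -> $dev"
done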
OK... update as to quality...
So here I am torn: the model is more likely to not know what "mesugaki" means in IQ3, but reloading the IQ4 has caused it to screw up often as well. I had this strange issue where I got better outputs when I offloaded more to CPU, and repetition got to be less as well. I was testing CPU-only inference with CUDA PP. Now that I run it more, the IQ4 and IQ3 are both printing fairly similar t/s too.