Would you consider making a Q5 version with a group size of 16?


As per the title, thanks for these!

Unfortunately MLX only allows group sizes of 32, 64, and 128. If you have the memory for it, I could upload a q6-hi; it is quite a bit better.
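For reference, a minimal sketch of how these quants get produced; it assumes the mlx_lm Python convert API (argument names may differ between releases), and the "hi" variants are simply the same bit width encoded with the smallest group size the quantizer accepts, 32:

```python
# Hedged sketch, assuming mlx_lm's convert() exposes q_bits / q_group_size
# (check your mlx_lm version). Group size 16 is rejected by the quantizer,
# which is why the smallest "hi" option is 32.
from mlx_lm import convert

convert(
    "zai-org/GLM-4.5-Air",          # upstream repo, for illustration only
    mlx_path="glm-4.5-air-q6-hi",   # hypothetical output directory
    quantize=True,
    q_bits=6,
    q_group_size=32,                # smallest allowed; 16 is not an option
)
```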

One thing I noticed with the hi quants is that you can reduce the number of experts to cut down on thinking, and it still works. The q6-hi seems to work best with 6-7 experts per token. I was able to load it on a 128GB Mac with limited context, and it worked fine. It's uploading now.
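If you want to experiment with that, here is a rough sketch of the idea, with the caveat that the config key is an assumption: many MoE checkpoints expose the routed-experts-per-token count in config.json (often as `num_experts_per_tok`, but check your model's config), and not every runtime treats changing it as a supported option:

```python
# Hedged sketch: lower the number of routed experts per token by editing the
# local model's config.json. Key name and default vary by architecture; this
# is an experiment, not a supported switch.
import json
from pathlib import Path

cfg_path = Path("glm-4.5-air-q6-hi") / "config.json"   # local MLX model dir (placeholder)
cfg = json.loads(cfg_path.read_text())

print("current experts per token:", cfg.get("num_experts_per_tok"))
cfg["num_experts_per_tok"] = 7                          # try 6-7 instead of the stock value
cfg_path.write_text(json.dumps(cfg, indent=2))
```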

Many thanks. Downloading the Q6-hi now; it should be perfect for 128GB with maybe a slight allocation bump over the default ~96GB VRAM limit. Hoping it holds together better over longer context/outputs.


Q6-hi: 87.56GB fresh load, ~90.5GB spike during prompt processing at 10k context. This is looking perfect: within the default allocation with headroom at short context, and not red-lining too badly at longer context. Thank you!
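For anyone else wanting the allocation bump mentioned above, a hedged sketch; it assumes a recent macOS where the wired GPU memory limit is exposed as the `iogpu.wired_limit_mb` sysctl (older releases used a `debug.iogpu.wired_limit` key), and the setting resets on reboot:

```python
# Hedged sketch: raise the Metal wired-memory limit so the model plus KV cache
# fits. Just shells out to sysctl; needs sudo and resets on reboot.
import subprocess

new_limit_mb = 118 * 1024   # e.g. ~118GB on a 128GB machine, leaving room for the OS

subprocess.run(
    ["sudo", "sysctl", f"iogpu.wired_limit_mb={new_limit_mb}"],
    check=True,
)
```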

Awesome, thank you for the confirmation!
You could also try removing one expert if it thinks too long; I found that in the "hi" quants the experts are "smarter" and bump heads if there are too many in the room.


Really interesting about the experts; I shall play around.
It already feels stronger than the standard Q6 MLX.

The q6-hi should be around q8 quality, with the added benefit of speed and "eagerness to solve". That is a behavior I noticed, as opposed to the quants encoded with group size 64. It does not always offer a benefit, and I imagine creative writing might be affected, but in coding it does make a difference: I see it go the extra mile to finalize a code snippet rather than leaving ellipses all over the place to come back and fix later.

Also try setting the repetition penalty to 1.08, or even down to 1.02 for expedience. The usual Qwen "MoE squeezes" work the same way for GLM.
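A rough sketch of plugging that in with mlx_lm; it assumes a recent mlx_lm where sampling is configured via make_sampler / make_logits_processors (older releases took repetition_penalty directly on generate()), and the model path is a placeholder:

```python
# Hedged sketch: repetition penalty 1.08 with mlx_lm; drop to 1.02 for expedience.
from mlx_lm import load, generate
from mlx_lm.sample_utils import make_sampler, make_logits_processors

model, tokenizer = load("glm-4.5-air-q6-hi")            # hypothetical local path
sampler = make_sampler(temp=0.7, top_p=0.95)
processors = make_logits_processors(repetition_penalty=1.08)

print(generate(
    model, tokenizer,
    prompt="Write a PL/pgSQL function that upserts a settings row.",
    sampler=sampler,
    logits_processors=processors,
    max_tokens=512,
))
```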


Thanks for the new Unsloth versions!

Would you be willing to convert Unsloth's new fixed GLM 4.5 to a 2-bit MLX?

I tried; it just doesn't want to work in MLX below 3-bit. It becomes unstable, gets stuck in loops, and is quite limited in what it can do.

Interesting: that's been my experience with the other 2-bit MLX versions too, but Unsloth's IQ2_XXS GGUF holds together much better.
Not sure what else they've done, or how else MLX affects the output.

MLX sets some limits on how crazy you can go combining layers and precisions, and usually the result is not as good as expected. There is more combinatorial freedom with GGUF 😀


In this case, hoping for an MLX update. Given that GGUFs are running reasonably well now on Macs with flash attention, do you think there are any hardware reasons why Apple couldn't make MLX as flexible?

I don't think so, just fewer practical options. The DWQ quants are amazing when the training works; others are touch-and-go depending on the dataset. I had some success with mixed-precision qx quants, but had to wire the layers myself because there are no quant predicates for 6/8 bits, and definitely none that combine 5/6/8. If you go to lower combinations there are predicates available, but the quality is low.

I regularly get better output from a qx6/qx6-hi quant than from a q8, or in some cases even bf16.
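To illustrate the layer wiring, a hedged sketch of a custom predicate; it assumes mlx_lm's convert() accepts a quant_predicate callable taking (path, module, config) and returning either a bool or a bits/group-size override, which has changed between releases, so treat it as pseudocode against a current version:

```python
# Hedged sketch of a mixed 6/8-bit recipe: keep embeddings, attention, and the
# output head at 8 bits, route everything else to 6 bits, all at group size 32.
from mlx_lm import convert
import mlx.nn as nn

def qx68_hi_predicate(path: str, module: nn.Module, config: dict):
    if not hasattr(module, "to_quantized"):
        return False                                    # leave non-quantizable modules alone
    if any(k in path for k in ("embed", "attn", "lm_head")):
        return {"bits": 8, "group_size": 32}            # protect the sensitive layers
    return {"bits": 6, "group_size": 32}                # everything else at 6 bits

convert(
    "zai-org/GLM-4.5-Air",                              # upstream repo, for illustration
    mlx_path="glm-4.5-air-qx68-hi",                     # hypothetical output directory
    quantize=True,
    quant_predicate=qx68_hi_predicate,
)
```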


Thanks for these details. I somewhat understand why quantisation can help improve some outputs (I'm guessing mostly the shorter ones), but I struggle to see how long, complex outputs like code wouldn't suffer greater degradation.

Mainly trying to maximise input and especially output length while preserving as much quality as possible.

Yeah, you mean like seriously coding, not just vibe coding. I would not go below q6 on any model. The smaller the model, the more it loses with small quants. The 4B (any 4B) loses its character below q6, and I could rescue a few by quanting qx5 (but then the size increases, so there is little benefit). Even when I had a qx4 looking like it was working, emoticons and all, it was hallucinating success, and you see that a few messages downstream, as you mentioned. The larger models can tolerate smaller-bit quants only because few people care to examine the parent model's character first before expecting it to code some Python boilerplate. I got a 0.6B to code Haskell; coding is not the test.

Let me give you an example. I was working with a Qwen MoE a few days ago, trying to figure out why my fresh qx6 quant didn't want to write PL/Perl for a subset of my project. It just ignored my prompt and I got some clean PL/pgSQL instead, well written. When I looked in the think tag, it had considered it, but then changed its mind because for that part of the project it just wasn't necessary, and PL/pgSQL would do fine. Not only that, but it told itself that if the user asks, it should make a case for it; otherwise, just streamline a good product. It did not bother to make a case in the open tag; it just wasn't worth the conversation to it.

This kind of behavior I observed only from the qx6, while the q6 and q8 faked it by naming the section plperl and writing PL/pgSQL instead, and that is because most MoEs are invariably ignorant of how to properly implement PL/Perl transactions, see it as high complexity, and fake their way out with something. You might think that in the end I got the same thing, but the qx6 at least offered more refinement and properly named the function as PL/pgSQL, so it did not lie to me. Not openly; it just softly redirected my approach so it had less work to do.

If you see I upload 10 quants of a model, I am testing something. Pick q6/qx6 as the safe bets

If you see I only upload one quant, that was the golden quant: the best I could get out of that model, with nothing else worth uploading to compare against it. I have a few cases of that, where a q6 or even a q5 (even without hi) just blew me away. It doesn't have to be coding; it's a special something about that model, sometimes too silly to put on the model card, but I'm sure someone else will find the same thing. Box of chocolates.

From recent benchmarks I ran on various models, I have numbers showing that the qx and hi quants reach the apex of their compression level, some better than bf16. The full-precision model hunts in the think tag for five minutes and gives me boilerplate; the qx6 thinks for 15 seconds, plans on the go, and delivers above expectations. There are a few models like that in my collection that are just go-go-go, and they keep track of stuff well too. I didn't know why until I ran the tests.

If you're looking for a small model that has some good character and solid output, check this one out

https://huggingface.co/nightmedia/QiMing-Janus-q6-hi-mlx

It's based on a Qwen3-14B with 40k context and it can be roped into oblivion; it keeps track of stuff well for its model class, and "knows" more code than a MoE.
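On the "roped into oblivion" part, a hedged sketch of the usual Qwen3 YaRN recipe for stretching the native window; the rope_scaling block below follows the upstream Qwen3 documentation, but whether a given MLX runtime honors it, and how far quality holds up, is something to verify yourself:

```python
# Hedged sketch: patch config.json with a YaRN rope_scaling block to extend
# the context window (a 4x factor over the 32k native window in this example).
import json
from pathlib import Path

cfg_path = Path("QiMing-Janus-q6-hi-mlx") / "config.json"   # local model dir (placeholder)
cfg = json.loads(cfg_path.read_text())

cfg["rope_scaling"] = {
    "rope_type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}
cfg["max_position_embeddings"] = 131072                     # advertise the stretched window

cfg_path.write_text(json.dumps(cfg, indent=2))
```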

In this case the hi model was better than the qx model, which was not the case with the parent model, so it really also depends on the training. That is definitely true for this model, as its thinking pattern has been changed, so the old tricks don't work.

I am evaluating the whole series and have uploaded most of the QiMing models; they are surprisingly good, and distinctly different from their parent.

I've seen similar behavior with different quants, especially with the reasoning models: it seems a good quant cuts through the noise for shorter tasks, but a bad one makes confidently incorrect hallucination much more likely.

With a few models I've seen BF16 be more likely to fail (getting lost building out too many branches, or going too deep), but also having a small chance of finishing something amazing that a quant couldn't explore.

I appreciate you drawing attention to the eagerness to complete and the shortcut tactics, and to how different quant flavours affect them: I shall keep a closer eye on that and think about some testing and mitigation approaches.

Looking for: an architect/complex model for long code outputs, and also a faster model for iteration and tweaks.
Blown away by Unsloth's 2-bit GLM GGUF, but it is borderline struggling: not very sure of its ideas in the thinking tags, yet it manages to hold together surprisingly well and sometimes produces code quality and length no other local model I've tried can, ≥20k of functional output.
Qwen2.5 72B has produced up to 16k of complex code output; good rigour but not very creative.
Testing Kimi-Dev 72B: too early to share results, but good initial signs on quality and length.
GLM Air is solid, but it does feel sensitive to the MoE issues you highlight and struggles with long, complex code output; it seems stable to ~12k output.
Qwen3 Coder A3B is great for smaller sections but just doesn't have anywhere near enough active parameters to handle complex interactions and longer outputs.

Also very interested to try your Seed-OSS versions next: the assignable thinking budget seems a game changer, and the 36B size on 128GB machines opens up some interesting combinations of context, thinking length, and quantisation.
I see an FP16 MLX version available at ~73GB.

Given its long-context abilities, thinking of trying (KV-cache settings sketched below):
FP16, full-precision KV cache, for ≤32k high-complexity code design
FP16 with a quantised KV cache, for less important and/or longer outputs
Q8-hi, full-precision KV, for even longer outputs
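The KV-cache side of those variants, as a hedged sketch with mlx_lm; it assumes a recent mlx_lm where generate() forwards kv_bits / kv_group_size / quantized_kv_start to the decoding step (older releases only expose these as CLI flags), and the model path is a placeholder:

```python
# Hedged sketch: generate with an 8-bit quantised KV cache, keeping the first
# 4k tokens of context in full precision before quantisation kicks in.
from mlx_lm import load, generate

model, tokenizer = load("Seed-OSS-36B-q8-hi-mlx")        # hypothetical local path

design_prompt = "Design a plugin architecture for ..."   # long-context prompt goes here

print(generate(
    model, tokenizer,
    prompt=design_prompt,
    max_tokens=20_000,
    kv_bits=8,                  # omit for the full-precision KV variants
    kv_group_size=64,
    quantized_kv_start=4096,    # keep the early context unquantised
))
```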

Running some rough theoretical numbers: it seems a Q8 with full-precision KV at 64k-128k context could suffer 5-10% degradation on MMLU and a 15-25% error rate; perhaps stable for 20-30k of complex code output?
And for Q8-hi or Q6-hi?

Any thoughts on Seed-OSS and which quant flavours work best for getting the most out of its long context and assignable thinking?
Wouldn't mind even a very slow option for overnight planning or module mergers.

More aggressive quant = more thinking space, but more chance of lost nuance, hallucination, error accumulation, and drift.
Very interested in where the sweet-spot balance between thinking budget and quantisation lies, relative to different context, input, and output lengths, and for different goals like quality, error types and rates, creativity, rigour, and success rate.

Seems that Seed-OSS gives us a lot of options to try.

Thanks for the link - QiMing looks very interesting!


Some interesting details in that paper, highlighting the strengths and weaknesses of different approaches for different purposes and trade-offs.
I like that the paper looks in detail at energy use too.

Glad to have found that paper. It seems to warn away from using KV quants if possible, which makes it even more important that MLX becomes as flexible as GGUF, to give MLX quant makers more ability to customise for different use cases and hardware.

It really depends on the architecture. I have seen Qwen3 respond quite well to some extreme methods, others not so much. Llamas are right behind Gemmas in being weird about quanting. They are right about balance, though. I saw that some of the qx/qx-hi quants push one edge towards performance and lose a bit in social skills, while the q-hi enhanced ones got better but stayed boring. The best so far at being malleable are the 4B and 30B Qwens, especially the MoE. You can get unusually good traces from a recipe that utterly failed on a dense model.

I am seeing this in the latest Hermes: it's weird about anything but q4, q6, or q8. Uploading a q4-hi that seems to be less weird about being dense.

Made a new one with mixed layers; it worked great on a long-context prompt.
