bartowski
posted an update 15 days ago
Looks like Q4_0_N_M file types are going away

Before you panic: there's a new "preferred" method, which is online (I prefer the term on-the-fly) repacking. If you download Q4_0 and your setup can benefit from repacking the weights into interleaved rows (what Q4_0_4_4 was doing), it will do that automatically and give you similar performance (minor losses, I think, due to using intrinsics instead of assembly, but intrinsics are more maintainable)

You can see the reference PR here:

https://github.com/ggerganov/llama.cpp/pull/10446
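
For a rough picture of what "repacking into interleaved rows" means, here's a simplified sketch (this is not the actual llama.cpp code from the PR; the struct layout, names, and the 4-row interleave factor are just illustrative assumptions): a Q4_0 tensor is stored as blocks of 32 4-bit weights plus one scale, and repacking reorders the blocks of 4 consecutive rows so a SIMD kernel can load one block from each of the 4 rows contiguously.

```cpp
// Simplified, hypothetical illustration of interleaved-row repacking.
// block_q4_0 roughly mirrors llama.cpp's Q4_0 block (one fp16 scale plus
// 16 bytes of packed 4-bit quants = 32 weights); exact details may differ.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

struct block_q4_0 {
    uint16_t d;        // fp16 scale, stored here as raw bits
    uint8_t  qs[16];   // 32 x 4-bit quants, two per byte
};

// Reorder blocks so that block b of rows r..r+3 sit next to each other:
// row0-b0, row1-b0, row2-b0, row3-b0, row0-b1, row1-b1, ...
// A SIMD kernel can then work on 4 rows per pass instead of 1.
std::vector<block_q4_0> repack_interleave_4rows(const std::vector<block_q4_0>& src,
                                                std::size_t blocks_per_row) {
    const std::size_t nrows = src.size() / blocks_per_row;  // assumes nrows % 4 == 0
    std::vector<block_q4_0> dst(src.size());
    for (std::size_t r = 0; r < nrows; r += 4) {
        for (std::size_t b = 0; b < blocks_per_row; ++b) {
            for (std::size_t i = 0; i < 4; ++i) {
                dst[r * blocks_per_row + b * 4 + i] = src[(r + i) * blocks_per_row + b];
            }
        }
    }
    return dst;
}

int main() {
    // Toy tensor: 8 rows x 2 blocks per row (i.e. 8 x 64 weights).
    std::vector<block_q4_0> src(8 * 2);
    std::vector<block_q4_0> dst = repack_interleave_4rows(src, 2);
    std::printf("repacked %zu blocks\n", dst.size());
    return 0;
}
```

The point is that this reorder can be done once at load time (online / on-the-fly) instead of being baked into a separate Q4_0_4_4 file on disk.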

So if you update your llama.cpp past that point, you won't be able to run Q4_0_4_4 (unless they add backwards compatibility back), but Q4_0 should be the same speeds (though it may currently be bugged on some platforms)

As such, I'll stop making those newer model formats soon, probably end of this week unless something changes, but you should be safe to download Q4_0 quants and use those!

Also, IQ4_NL supports repacking, though not in as many shapes yet, but it should get a respectable speed-up on ARM chips. The PR for that can be found here: https://github.com/ggerganov/llama.cpp/pull/10541

Remember, these are not meant for Apple silicon since those use the GPU and don't benefit from the repacking of weights

Huh, interesting. However, all inference engines need to adopt the newer llama.cpp version, correct? For both Q4_0 and IQ4_NL? I just scrolled through the pull request. How do you know IQ4_NL is supposed to work this way as well?

·

oh right sorry, forgot to include that PR, i'll add it above but it's here:

https://github.com/ggerganov/llama.cpp/pull/10541

I think the inference engines will just need to update to the newer versions and they'll get the repacking logic for free. If that's what you meant, then yes.

Btw for anyone late to the game - Q4_0_N_M quants should still work as expected in KoboldCpp. The runtime repack for q4_0 should work as well, so you have multiple options.

·

hell yeah. wish we could still offline compile, i get why it's not sustainable in the future, but until there's better support and more options it would be nice to keep it around

Interesting. In this case, will the Q4_0 description "Legacy format, generally not worth using over similarly sized formats" change to something like "ARM recommended (do not use on Apple Silicon)"? Or will IQ4_NL be added to the list and recommended over Q4_0?

·

I've updated it to "Legacy format, offers online repacking for ARM and AVX CPU inference." It is still legacy overall, but with the online repacking it is worth considering for speed.

I'm hoping that IQ4_NL gets a few more packing options in the near future

Hello bartowski, have you considered doing q4_1 quants? My testing has consistently found that q4_1 is the best quant for Macs. More info here:
https://huggingface.co/mradermacher/model_requests/discussions/299

·

Don't love adding more formats but if your results are accurate it does seem worth including

A bit annoying, isn't it? Some time ago I asked you for an ARM version of gemma-2-9b-it-abliterated. So now it won't work again. I guess there is no Q4_0?

·

oh, yeah, of course.. I added all the ARM quants but then not Q4_0 which is now the only one that would work haha..

I'll go and make a Q4_0 for it I suppose! Just this once.

Now that the software I'm using has updated its llama.cpp version, I'm changing GGUFs. I don't get what's meant by IQ4_NL; does this include IQ4_XS? So is IQ4_XS also supposed to run performantly on ARM, or just Q4_0?

On a side note, since I had good performance with Q4_K quants in the past, I wish those would also benefit from these changes.

·

No, it does not include the XS. The reason Q4_0 and IQ4_NL work, I think, is that they don't do any clever packing of the scaling factors; that's why K quants and IQ4_XS (which is like NL but with some K-quant logic) don't work yet.
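
To make the "clever packing of the scaling factors" point a bit more concrete, here's roughly how the block layouts compare (simplified and approximate, not verbatim ggml definitions): Q4_0 keeps one independent scale per 32-weight block, while Q4_K bit-packs the 6-bit scales and mins of eight sub-blocks into a shared array inside a 256-weight super-block, which the interleaving kernels would also have to unpick.

```cpp
// Approximate block layouts, simplified from ggml; field names and exact
// sizes here are illustrative and may not match the real headers.
#include <cstdint>

// Q4_0: one scale per 32 weights, nothing shared between blocks, so
// interleaving whole blocks from several rows is a plain reorder.
struct block_q4_0 {
    uint16_t d;          // fp16 scale
    uint8_t  qs[16];     // 32 x 4-bit quants
};

// Q4_K: a 256-weight super-block. The 6-bit scales and mins of its eight
// 32-weight sub-blocks are bit-packed together into `scales`; that shared,
// bit-packed metadata is the "clever packing" the current interleaved-row
// kernels don't handle yet.
struct block_q4_K {
    uint16_t d;          // fp16 super-scale for the sub-block scales
    uint16_t dmin;       // fp16 super-scale for the sub-block mins
    uint8_t  scales[12]; // packed 6-bit sub-block scales and mins
    uint8_t  qs[128];    // 256 x 4-bit quants
};

int main() { return 0; }  // layouts only; nothing to run
```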