Different Versions

#4
by TheStamp - opened

Much appreciated for these models! I'm curious about the different versions, can you point me to an article or discussion that goes into what the differences are between the files? Like between q4 and q5, and _0, _1, _2, etc?

I'm pretty quick to learn, but my google-fu has been failing me on the naming convention of these files!

Thanks again!

Yeah some of these methods are very new and there's not much documentation on them.

Here's a table that's part of the README for llama.cpp:

[Table from the llama.cpp README showing, for each quantisation method at 7B and 13B: perplexity (ppl), file size, and inference speed in ms/tok at 4 and 8 threads]

In this table, ppl means perplexity. It's an artificial benchmark that gives a ballpark estimate for the quality of a model's inference. Lower scores are better.
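To make "perplexity" a bit more concrete: it's the exponential of the average negative log-likelihood the model assigns to the correct next token across a test text. Here's a minimal sketch (not llama.cpp's actual code, and the inputs are made up):

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(average negative log-likelihood per token).

    token_log_probs: natural-log probabilities the model assigned to each
    actual next token in the evaluation text (hypothetical inputs here).
    """
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Toy example: a model that assigns ~37% probability to every correct token
# scores roughly 2.7; a perfect model (p = 1.0 everywhere) would score 1.0.
print(perplexity([math.log(0.37)] * 100))
```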

So looking at 7B, the full unquantised model (F16) scores 5.9565. Q5_1 scores 5.9934, which is about 0.6% worse. Whereas q4_0 scores 6.2103, which is around 4% worse than unquantised. The difference between q4_0 and q5_1 is about 3.5%.
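(For reference, those percentages are just relative differences between the perplexity scores quoted above:)

```python
# 7B perplexity scores as quoted from the llama.cpp README table
f16, q5_1, q4_0 = 5.9565, 5.9934, 6.2103

def pct_worse(worse, better):
    return (worse - better) / better * 100

print(f"q5_1 vs F16 : {pct_worse(q5_1, f16):.2f}% worse")   # ~0.6%
print(f"q4_0 vs F16 : {pct_worse(q4_0, f16):.2f}% worse")   # ~4.3%
print(f"q4_0 vs q5_1: {pct_worse(q4_0, q5_1):.2f}% worse")  # ~3.6%
```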

Against the quality difference one also needs to consider other factors. RAM usage might decide whether you can load the model at all (you never want to be swapping, else performance goes through the floor), and then there's inference speed. That's represented in the table by ms/tok, which is milliseconds per token. @4th means using four threads for inference (-t 4) and @8th means eight threads (-t 8).

E.g. for 7B, q4_0 runs at 47 ms/tok @8th compared to 59 ms/tok for q5_1. So q5_1 has better theoretical quality, but will take roughly 25% longer to return a given response.
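To put that in perspective, here's a quick back-of-the-envelope calculation (the 200-token response length is just an assumed example):

```python
ms_per_tok = {"q4_0": 47, "q5_1": 59}  # 7B @ 8 threads, from the table above
response_tokens = 200                   # assumed length of a typical response

for name, ms in ms_per_tok.items():
    seconds = ms * response_tokens / 1000
    print(f"{name}: ~{seconds:.1f} s for {response_tokens} tokens")

# q4_0: ~9.4 s, q5_1: ~11.8 s -- i.e. roughly a quarter longer for q5_1
print(f"slowdown: {(59 - 47) / 47:.0%}")
```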

A user needs to compare all these factors and decide what's most important to them.

For 7B I think most people would be best advised to use the highest-quality quantisation available, because 7B is already a small model, the RAM differences are unlikely to affect many people, and 25% slower than "very fast" is still pretty quick.

When we start looking at 30B or 65B models, the RAM and speed differences between q4 and q5 become more significant. It might be that you have enough RAM for q4_2 but not for q5_1, for example, and if you're only getting two tokens/s, you might be loath to slow that down by a further 25%.
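If it helps with that kind of decision, a very rough way to estimate the RAM needed for the weights is parameters × effective bits per weight ÷ 8, where "effective bits" includes the per-block scale/offset overhead. The figures below are approximations (about 4.5 bits/weight for q4_0, about 6 for q5_1), the 32.5B parameter count is the nominal size of the "30B" LLaMA model, and real files also need headroom for the context/KV cache:

```python
# Rough weight-memory estimate: params * effective bits per weight / 8
# Effective bits are approximate and include per-block scale/offset overhead.
bits_per_weight = {"q4_0": 4.5, "q5_1": 6.0}

params = 32.5e9  # the "30B" LLaMA model is ~32.5B parameters

for name, bits in bits_per_weight.items():
    gib = params * bits / 8 / 2**30
    print(f"{name}: ~{gib:.0f} GiB of weights, plus context/KV-cache overhead")
```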

Also, the delta between perplexity scores (and thus accuracy) shrinks as the base model gets larger, meaning the quality loss from lower-bitrate quantisation may be less noticeable because there's more information in the model to begin with. Looking at 13B in the table, the perplexity difference between q5_1 and q4_0 is only 3.2% - a slightly lower delta than the 3.5% we saw at 7B. And I believe the difference narrows further for the larger models.

So for a 30B or 65B model one may well prefer q4_0 or q4_2 to q5_0 or q5_1, where the opposite may be true at 7B.

This is amazing! Thank you so much, I appreciate your added insight to help in the decision making!

TheStamp changed discussion status to closed
