Static quants should be pretty much identical, apart from differences caused by the version of llama.cpp used to make them. If there are significant differences, this is an upstream llama.cpp issue and must be reported there.
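If anyone wants to verify that for a concrete pair of quants, here is a minimal sketch that compares them tensor by tensor. It assumes the gguf-py package that ships with llama.cpp (`pip install gguf`) and its GGUFReader API; the file names are placeholders.

```python
# Sketch: compare two static GGUF quants tensor by tensor.
# Assumes the gguf-py package from the llama.cpp repo; paths are placeholders.
import hashlib

from gguf import GGUFReader


def tensor_digests(path: str) -> dict[str, tuple[str, str]]:
    """Map tensor name -> (quant type, sha256 of raw tensor data)."""
    reader = GGUFReader(path)
    return {
        t.name: (t.tensor_type.name, hashlib.sha256(t.data.tobytes()).hexdigest())
        for t in reader.tensors
    }


a = tensor_digests("model-A.Q4_K_M.gguf")
b = tensor_digests("model-B.Q4_K_M.gguf")

for name in sorted(set(a) | set(b)):
    if a.get(name) != b.get(name):
        print(f"differs: {name}  {a.get(name)}  vs  {b.get(name)}")
```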
A cursory look shows that the tokenizers are quite different. Unless the model changed over time, this would indicate an issue with llama.cpp's conversion script. Given that the llama.cpp developers think what we do here is useless, good luck getting that fixed.
Good news then, it's fixed.
There is no issue with my quants - they were made with an older version of llama.cpp that generated worse results than the current one. The same is true for tens of thousands of quants on huggingface. You can request a requant with the current version if you wish, and I will consider it, but fuzzy whataboutism is not going to help. I wish I could just remake the petabyte of quants every time llama.cpp has a bugfix or improvement, but not being able to do that doesn't invalidate the older quants.
And if you want to go hunting and make a list of affected models, be my guest - I can try to requant them as well, hoping they get better.
Then we have a good window for when it must have been fixed - I usually update llama.cpp at least once per week. Or it's still buggy and the trigger conditions are just more complex - there are essentially no user-accessible knobs in the process.
I have requeued this model, to see if the current llama.cpp generates the same tokenizer output with the set-up from then. Should be done in a few hours.
The tokenizer in the new quant looks like the newer quants from richard. Since my quantizer script didn't change, and I used the same settings (which are recorded in the model card), this shows it's a bug in the older version of llama.cpp in use at the time. If you want to track down more models, I would suspect that other mistral or mixtral models might be good candidates. It's unlikely that a lot of models are affected, as at the time, the converter scripts were being heavily reworked for the llama 3 tokenizer issues.
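For anyone who wants to check other models: you don't need full inference to spot a mismatched tokenizer. A rough sketch, assuming the llama-cpp-python bindings (`pip install llama-cpp-python`); the paths and sample text are placeholders.

```python
# Sketch: tokenize the same text with two quants and compare the token IDs.
# vocab_only=True loads just the tokenizer/vocabulary, not the weights.
from llama_cpp import Llama

old = Llama(model_path="old-quant.Q4_K_M.gguf", vocab_only=True)
new = Llama(model_path="new-quant.Q4_K_M.gguf", vocab_only=True)

sample = "The quick brown fox jumps over the lazy dog.".encode("utf-8")
old_ids = old.tokenize(sample)
new_ids = new.tokenize(sample)

print("old:", old_ids)
print("new:", new_ids)
print("identical" if old_ids == new_ids else "tokenizers differ")
```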
(I did not measure the perplexity)
You can't use perplexity like that. You need to compare exactly the same model.
My version is probably done without quantising the output tensor, so likely has slightly higher quality and slightly larger filesize. All my older quants kept the output tensor unquantised.
Addendum: also, inferenceillusionist quantised the source twice, first to f16, then to q4_k_m (according to his model card), causing extra quality loss, while I only quantised once, preserving more fidelity to the original model.
Neither should make much of a difference in practice.
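As a sanity check, anyone can see for themselves whether the output tensor in a given quant was left unquantised - the tensor types are right in the GGUF metadata. A minimal sketch, again assuming the gguf-py GGUFReader API; the path is a placeholder.

```python
# Sketch: print the quantisation type of the output and embedding tensors.
from gguf import GGUFReader

reader = GGUFReader("model.Q4_K_M.gguf")
for t in reader.tensors:
    if t.name in ("output.weight", "token_embd.weight"):
        # F16/F32 means the tensor was left unquantised; Q6_K, Q4_K, ... means it was quantised
        print(f"{t.name}: {t.tensor_type.name}")
```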
You seem to confuse perplexity with quality. They are not the same. It's possible that the version of llama.cpp I used would create lower quality quantisations, but the facts are that Inf-I. quantized twice (which loses fidelity to the original model) and my version did not quantize the output tensor (which also guarantees higher fidelity to the original model). That explains the size differences and can also explain the insignificant perplexity differences, because the quants are not identical. This answers your question.
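To make the double-quantisation point concrete without any llama.cpp specifics: the toy example below (plain numpy, not llama.cpp's actual quantisation scheme) rounds the same tensor onto a crude 4-bit-style grid once directly from f32 and once via an intermediate f16 step, so you can see how much - or how little - the extra step adds to the error relative to the original values.

```python
# Toy illustration only: compare direct quantisation (f32 -> q) against
# two-step quantisation (f32 -> f16 -> q) on a random tensor.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)


def uniform_quant(v: np.ndarray) -> np.ndarray:
    """Crude symmetric uniform quantiser with 15 levels (a stand-in for a 4-bit quant)."""
    scale = np.abs(v).max() / 7.0
    return (np.round(v / scale) * scale).astype(np.float32)


direct = uniform_quant(x)                                           # f32 -> q
two_step = uniform_quant(x.astype(np.float16).astype(np.float32))   # f32 -> f16 -> q

print("mean abs error, direct  :", np.abs(direct - x).mean())
print("mean abs error, two-step:", np.abs(two_step - x).mean())
```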