Static quants should be pretty much identical, apart from differences caused by the version of llama.cpp used to make them. If there are significant differences, this is an upstream llama.cpp issue and must be reported there.
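If anyone wants to verify that for a concrete pair of quants, here is a minimal sketch that compares them tensor by tensor. It assumes the gguf-py package that ships with llama.cpp (`pip install gguf`) and its GGUFReader API; the file names are placeholders.

```python
# Sketch: compare two static GGUF quants tensor by tensor.
# Assumes the gguf-py package from the llama.cpp repo; paths are placeholders.
import hashlib

from gguf import GGUFReader


def tensor_digests(path: str) -> dict[str, tuple[str, str]]:
    """Map tensor name -> (quant type, sha256 of raw tensor data)."""
    reader = GGUFReader(path)
    return {
        t.name: (t.tensor_type.name, hashlib.sha256(t.data.tobytes()).hexdigest())
        for t in reader.tensors
    }


a = tensor_digests("model-A.Q4_K_M.gguf")
b = tensor_digests("model-B.Q4_K_M.gguf")

for name in sorted(set(a) | set(b)):
    if a.get(name) != b.get(name):
        print(f"differs: {name}  {a.get(name)}  vs  {b.get(name)}")
```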
A cursory look shows that the tokenizers are quite different. Unless the model changed over time, this would indicate an issue with llama.cpp's conversion script. Given that the llama.cpp developers think what we do here is useless, good luck getting that fixed.
Good news then, it's fixed.
There is no issue with my quants - they were made with an older version of llama.cpp that generated worse results than the current one. The same is true for tens of thousands of quants on huggingface. You can request a requant with the current version if you wish, and I will consider it, but fuzzy whataboutism is not going to help. I wish I could just remake the petabyte of quants every time llama.cpp has a bugfix or improvement, but not being able to do that doesn't invalidate the older quants.
And if you want to go hunting and make a list of affected models, be my guest - I can try to requant them as well, hoping they get better.
Then we have a good window for when it must have been fixed - I usually update llama.cpp at least once per week. Or it's still buggy and the trigger conditions are just more complex - there are essentially no user-accessible knobs in the process.
I have requeued this model, to see if the current llama.cpp generates the same tokenizer output with the set-up from then. Should be done in a few hours.
The tokenizer in the new quant looks like the newer quants from richard. Since my quantizer script didn't change, and I used the same settings (which are recorded in the model card), this shows it's a bug in the older version of llama.cpp in use at the time. If you want to track down more models, I would suspect that other mistral or mixtral models might be good candidates. It's unlikely that a lot of models are affected, as at the time, the converter scripts were being heavily reworked for the llama 3 tokenizer issues.
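For anyone who wants to check other models: you don't need full inference to spot a mismatched tokenizer. A rough sketch, assuming the llama-cpp-python bindings (`pip install llama-cpp-python`); the paths and sample text are placeholders.

```python
# Sketch: tokenize the same text with two quants and compare the token IDs.
# vocab_only=True loads just the tokenizer/vocabulary, not the weights.
from llama_cpp import Llama

old = Llama(model_path="old-quant.Q4_K_M.gguf", vocab_only=True)
new = Llama(model_path="new-quant.Q4_K_M.gguf", vocab_only=True)

sample = "The quick brown fox jumps over the lazy dog.".encode("utf-8")
old_ids = old.tokenize(sample)
new_ids = new.tokenize(sample)

print("old:", old_ids)
print("new:", new_ids)
print("identical" if old_ids == new_ids else "tokenizers differ")
```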
(I did not measure the perplexity)
You can't use perplexity like that. You need to compare exactly the same model.
My version is probably done without quantising the output tensor, so likely has slightly higher quality and slightly larger filesize. All my older quants kept the output tensor unquantised.
Addendum: also, inferenceillusionist quantised the source twice, first to f16, then to q4_k_m (according to his model card), causing extra quality loss, while I only quantised once, preserving more fidelity to the original model.
Neither should make much of a difference in practice.
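As a sanity check, anyone can see for themselves whether the output tensor in a given quant was left unquantised - the tensor types are right in the GGUF metadata. A minimal sketch, again assuming the gguf-py GGUFReader API; the path is a placeholder.

```python
# Sketch: print the quantisation type of the output and embedding tensors.
from gguf import GGUFReader

reader = GGUFReader("model.Q4_K_M.gguf")
for t in reader.tensors:
    if t.name in ("output.weight", "token_embd.weight"):
        # F16/F32 means the tensor was left unquantised; Q6_K, Q4_K, ... means it was quantised
        print(f"{t.name}: {t.tensor_type.name}")
```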
You seem to confuse perplexity with quality. They are not the same. It's possible that the version of llama.cpp I used would create lower quality quantisations, but the facts are that Inf-I. quantized twice (which loses fidelity to the original model) and my version did not quantize the output tensor (which also guarantees higher fidelity to the original model). That explains the size differences and can also explain the insignificant perplexity differences, because the quants are not identical. This answers your question.
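To make the double-quantisation point concrete without any llama.cpp specifics: the toy example below (plain numpy, not llama.cpp's actual quantisation scheme) rounds the same tensor onto a crude 4-bit-style grid once directly from f32 and once via an intermediate f16 step, so you can see how much - or how little - the extra step adds to the error relative to the original values.

```python
# Toy illustration only: compare direct quantisation (f32 -> q) against
# two-step quantisation (f32 -> f16 -> q) on a random tensor.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000).astype(np.float32)


def uniform_quant(v: np.ndarray) -> np.ndarray:
    """Crude symmetric uniform quantiser with 15 levels (a stand-in for a 4-bit quant)."""
    scale = np.abs(v).max() / 7.0
    return (np.round(v / scale) * scale).astype(np.float32)


direct = uniform_quant(x)                                           # f32 -> q
two_step = uniform_quant(x.astype(np.float16).astype(np.float32))   # f32 -> f16 -> q

print("mean abs error, direct  :", np.abs(direct - x).mean())
print("mean abs error, two-step:", np.abs(two_step - x).mean())
```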