public mradermacher discussions

#5 · opened by mradermacher

for discussions not strictly related to daily quant business.

@tdh111 full quote, btw:

So, when using --pure, it may appear that one gets an improvement because the new method being tested happens to do better on exactly these tensors, but worse on many others. One gets excited about having improved things, but then in practice, with the high-impact tensors quantized with more bits in the quantization mix, suddenly the observed quality is lower than what one had before. Case in point, Q3_K_M with your PR often has a higher PPL than the existing quantization, despite being clearly better with --pure

I see your comments on the PR. How would you rate Ling as a model (in comparison to others you've liked)? I may want to run it myself.

@tdh111 It's awesome! I can highly recommend trying it. I already generated over 1 MB worth of text with it during the day I tried it. With over 6 tokens/second at Q4_K_M it runs super-fast on CPU, likely because it is a MoE. The model is highly intelligent and has deep knowledge of many topics, making it perfect for single-turn Q&A. It answers questions better than 70B models despite only having 28.8 billion activated parameters. It is clearly beating Nemotron 340B and is maybe even on par with Llama 3 405B. While it is less censored than many other foundational models, it is still censored to some degree, and because the model is so good I'm considering renting some GPUs to create an uncensored version of it once axolotl supports it.

@mradermacher wrote:

Besides, even compilade agrees that the improvements are only for a single model family, so even if the table were for Q3_K quants, it would not show otherwise.

What? I only agreed that my approach was slower. The improvements are biggest for Qwen-2.5-Coder-3B-Instruct, but it also improves results for other families, as @tdh111 noticed (in their comment in the other discussion) in the table in ik_llama.cpp/pull/295.

And my approach is a little more interpretable too, since it allows explicitly choosing the ranges of the scales within which an exhaustive cumulative search is made. The approach in ik_llama.cpp is not exhaustive in the ranges searched, and it is a bit harder to explain and formalize since it does a lot of things at once which have non-intuitive interactions. The first-order gradient search would not work without the (non-exhaustive) grid search done before.

The time in my approach is dominated by sorting the inverse scales, not by actually comparing their weighted squared errors. Either a better sorting algorithm or reducing the number of such scales would make it faster. I'm currently exploring both.
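To give a rough idea, here is a simplified sketch of that kind of search (this is not the actual PR code; the symmetric rounding q_i = round(id * x_i), the importance weights w, and the quantize_sweep name are just illustrative assumptions):

```python
# Simplified sketch of a cumulative sweep over candidate inverse scales for a
# symmetric quantization q_i = round(id * x_i) with q_i in [-nmax, nmax],
# minimizing the weighted squared error  E(d, q) = sum_i w_i * (x_i - d*q_i)^2.
# For a fixed assignment q, the best reconstruction scale is
#   d* = sum(w*x*q) / sum(w*q*q),
# and the resulting error depends only on the running sums sum(w*x*q) and
# sum(w*q*q), which can be updated in O(1) whenever a single q_i changes.
# The rounding of x_i flips exactly at id = (k + 0.5) / |x_i|, so sorting those
# thresholds (the dominant cost, as noted above) lets one sweep the assignments.
import numpy as np

def quantize_sweep(x, w, nmax):
    x = np.asarray(x, dtype=np.float64)
    w = np.asarray(w, dtype=np.float64)
    n = len(x)

    # (threshold inverse scale, element index) pairs, sorted by threshold
    thresholds = [((k + 0.5) / abs(x[i]), i)
                  for i in range(n) if x[i] != 0.0
                  for k in range(nmax)]
    thresholds.sort()

    q = np.zeros(n, dtype=np.int64)
    sxx = float(np.sum(w * x * x))
    sxq = 0.0   # sum(w * x * q)
    sqq = 0.0   # sum(w * q * q)

    best = (sxx, 0.0, q.copy())   # (error, scale d, assignment) for q = 0
    for _, i in thresholds:
        step = 1 if x[i] > 0 else -1          # q_i moves one level away from 0
        sxq -= w[i] * x[i] * q[i]
        sqq -= w[i] * q[i] * q[i]
        q[i] += step
        sxq += w[i] * x[i] * q[i]
        sqq += w[i] * q[i] * q[i]
        if sqq > 0:
            err = sxx - sxq * sxq / sqq       # min over d of E(d, q)
            if err < best[0]:
                best = (err, sxq / sqq, q.copy())
    return best  # (weighted error, scale d, quantized values)

if __name__ == "__main__":
    x = [0.9, -0.4, 0.15, -1.2, 0.05]
    w = [1.0, 2.0, 1.0, 0.5, 1.0]   # e.g. per-weight importance from an imatrix
    err, d, q = quantize_sweep(x, w, nmax=4)
    print(f"d = {d:.4f}, q = {q.tolist()}, weighted error = {err:.6f}")
```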

What? I only agreed that my approach was slower.

You are taking this out of context. First, I simply reported what ikawrakow said, and second, you are making the same mistake as tdh111 - the discussion was about model quant types, not tensor quant types.

I'd be interested in how you ended up here, though? Don't let yourself get used.

You are taking this out of context. First, I simply reported what ikawrakow said, and second, you are making the same mistake as tdh111 - the discussion was about model quant types, not tensor quant types.
[...]
Please, if you want to continue to argue this, argue it with him. Maybe he lied when he wrote that, maybe he changed his mind without saying so - I find it moot to discuss this - point being that I was right in what I reported, even if you clearly don't like it - but it's not my opinion, and I correctly reported what he wrote.

I don't really feel like responding with what is basically just what I said before, slightly restated. Either way, my point in alerting you to this was that an update like this is relevant to you, as you are one of the biggest quant makers, and you are nice enough to even allow requant requests on old repos (I don't have any plans to ask [as I'm capable of making quants myself; I just like using your imatrix.dat files], but others may).

I'd be interested in how you ended up here, though?

I'm a bit curious about that as well, but am pleasantly surprised.

Edit: I just realized my quotes had an @ and he uses the same username in both places. Sorry compilade, the @ mention was 100% unintentional; if I had realized, I would have edited the quote so that it did not notify you.

@tdh111 It's awesome! I can highly recommend trying it. I already generated over 1 MB worth of text with it during the day I tried it. With over 6 tokens/second at Q4_K_M it runs super-fast on CPU, likely because it is a MoE. The model is highly intelligent and has deep knowledge of many topics, making it perfect for single-turn Q&A. It answers questions better than 70B models despite only having 28.8 billion activated parameters. It is clearly beating Nemotron 340B and is maybe even on par with Llama 3 405B.

@nicoboss
I want to try it out, but I'm not sure when I will, as Deepseek-V3 (for speed) and Deepseek-R1 (for quality) are my current local choices [well, for real speed I run ~30B models on my GPU]. This may be slightly faster than V3 on my hardware, but I'm not sure it's by enough to warrant using it regularly over V3. [I literally finished my newest speed quant for V3 and have been running it since; it's so fun I haven't even benchmarked it, as I want to keep using it for inference.]

Also, by 1 MB of text do you mean all logits, some logits, just output tokens, or something else? I end up storing the output token and the 10 most probable tokens alongside their probabilities, and I also branch a lot, which means I can generate GBs of data rather quickly.
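Roughly speaking, a per-step record looks something like this (a simplified sketch, not my exact setup; the record_top_k helper and the JSONL layout are just for illustration):

```python
# Build one record per generated token: the sampled token plus the top-10
# candidate tokens with their probabilities, appended to a JSONL file.
# Assumes you can get the raw logit vector for each step from your runtime.
import json
import numpy as np

def record_top_k(logits, chosen_token_id, k=10):
    """One per-step record: sampled token plus the k most probable candidates."""
    logits = np.asarray(logits, dtype=np.float64)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                      # softmax over the vocabulary
    top = np.argsort(probs)[::-1][:k]         # indices of the k most probable tokens
    return {
        "token": int(chosen_token_id),
        "top_k": [{"id": int(t), "p": float(probs[t])} for t in top],
    }

# usage: append one JSON line per generated token
with open("run.jsonl", "a") as f:
    fake_logits = np.random.randn(32000)      # stand-in for a real logit vector
    rec = record_top_k(fake_logits, chosen_token_id=int(fake_logits.argmax()))
    f.write(json.dumps(rec) + "\n")
```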

While it is less censored than many other foundational models, it is still censored to some degree, and because the model is so good I'm considering renting some GPUs to create an uncensored version of it once axolotl supports it.

They fortunately do provide a base model; if you can make your own instruct-tuned uncensored version, I almost certainly would try that (as that sounds more interesting to me than their current censored instruct-tuned model).
