Bartowski (bartowski)

bartowski's activity

replied to their post 8 days ago

BF16 can't be offloaded to GPUs, so imatrix becomes slow to make :')

posted an update 13 days ago
Reposting from Twitter:

Just so you all know, I'll be on vacation and away from home for the next two weeks! I'm hoping to get on at least once a day to load up some quants, but I won't be as bleeding edge and on the ball :) Feel free to shoot me a message if you see one I should make!

In the meantime, if you need something bleeding edge, make sure to check out @MaziyarPanahi or @bullerwins, who both put out great work!
replied to their post 14 days ago

I suppose I should add that this is more valuable as a pseudo-comparison to bf16

Since bf16 can represent far smaller values in the range (-1, 1) than fp16 can, there is much debate as to whether it's safe to convert from bf16 to fp16, or whether you should keep bf16, or even upcast to fp32, in order to preserve the original quality of the model for as long as possible before quantizing to 8 bits

This test shows that fp16 is capable of representing 99.97% of the weights in an FP32 model precisely, so the difference is negligible

Additionally, the weights it can't represent are between -6e-5 and 6e-5; they're so small that they most likely don't contribute to the final output of the model and are relatively safe to prune
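If you want to sanity-check this yourself, here's a minimal sketch (assuming PyTorch; this is not the script referenced in the original post, and the example values are hand-picked) showing fp16's representable range and the roundtrip/isclose check:

```python
import torch

# fp16's exponent range: smallest positive normal ~6.1e-5, largest finite value 65504
print(torch.finfo(torch.float16).tiny)   # 6.1035e-05
print(torch.finfo(torch.float16).max)    # 65504.0

# Mimic a bf16-origin model stored as fp32, then roundtrip through fp16
w = torch.tensor([0.5, 1e-3, 1e-6], dtype=torch.bfloat16).float()
roundtrip = w.half().float()             # fp32 -> fp16 -> fp32
print(torch.isclose(w, roundtrip, rtol=1e-5, atol=1e-8))
# tensor([True, True, False]) - only the value far below 6e-5 loses precision
```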

posted an update 14 days ago
Decided to check how many weights in a 70B F32 model would be squashed when converted to F16 (spoiler: it's shockingly few)

The reason for this comparison is that it should represent the same percentage of squashing as bf16 to fp16

Had Claude make me a script, using the new Reflection-70B, and these are the results:

Total weights: 70553706496
Fully representable: 70530215524
Squashed: 23490972
Percentage squashed: 0.03%

0.03%!!!!

A couple of things to note: this uses a roundtrip of F32 -> F16 -> F32 and then torch.isclose to account for the rounding error that inevitably comes up when comparing floating-point numbers, but it uses VERY small tolerances (rtol=1e-5, atol=1e-8)

This also examines EVERY weight that was stored at F32, and for most layers somewhere between 0% and 0.03% of the weights were squashed, with no major outliers.

Overall, I feel even safer converting to F16 for llama.cpp; the extremely small number of weights that fall outside the range are likely so small that they don't actually play a role in the final output of the model at inference anyway.
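The original script isn't included in the post, but a rough reconstruction of the kind of check it describes (walking local safetensors shards, roundtripping every F32 tensor through F16, and counting torch.isclose failures; the model path is a placeholder) might look like this:

```python
import glob
import torch
from safetensors.torch import load_file

model_dir = "/models/Reflection-70B"  # hypothetical local path to the F32 shards
total = squashed = 0

for shard in sorted(glob.glob(f"{model_dir}/*.safetensors")):
    for name, w in load_file(shard).items():
        if w.dtype != torch.float32:
            continue
        rt = w.half().float()                              # F32 -> F16 -> F32 roundtrip
        bad = (~torch.isclose(w, rt, rtol=1e-5, atol=1e-8)).sum().item()
        total += w.numel()
        squashed += bad
        print(f"{name}: {bad / w.numel():.4%} squashed")

print(f"Total weights: {total}")
print(f"Fully representable: {total - squashed}")
print(f"Squashed: {squashed}")
print(f"Percentage squashed: {squashed / total:.2%}")
```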
replied to their post 22 days ago

also maybe there should be a new feature to be explicitly notified about new repositories

That would be amazing, probably for average users but especially for me; I sometimes stumble upon a model from a creator I enjoy that was uploaded days ago and that I somehow didn't notice

We will have to see if something like that is possible without cluttering up the profile pages too much. But we'll try.

That sounds awesome; you could even consider something like a toggle in the settings for "show this model on my page", and possibly a flag when using huggingface-cli or the HF Python API

I think we'll be doing a social features sprint soon and this is exactly the kind of feedback we need! Thank you so much!

Beautiful, I love this :D If you need feedback on anything specific, feel free to reach out - I'd love to be a guinea pig or just early eyes!

posted an update 22 days ago
@victor (is this the only way to "DM" on HF?)

Had a funny thought: would it be at all possible to rework what shows up on our personal HF page?

Picture this: I upload a model to an organization. Someone who follows me now has no idea that I've uploaded a model, or where, unless they also watch those repos (which also floods them with other notifications)

What if our main Hugging Face page were a collection of both the models we've uploaded directly to our profile and the models we've uploaded to organizations? That way it would all be contained in one central, followable location, and I wouldn't have to worry about losing followers if I suddenly wanted to upload to an organization.
replied to victor's post 27 days ago

Oh another big pain point: notifications

I would love to be able to subscribe to notifications for new models posted by people or organizations, but it's near impossible as it stands

replied to victor's post 28 days ago

I would love better filtering

First, I think sorting by "created" is broken, but I haven't checked on desktop recently

Second, I would love date filtering - e.g. show me only trending models that were posted or updated in the past 7 days

replied to clem's post 28 days ago

I'm happy to hear this too; money in the bank is good, but upward momentum makes it so much easier to justify investing in new technology and improving things!

posted an update about 1 month ago
So it turns out I've been spreading a bit of misinformation when it comes to imatrix in llama.cpp

It starts out true: imatrix runs the model against a corpus of text and tracks the activations of the weights to determine which are most important

However, what the quantization then does with that information is where I was wrong.

I think I made an accidental connection between imatrix and ExLlamaV2's measurement pass, where ExLlamaV2 decides how many bits to assign to which weights depending on the target BPW

Instead, what llama.cpp with imatrix does is attempt to select a scale for each quantization block that most accurately returns the important weights to their original values, i.e. minimizing the dequantization error weighted by the importance of the activations

The mildly surprising part is that it actually just does a relatively brute-force search: it picks a bunch of candidate scales, tries each one, and sees which results in the minimum error for the weights deemed important in that block

But yeah, it turns out the quantization scheme is always the same; it's just that choosing the scale has a bit more logic to it when you use imatrix

Huge shoutout to @compilade for helping me wrap my head around it - feel free to add/correct as well if I've messed something up
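To make that concrete, here's a toy sketch of the idea (my own illustration, not llama.cpp's actual code or block format): for a single block, brute-force a handful of candidate scales and keep whichever one minimizes the importance-weighted dequantization error:

```python
import numpy as np

def quantize_block(x, importance, n_bits=4, n_candidates=20):
    """Pick the scale that minimizes importance-weighted dequantization error."""
    qmax = 2 ** (n_bits - 1) - 1                         # e.g. 7 for a signed 4-bit grid
    base_scale = np.max(np.abs(x)) / qmax
    best = None
    for f in np.linspace(0.7, 1.3, n_candidates):        # brute-force search around the naive scale
        scale = base_scale * f
        q = np.clip(np.round(x / scale), -qmax - 1, qmax)
        err = np.sum(importance * (x - q * scale) ** 2)  # importance-weighted squared error
        if best is None or err < best[0]:
            best = (err, scale, q)
    return best[1], best[2].astype(np.int8)

rng = np.random.default_rng(0)
x = rng.normal(size=32).astype(np.float32)               # one block of 32 weights
imp = rng.random(32).astype(np.float32)                  # stand-in for imatrix importances
scale, q = quantize_block(x, imp)
print(scale, q[:8])
```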
replied to their post about 1 month ago

Much more difficult if you're trying to iterate, though - definitely an interesting final validation

replied to their post about 1 month ago

oh god dammit haha, i did not think of that possibility AT ALL 🤦

KL divergence is almost identical - though it's still upsetting that it's only "almost" - but yup, there are huge differences in the top p...

====== Perplexity statistics ======
Mean PPL(Q)                   :   6.339378 ±   0.038949
Mean PPL(base)                :   6.337070 ±   0.038896
Cor(ln(PPL(Q)), ln(PPL(base))):  99.99%
Mean ln(PPL(Q)/PPL(base))     :   0.000364 ±   0.000067
Mean PPL(Q)/PPL(base)         :   1.000364 ±   0.000067
Mean PPL(Q)-PPL(base)         :   0.002308 ±   0.000427

====== KL divergence statistics ======
Mean    KLD:   0.000005 ±   0.000001
Maximum KLD:   0.113848
99.9%   KLD:   0.000346
99.0%   KLD:   0.000055
Median  KLD:   0.000001
10.0%   KLD:  -0.000014
 5.0%   KLD:  -0.000021
 1.0%   KLD:  -0.000035
Minimum KLD:  -0.000120

====== Token probability statistics ======
Mean    Δp:  0.002 ± 0.000 %
Maximum Δp: 19.102%
99.9%   Δp:  0.417%
99.0%   Δp:  0.155%
95.0%   Δp:  0.067%
90.0%   Δp:  0.040%
75.0%   Δp:  0.010%
Median  Δp:  0.000%
25.0%   Δp: -0.007%
10.0%   Δp: -0.034%
 5.0%   Δp: -0.062%
 1.0%   Δp: -0.154%
 0.1%   Δp: -0.439%
Minimum Δp: -5.820%
RMS Δp    :  0.078 ± 0.016 %
Same top p: 99.927 ± 0.007 %
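For context, if I understand these metrics right, Δp is the change in the probability assigned to the correct token and "Same top p" is how often the two models agree on the top token. A rough sketch of how you could compute them yourself (p_base, p_q, and targets are assumed arrays here, not llama.cpp's internals):

```python
import numpy as np

def token_prob_stats(p_base, p_q, targets):
    """p_base, p_q: (n_tokens, vocab) probabilities; targets: the actual next-token ids."""
    idx = np.arange(len(targets))
    dp = (p_q[idx, targets] - p_base[idx, targets]) * 100           # Δp in percent
    same_top = (p_q.argmax(axis=1) == p_base.argmax(axis=1)).mean() * 100
    return {
        "Mean Δp": dp.mean(),
        "RMS Δp": np.sqrt(np.mean(dp ** 2)),
        "99.9% Δp": np.percentile(dp, 99.9),
        "Median Δp": np.median(dp),
        "Same top p": same_top,
    }
```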
replied to their post about 1 month ago

Either way, I appreciate the insight and now question all my life decisions, especially the ones that involved uploading fp32 files and spending 3x the time calculating imatrix on bf16 instead of fp16

replied to their post about 1 month ago

Just for my own curiosity, I ran my fp16 conversion vs the fp32 KLD base and got this:

====== Perplexity statistics ======
Mean PPL(Q)                   :   6.341096 ±   0.038970
Mean PPL(base)                :   6.337070 ±   0.038896
Cor(ln(PPL(Q)), ln(PPL(base))):  99.99%
Mean ln(PPL(Q)/PPL(base))     :   0.000635 ±   0.000085
Mean PPL(Q)/PPL(base)         :   1.000635 ±   0.000085
Mean PPL(Q)-PPL(base)         :   0.004026 ±   0.000543

====== KL divergence statistics ======
Mean    KLD:   0.000199 ±   0.000001
Maximum KLD:   0.066079
99.9%   KLD:   0.002797
99.0%   KLD:   0.001297
Median  KLD:   0.000126
10.0%   KLD:   0.000001
 5.0%   KLD:  -0.000001
 1.0%   KLD:  -0.000012
Minimum KLD:  -0.000115

====== Token probability statistics ======
Mean    Δp:  0.005 ± 0.001 %
Maximum Δp:  6.664%
99.9%   Δp:  2.699%
99.0%   Δp:  1.539%
95.0%   Δp:  0.794%
90.0%   Δp:  0.483%
75.0%   Δp:  0.108%
Median  Δp:  0.000%
25.0%   Δp: -0.098%
10.0%   Δp: -0.466%
 5.0%   Δp: -0.779%
 1.0%   Δp: -1.518%
 0.1%   Δp: -2.630%
Minimum Δp: -9.853%
RMS Δp    :  0.493 ± 0.002 %
Same top p: 99.106 ± 0.024 %

So it looks like there IS a difference, but I guess after you quantize there's just so much more noise that it's irrelevant (or, as you said, because "the scales for quantized data are always FP16")

The only thing left is to determine whether it matters at all for the imatrix, but that seems unlikely considering my Q4_K_M differences are at best statistical noise

replied to their post about 1 month ago

Alternatively, I suppose it's possible that any values between 0 and 6e-5 are so small that truncating them to 0 is effectively the same as leaving them at full precision - they're just so tiny that they don't change any perceived results (after quantization)

replied to their post about 1 month ago

Yeah, BF16 -> FP32 being lossless makes sense to me. I'm just surprised that BF16 -> FP16 -> Q8 is identical to BF16 -> FP32 -> Q8; unless ALL values are within that range, as you mentioned, I would expect at minimum some noise

I could probably find a way to check whether all the weights are in that interval, and if they are, that would mean fp16 is also lossless, I suppose

But basically you're suggesting that at the end of the day, whether I convert to FP32, BF16, or FP16 (assuming a BF16 origin), the arithmetic in llama.cpp will make it irrelevant?
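A small sketch of what that check could look like (assuming PyTorch and safetensors; the file path is a placeholder): count the weights whose magnitude falls outside fp16's normal range of roughly [6.1e-5, 65504], since those are the only ones a bf16 -> fp16 conversion could disturb:

```python
import torch
from safetensors.torch import load_file

weights = load_file("/models/Mistral-Nemo-Instruct-2407/model.safetensors")  # placeholder path
lo = torch.finfo(torch.float16).tiny    # ~6.1e-5, smallest normal fp16 value
hi = torch.finfo(torch.float16).max     # 65504.0, largest finite fp16 value

outside = total = 0
for name, w in weights.items():
    a = w.float().abs()
    outside += ((a != 0) & ((a < lo) | (a > hi))).sum().item()
    total += a.numel()

print(f"{outside} of {total} weights ({outside / total:.4%}) fall outside fp16's normal range")
```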

replied to their post about 1 month ago

Here's a table showing the main results:

| Metric | Q4_K_M from FP32 | Q4_K_M from FP16 | Q8_0 from FP32 | Q8_0 from FP16 |
|---|---|---|---|---|
| Mean PPL(Q) | 6.445459 ± 0.039767 | 6.445574 ± 0.039771 | 6.344932 ± 0.038989 | 6.344932 ± 0.038989 |
| Mean PPL(base) | 6.337070 ± 0.038896 | 6.337070 ± 0.038896 | 6.337070 ± 0.038896 | 6.337070 ± 0.038896 |
| Cor(ln(PPL(Q)), ln(PPL(base))) | 99.62% | 99.62% | 99.98% | 99.98% |
| Mean PPL(Q)/PPL(base) | 1.017104 ± 0.000548 | 1.017122 ± 0.000549 | 1.001241 ± 0.000131 | 1.001241 ± 0.000131 |
| Mean KLD | 0.018110 ± 0.000112 | 0.018119 ± 0.000114 | 0.000859 ± 0.000005 | 0.000859 ± 0.000005 |
| Maximum KLD | 3.371759 | 2.833701 | 0.377813 | 0.377813 |
| Median KLD | 0.009176 | 0.009167 | 0.000549 | 0.000549 |
| Mean Δp | -0.256 ± 0.010 % | -0.251 ± 0.010 % | -0.017 ± 0.002 % | -0.017 ± 0.002 % |
| RMS Δp | 3.966 ± 0.033 % | 3.978 ± 0.033 % | 0.848 ± 0.007 % | 0.848 ± 0.007 % |
| Same top p | 93.893 ± 0.062 % | 93.864 ± 0.062 % | 98.515 ± 0.031 % | 98.515 ± 0.031 % |
posted an update about 1 month ago
As some of you know, I try to convert models to either fp32 or bf16, depending on their size, before doing imatrix and quantization

Today I decided to see if that matters, and the results have me... for lack of a better word, perplexed

My setup:

Mistral Nemo Instruct 2407
- convert to FP32, calculate imatrix, quantize to Q8_0 and Q4_K_M
- convert to FP16, calculate imatrix, quantize to Q8_0 and Q4_K_M

I calculated the KLD base from the FP32 model:
./llama-perplexity -m /models/Mistral-Nemo-Instruct-2407-f32.gguf -f /training_data/wikitext-2-raw/wiki.test.raw --kl-divergence-base /training_data/mistral-nemo-f32.kld -ngl 35 -fa -sm row

then calculated the divergence itself for each like so:
./llama-perplexity -m /models/Mistral-Nemo-Instruct-2407-Q8_0.gguf -f /training_data/wikitext-2-raw/wiki.test.raw --kl-divergence-base /training_data/mistral-nemo-f32.kld --kl-divergence -ngl 50 -fa -sm row

Q4_K_M from fp16 and fp32 were similar, trading blows across the statistics - odd, since I expected fp32 to be strictly better, but it's not

Q8_0 is where things get weird. Despite the files being slightly different sizes, and the sha256sums of course being different, they get *completely identical* scores, down to 6 decimal places of precision on the statistics.

How is this possible? Is there something I don't understand about llama.cpp that makes it always convert to fp16 before it does quantization? Am I wasting time using FP32/BF16??
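For reference, here's a hedged sketch of llama.cpp's Q8_0 format as I understand it (blocks of 32 weights, one scale per block stored at fp16 precision, int8 quants). With the scale rounded to fp16 either way, fp32 and fp16 inputs that only differ below fp16 precision can easily end up producing the same quantized blocks:

```python
import numpy as np

def quantize_q8_0_block(x):                     # x: one block of 32 float weights
    d = np.abs(x).max() / 127.0                 # per-block scale
    q = np.zeros(32, dtype=np.int8)
    if d > 0:
        q = np.round(x / d).clip(-127, 127).astype(np.int8)
    return np.float16(d), q                     # the scale is *stored* as fp16

rng = np.random.default_rng(0)
block_f32 = rng.normal(scale=0.02, size=32).astype(np.float32)
block_f16 = block_f32.astype(np.float16).astype(np.float32)   # simulate the fp16 conversion

print(quantize_q8_0_block(block_f32))
print(quantize_q8_0_block(block_f16))   # differences, if any, are at most off-by-one rounding
```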