Bartowski (bartowski)

bartowski's activity

replied to their post 8 days ago

BF16 can't be offloaded to GPUs, so imatrix becomes slow to make :')

posted an update 13 days ago
Reposting from Twitter:

Just so you all know, I'll be on vacation and away from home for the next two weeks! I'm hoping to get on at least once a day to load up some quants, but I won't be as bleeding edge and on the ball :) Feel free to shoot me a message if you see one I should make!

In the meantime, if you need something bleeding edge, make sure to check out @MaziyarPanahi or @bullerwins, who both put out great work!
replied to their post 14 days ago

I suppose I should add that this is more valuable as a pseudo-comparison to bf16

Since bf16 can represent far smaller values in the range (-1, 1) than fp16 can, there is much debate as to whether it's safe to convert from bf16 to fp16, or whether you should keep bf16, or even upcast to fp32, in order to preserve the original quality of the model for as long as possible before quantizing to 8 bits

This test shows that fp16 is capable of representing 99.97% of the weights in an FP32 model precisely, so the difference is negligible

Additionally, the weights it can't represent are between -6e-5 and 6e-5; they're so small that they most likely don't contribute to the final output of the model and are relatively safe to prune
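If you want to sanity-check this yourself, here's a minimal sketch (assuming PyTorch; this is not the script referenced in the original post, and the example values are hand-picked) showing fp16's representable range and the roundtrip/isclose check:

```python
import torch

# fp16's exponent range: smallest positive normal ~6.1e-5, largest finite value 65504
print(torch.finfo(torch.float16).tiny)   # 6.1035e-05
print(torch.finfo(torch.float16).max)    # 65504.0

# Mimic a bf16-origin model stored as fp32, then roundtrip through fp16
w = torch.tensor([0.5, 1e-3, 1e-6], dtype=torch.bfloat16).float()
roundtrip = w.half().float()             # fp32 -> fp16 -> fp32
print(torch.isclose(w, roundtrip, rtol=1e-5, atol=1e-8))
# tensor([True, True, False]) - only the value far below 6e-5 loses precision
```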

posted an update 14 days ago
Decided to check how many weights in a 70B F32 model would be squashed when converted to F16 (spoiler: it's shockingly few)

The reason for this comparison is that it should represent the same percentage of squashing as bf16 to fp16

Had Claude make me a script, using the new Reflection-70B, and these are the results:

Total weights: 70553706496
Fully representable: 70530215524
Squashed: 23490972
Percentage squashed: 0.03%

0.03%!!!!

A couple of things to note: this uses a roundtrip of F32 -> F16 -> F32 and then torch.isclose to account for the rounding error that inevitably comes up when comparing floating-point numbers, but it uses VERY small tolerances (rtol=1e-5, atol=1e-8)

This also examines EVERY weight that was stored at F32, and for most layers somewhere between 0% and 0.03% of the weights were squashed, with no major outliers.

Overall, I feel even safer converting to F16 for llama.cpp; the extremely small number of weights that fall outside the range are likely so small that they don't actually play a role in the final output of the model at inference anyway.
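The original script isn't included in the post, but a rough reconstruction of the kind of check it describes (walking local safetensors shards, roundtripping every F32 tensor through F16, and counting torch.isclose failures; the model path is a placeholder) might look like this:

```python
import glob
import torch
from safetensors.torch import load_file

model_dir = "/models/Reflection-70B"  # hypothetical local path to the F32 shards
total = squashed = 0

for shard in sorted(glob.glob(f"{model_dir}/*.safetensors")):
    for name, w in load_file(shard).items():
        if w.dtype != torch.float32:
            continue
        rt = w.half().float()                              # F32 -> F16 -> F32 roundtrip
        bad = (~torch.isclose(w, rt, rtol=1e-5, atol=1e-8)).sum().item()
        total += w.numel()
        squashed += bad
        print(f"{name}: {bad / w.numel():.4%} squashed")

print(f"Total weights: {total}")
print(f"Fully representable: {total - squashed}")
print(f"Squashed: {squashed}")
print(f"Percentage squashed: {squashed / total:.2%}")
```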
replied to their post 22 days ago

also maybe there should be a new feature to be explicitly notified about new repositories

That would be amazing, probably for average users but especially for me; I sometimes stumble upon a model from a creator I enjoy that was uploaded days ago and that I somehow didn't notice

We will have to see if something like that is possible without cluttering up the profile pages too much. But we'll try.

That sounds awesome; you could even consider something like a toggle in the settings for "show this model on my page", and possibly a flag when using huggingface-cli or the HF Python API

I think we'll be doing a social features sprint soon and this is exactly the kind of feedback we need! Thank you so much!

Beautiful, I love this :D If you need feedback on anything specific, feel free to reach out - I'd love to be a guinea pig or just early eyes!

posted an update 22 days ago
@victor (is this the only way to "DM" on HF?)

Had a funny thought: would it be at all possible to rework what shows up on our personal HF page?

Picture this: I upload a model to an organization. Someone who follows me now has no idea that I've uploaded a model, or where, unless they also watch those repos (which also floods them with other notifications)

What if our main Hugging Face page were a collection of both the models we've uploaded directly to our profile and the models we've uploaded to organizations? That way it would all be contained in one central, followable location, and I wouldn't have to worry about losing followers if I suddenly wanted to upload to an organization.
replied to victor's post 27 days ago

Oh another big pain point: notifications

I would love to be able to subscribe to notifications for new models posted by people or organizations, but it's near impossible as it stands

replied to victor's post 28 days ago

I would love better filtering

First, I think sorting by "created" is broken, but I haven't checked on desktop recently

Second, I would love date filtering - e.g. show me only trending models that were posted or updated in the past 7 days

replied to clem's post 28 days ago

I'm happy to hear this too; money in the bank is good, but upward momentum makes it so much easier to justify investing in new technology and improving things!

posted an update about 1 month ago
So it turns out I've been spreading a bit of misinformation when it comes to imatrix in llama.cpp

It starts out true: imatrix runs the model against a corpus of text and tracks the activations of the weights to determine which are most important

However, what the quantization then does with that information is where I was wrong.

I think I made an accidental connection between imatrix and ExLlamaV2's measurement pass, where ExLlamaV2 decides how many bits to assign to which weights depending on the target BPW

Instead, what llama.cpp with imatrix does is attempt to select a scale for each quantization block that most accurately returns the important weights to their original values, i.e. minimizing the dequantization error weighted by the importance of the activations

The mildly surprising part is that it actually just does a relatively brute-force search: it picks a bunch of candidate scales, tries each one, and sees which results in the minimum error for the weights deemed important in that block

But yeah, it turns out the quantization scheme is always the same; it's just that choosing the scale has a bit more logic to it when you use imatrix

Huge shoutout to @compilade for helping me wrap my head around it - feel free to add/correct as well if I've messed something up
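To make that concrete, here's a toy sketch of the idea (my own illustration, not llama.cpp's actual code or block format): for a single block, brute-force a handful of candidate scales and keep whichever one minimizes the importance-weighted dequantization error:

```python
import numpy as np

def quantize_block(x, importance, n_bits=4, n_candidates=20):
    """Pick the scale that minimizes importance-weighted dequantization error."""
    qmax = 2 ** (n_bits - 1) - 1                         # e.g. 7 for a signed 4-bit grid
    base_scale = np.max(np.abs(x)) / qmax
    best = None
    for f in np.linspace(0.7, 1.3, n_candidates):        # brute-force search around the naive scale
        scale = base_scale * f
        q = np.clip(np.round(x / scale), -qmax - 1, qmax)
        err = np.sum(importance * (x - q * scale) ** 2)  # importance-weighted squared error
        if best is None or err < best[0]:
            best = (err, scale, q)
    return best[1], best[2].astype(np.int8)

rng = np.random.default_rng(0)
x = rng.normal(size=32).astype(np.float32)               # one block of 32 weights
imp = rng.random(32).astype(np.float32)                  # stand-in for imatrix importances
scale, q = quantize_block(x, imp)
print(scale, q[:8])
```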
replied to their post about 1 month ago

Much more difficult if you're trying to iterate, though - definitely an interesting final validation

replied to their post about 1 month ago

oh god dammit haha, i did not think of that possibility AT ALL 🤦

KL divergence is almost identical - though it's still upsetting that it's only "almost" - but yup, there are huge differences in the top p...

====== Perplexity statistics ======
Mean PPL(Q)                   :   6.339378 ±   0.038949
Mean PPL(base)                :   6.337070 ±   0.038896
Cor(ln(PPL(Q)), ln(PPL(base))):  99.99%
Mean ln(PPL(Q)/PPL(base))     :   0.000364 ±   0.000067
Mean PPL(Q)/PPL(base)         :   1.000364 ±   0.000067
Mean PPL(Q)-PPL(base)         :   0.002308 ±   0.000427

====== KL divergence statistics ======
Mean    KLD:   0.000005 ±   0.000001
Maximum KLD:   0.113848
99.9%   KLD:   0.000346
99.0%   KLD:   0.000055
Median  KLD:   0.000001
10.0%   KLD:  -0.000014
 5.0%   KLD:  -0.000021
 1.0%   KLD:  -0.000035
Minimum KLD:  -0.000120

====== Token probability statistics ======
Mean    Δp:  0.002 ± 0.000 %
Maximum Δp: 19.102%
99.9%   Δp:  0.417%
99.0%   Δp:  0.155%
95.0%   Δp:  0.067%
90.0%   Δp:  0.040%
75.0%   Δp:  0.010%
Median  Δp:  0.000%
25.0%   Δp: -0.007%
10.0%   Δp: -0.034%
 5.0%   Δp: -0.062%
 1.0%   Δp: -0.154%
 0.1%   Δp: -0.439%
Minimum Δp: -5.820%
RMS Δp    :  0.078 ± 0.016 %
Same top p: 99.927 ± 0.007 %
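For context, if I understand these metrics right, Δp is the change in the probability assigned to the correct token and "Same top p" is how often the two models agree on the top token. A rough sketch of how you could compute them yourself (p_base, p_q, and targets are assumed arrays here, not llama.cpp's internals):

```python
import numpy as np

def token_prob_stats(p_base, p_q, targets):
    """p_base, p_q: (n_tokens, vocab) probabilities; targets: the actual next-token ids."""
    idx = np.arange(len(targets))
    dp = (p_q[idx, targets] - p_base[idx, targets]) * 100           # Δp in percent
    same_top = (p_q.argmax(axis=1) == p_base.argmax(axis=1)).mean() * 100
    return {
        "Mean Δp": dp.mean(),
        "RMS Δp": np.sqrt(np.mean(dp ** 2)),
        "99.9% Δp": np.percentile(dp, 99.9),
        "Median Δp": np.median(dp),
        "Same top p": same_top,
    }
```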
replied to their post about 1 month ago

Either way, I appreciate the insight and now question all my life decisions, especially the ones that involved uploading fp32 files and spending 3x the time calculating imatrix on bf16 instead of fp16

replied to their post about 1 month ago

Just for my own curiosity, I ran my fp16 conversion vs the fp32 KLD base and got this:

====== Perplexity statistics ======
Mean PPL(Q)                   :   6.341096 ±   0.038970
Mean PPL(base)                :   6.337070 ±   0.038896
Cor(ln(PPL(Q)), ln(PPL(base))):  99.99%
Mean ln(PPL(Q)/PPL(base))     :   0.000635 ±   0.000085
Mean PPL(Q)/PPL(base)         :   1.000635 ±   0.000085
Mean PPL(Q)-PPL(base)         :   0.004026 ±   0.000543

====== KL divergence statistics ======
Mean    KLD:   0.000199 ±   0.000001
Maximum KLD:   0.066079
99.9%   KLD:   0.002797
99.0%   KLD:   0.001297
Median  KLD:   0.000126
10.0%   KLD:   0.000001
 5.0%   KLD:  -0.000001
 1.0%   KLD:  -0.000012
Minimum KLD:  -0.000115

====== Token probability statistics ======
Mean    Δp:  0.005 ± 0.001 %
Maximum Δp:  6.664%
99.9%   Δp:  2.699%
99.0%   Δp:  1.539%
95.0%   Δp:  0.794%
90.0%   Δp:  0.483%
75.0%   Δp:  0.108%
Median  Δp:  0.000%
25.0%   Δp: -0.098%
10.0%   Δp: -0.466%
 5.0%   Δp: -0.779%
 1.0%   Δp: -1.518%
 0.1%   Δp: -2.630%
Minimum Δp: -9.853%
RMS Δp    :  0.493 ± 0.002 %
Same top p: 99.106 ± 0.024 %

So it looks like there IS a difference, but I guess after you quantize there's just so much more noise that it's irrelevant (or, as you said, because "the scales for quantized data are always FP16")

The only thing left is to determine whether it matters at all for the imatrix, but that seems unlikely considering my Q4_K_M differences are at best statistical noise

replied to their post about 1 month ago

Alternatively, I suppose it's possible that any values between 0 and 6e-5 are so small that truncating them to 0 is effectively the same as leaving them at full precision - they're just so tiny that they don't change any perceived results (after quantization)

replied to their post about 1 month ago

Yeah, BF16 -> FP32 being lossless makes sense to me. I'm just surprised that BF16 -> FP16 -> Q8 is identical to BF16 -> FP32 -> Q8; unless ALL values are within that range, as you mentioned, I would expect at minimum some noise

I could probably find a way to check whether all the weights are in that interval, and if they are, that would mean fp16 is also lossless, I suppose

But basically you're suggesting that at the end of the day, whether I convert to FP32, BF16, or FP16 (assuming a BF16 origin), the arithmetic in llama.cpp will make it irrelevant?
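A small sketch of what that check could look like (assuming PyTorch and safetensors; the file path is a placeholder): count the weights whose magnitude falls outside fp16's normal range of roughly [6.1e-5, 65504], since those are the only ones a bf16 -> fp16 conversion could disturb:

```python
import torch
from safetensors.torch import load_file

weights = load_file("/models/Mistral-Nemo-Instruct-2407/model.safetensors")  # placeholder path
lo = torch.finfo(torch.float16).tiny    # ~6.1e-5, smallest normal fp16 value
hi = torch.finfo(torch.float16).max     # 65504.0, largest finite fp16 value

outside = total = 0
for name, w in weights.items():
    a = w.float().abs()
    outside += ((a != 0) & ((a < lo) | (a > hi))).sum().item()
    total += a.numel()

print(f"{outside} of {total} weights ({outside / total:.4%}) fall outside fp16's normal range")
```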

replied to their post about 1 month ago

Here's a table showing the main results:

| Metric | Q4_K_M from FP32 | Q4_K_M from FP16 | Q8_0 from FP32 | Q8_0 from FP16 |
|---|---|---|---|---|
| Mean PPL(Q) | 6.445459 ± 0.039767 | 6.445574 ± 0.039771 | 6.344932 ± 0.038989 | 6.344932 ± 0.038989 |
| Mean PPL(base) | 6.337070 ± 0.038896 | 6.337070 ± 0.038896 | 6.337070 ± 0.038896 | 6.337070 ± 0.038896 |
| Cor(ln(PPL(Q)), ln(PPL(base))) | 99.62% | 99.62% | 99.98% | 99.98% |
| Mean PPL(Q)/PPL(base) | 1.017104 ± 0.000548 | 1.017122 ± 0.000549 | 1.001241 ± 0.000131 | 1.001241 ± 0.000131 |
| Mean KLD | 0.018110 ± 0.000112 | 0.018119 ± 0.000114 | 0.000859 ± 0.000005 | 0.000859 ± 0.000005 |
| Maximum KLD | 3.371759 | 2.833701 | 0.377813 | 0.377813 |
| Median KLD | 0.009176 | 0.009167 | 0.000549 | 0.000549 |
| Mean Δp | -0.256 ± 0.010 % | -0.251 ± 0.010 % | -0.017 ± 0.002 % | -0.017 ± 0.002 % |
| RMS Δp | 3.966 ± 0.033 % | 3.978 ± 0.033 % | 0.848 ± 0.007 % | 0.848 ± 0.007 % |
| Same top p | 93.893 ± 0.062 % | 93.864 ± 0.062 % | 98.515 ± 0.031 % | 98.515 ± 0.031 % |
posted an update about 1 month ago
As some of you know, I try to convert models to either fp32 or bf16, depending on their size, before doing imatrix and quantization

Today I decided to see if that matters, and the results have me... for lack of a better word, perplexed

My setup:

Mistral Nemo Instruct 2407
- convert to FP32, calculate imatrix, quantize to Q8_0 and Q4_K_M
- convert to FP16, calculate imatrix, quantize to Q8_0 and Q4_K_M

I calculated the KLD base from the FP32 model:
./llama-perplexity -m /models/Mistral-Nemo-Instruct-2407-f32.gguf -f /training_data/wikitext-2-raw/wiki.test.raw --kl-divergence-base /training_data/mistral-nemo-f32.kld -ngl 35 -fa -sm row

then calculated the divergence itself for each like so:
./llama-perplexity -m /models/Mistral-Nemo-Instruct-2407-Q8_0.gguf -f /training_data/wikitext-2-raw/wiki.test.raw --kl-divergence-base /training_data/mistral-nemo-f32.kld --kl-divergence -ngl 50 -fa -sm row

Q4_K_M from fp16 and fp32 were similar, trading blows across the statistics - odd, since I expected fp32 to be strictly better, but it's not

Q8_0 is where things get weird. Despite the files being slightly different sizes, and the sha256sums of course being different, they get *completely identical* scores, down to 6 decimal places of precision on the statistics.

How is this possible? Is there something I don't understand about llama.cpp that makes it always convert to fp16 before it does quantization? Am I wasting time using FP32/BF16??
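For reference, here's a hedged sketch of llama.cpp's Q8_0 format as I understand it (blocks of 32 weights, one scale per block stored at fp16 precision, int8 quants). With the scale rounded to fp16 either way, fp32 and fp16 inputs that only differ below fp16 precision can easily end up producing the same quantized blocks:

```python
import numpy as np

def quantize_q8_0_block(x):                     # x: one block of 32 float weights
    d = np.abs(x).max() / 127.0                 # per-block scale
    q = np.zeros(32, dtype=np.int8)
    if d > 0:
        q = np.round(x / d).clip(-127, 127).astype(np.int8)
    return np.float16(d), q                     # the scale is *stored* as fp16

rng = np.random.default_rng(0)
block_f32 = rng.normal(scale=0.02, size=32).astype(np.float32)
block_f16 = block_f32.astype(np.float16).astype(np.float32)   # simulate the fp16 conversion

print(quantize_q8_0_block(block_f32))
print(quantize_q8_0_block(block_f16))   # differences, if any, are at most off-by-one rounding
```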