Jared Sulzdorf PRO
jsulz's activity

Hey @mradermacher just wanted to let you know that we've begun onboarding you to Xet!
All new repos that you create will be Xet-enabled by default. We are still migrating existing repos, so at times you will see a mixture of LFS and Xet files side-by-side, but as the migration progresses everything will become Xet.
As I mentioned in my last message, none of this is an issue due to how we've designed the system for backward compatibility, but if you have any questions or concerns, please let me know. Otherwise, I'll follow up here once all your repos are migrated!

Inspired by Tiny Agents in JS from @julien-c , we ported the idea to Python and integrated it directly into huggingface_hub, with a built-in MCP Client and a Tiny Agents CLI.
TL;DR: With MCP (Model Context Protocol), you can expose tools like web search or image generation and connect them directly to LLMs. It's simple and surprisingly powerful.
pip install "huggingface_hub[mcp]>=0.32.0"
We wrote a blog post where we show how to run Tiny Agents, and dive deeper into how they work and how to build your own.
👉 https://huggingface.co/blog/python-tiny-agents

Woohoo!! Thanks for joining ❤️ I'll onboard you from the waitlist soon and follow up here when done.
Will do on the storage side - I'm also quite curious.
If you have any questions or feedback, don't hesitate to ping me here 🤗


We've been onboarding folks to Xet (https://huggingface.co/blog/xet-on-the-hub ). We know the backend can scale (Llama 4 and Qwen 3 are on Xet) and that it is great for working with quants (see xet-team/quantization-dedup ), and we're pushing on inviting impactful orgs and users on the Hub. You fit the bill.
We'd love to onboard you, get some feedback, and create some excitement 🎉
The steps are pretty straightforward - join the waitlist at hf.co/join/xet and we'll take care of the rest.
The system is fully backward compatible, so you shouldn't notice a thing. BUT to get the best experience when uploading/downloading, make sure you have hf_xet installed alongside the latest huggingface_hub.
What do you think?
Woohoo! Xet team member here. Thanks for signing up @mradermacher 🤗
The migration process should be seamless. Because of the way Xet supports backward compatibility (you can read about it here if you're interested: https://huggingface.co/docs/hub/storage-backends#backward-compatibility-with-lfs ), everyone will continue to be able to access the repos before, during, and after the migration.
I'll onboard you from the waitlist this week and then follow up once everything is moved over! If you have any questions, don't hesitate to follow up here and @ me, happy to keep supporting all the work you're doing :)

as you know, we're in the process of upgrading our storage backend to xet (which helps us scale and offer blazingly fast upload/download speeds too): https://huggingface.co/blog/xet-on-the-hub
now that we are certain the backend can scale even with big models like Llama 4/Qwen 3, we're moving to the next phase: inviting impactful orgs and users on the hub over. as you are a big part of the open source ML community, we would love to onboard you next and create some excitement about it in the community too!
in terms of actual steps, it should be as simple as one of the org admins joining the waitlist at hf.co/join/xet and we'll take care of the rest.
p.s. you'd need the latest huggingface_hub lib with hf_xet installed, but everything else should be the same: https://huggingface.co/docs/hub/storage-backends#using-xet-storage
p.p.s. this is fully backwards compatible so everything will work as it should! 🤗
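
For anyone who wants to check their setup before uploading, here is a minimal sketch that reports which versions of huggingface_hub and hf_xet are installed in the current environment. It only uses the standard library; the helper name `installed_version` is made up for this example and is not part of either package.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report on the two packages mentioned in the post above.
for name in ("huggingface_hub", "hf_xet"):
    version = installed_version(name)
    print(f"{name}: {version or 'not installed'}")
```

If either one shows up as missing or outdated, `pip install -U huggingface_hub hf_xet` should bring both up to date.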

💬 Qwen made it rain! They released Qwen3: new dense and MoE models ranging from 0.6B to 235B 🤯 as well as Qwen2.5-Omni, any-to-any model in 3B and 7B!
> Microsoft AI released Phi4 reasoning models (that also come in mini and plus sizes)
> NVIDIA released new CoT reasoning datasets
🖼️ ByteDance released UI-TARS-1.5, native multimodal UI parsing agentic model
> Meta released EdgeTAM, an on-device object tracking model (SAM2 variant)
🗣️ NVIDIA released parakeet-tdt-0.6b-v2, a smol 600M automatic speech recognition model
> Nari released Dia, a 1.6B text-to-speech model
> Moonshot AI released Kimi Audio, a new audio understanding, generation, conversation model
👩🏻‍💻 JetBrains released Mellum models in base and SFT for coding
> Tesslate released UIGEN-T2-7B, a new text-to-frontend-code model 🤩

C5 is a large-scale effort to heavily filter web-crawled data, as collected by the non-profit Common Crawl, to only documents that are Creative Commons-licensed such as cc-by-4.0 or public domain cc0. At this stage 150 billion tokens have been collected.
---
📄 data: BramVanroy/CommonCrawl-CreativeCommons
🧰 software: https://github.com/BramVanroy/CommonCrawl-CreativeCommons
---
</> To build C5, HTML pages are scrutinized and all links (if any) to CC licenses are collected, both in regular hyperlinks and in metadata. Additional data fields are included, such as "was the license found in the head?" or "if multiple licenses were found, do they contradict each other?", which makes further filtering a breeze.
🌐 In this first version of C5, 8 languages are included (Afrikaans, German, English, French, Frisian, Italian, Dutch and Spanish). The language set was limited for two reasons: computational and storage limitations, and a collaboration with GPT-NL, which requested CC data for these languages to train a Dutch-focused, copyright-conscious LLM. In total, this V1 release contains almost 150 thousand documents and 150 billion tokens. This data was neither filtered for quality nor deduplicated, so that you can decide for yourself how much data to keep. To give some quality indication, a dataset field is present to describe whether a document is included in the FineWeb(-2) datasets, which are of high quality.
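
To make the license-detection idea concrete, here is a rough sketch of how CC license links could be collected from a page. This is not the C5 implementation (that lives in the GitHub repo linked above); the class and function names, the regex, and the output fields are all assumptions made for illustration. In particular, "contradict" is simplified here to "more than one distinct license was found".

```python
import re
from html.parser import HTMLParser

# Assumed URL pattern for CC licenses and the CC0 public-domain dedication, e.g.
# https://creativecommons.org/licenses/by/4.0/ or .../publicdomain/zero/1.0/
CC_URL = re.compile(
    r"creativecommons\.org/(?:licenses/([a-z-]+)/(\d+\.\d+)|publicdomain/zero/(\d+\.\d+))",
    re.IGNORECASE,
)


class CCLicenseFinder(HTMLParser):
    """Collects (license_id, found_in_head) pairs from href/content attributes."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        attributes = dict(attrs)
        # <a href>, <link href> and <meta content> can all carry a license URL.
        for key in ("href", "content"):
            match = CC_URL.search(attributes.get(key) or "")
            if match:
                name, version, cc0_version = match.groups()
                if cc0_version:
                    license_id = f"cc0-{cc0_version}"
                else:
                    license_id = f"cc-{name.lower()}-{version}"
                self.found.append((license_id, self.in_head))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_cc_licenses(html: str) -> dict:
    """Summarize the CC licenses referenced by an HTML document."""
    finder = CCLicenseFinder()
    finder.feed(html)
    licenses = sorted({license_id for license_id, _ in finder.found})
    return {
        "licenses": licenses,
        "found_in_head": any(in_head for _, in_head in finder.found),
        # Simplifying assumption: any two distinct licenses count as a conflict.
        "licenses_disagree": len(licenses) > 1,
    }


page = (
    '<html><head><link rel="license" '
    'href="https://creativecommons.org/licenses/by/4.0/"></head>'
    '<body><a href="https://creativecommons.org/licenses/by-sa/4.0/">CC BY-SA</a>'
    "</body></html>"
)
print(find_cc_licenses(page))
```

Fields like these are what make downstream filtering cheap: you can, for instance, keep only documents where a single license was found in the head.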
🔍 More work needs to be done! Only 7 out of 100+ Common Crawl crawls have been processed so far. That's encouraging because it means there is a lot more Creative Commons data to be collected! But to get there I need help in terms of compute. The current processing was already heavily sponsored by the Flemish Supercomputer but more is needed. If you have the compute available and wish to collaborate in an open and transparent manner, please get in touch!

Llama Guard 4 is a new model to filter model inputs/outputs both text-only and image 🛡️ use it before and after LLMs/VLMs! meta-llama/Llama-Guard-4-12B
Prompt Guard 2 22M & 86M are smol models to prevent model jailbreaks and prompt injections ⚔ meta-llama/Llama-Prompt-Guard-2-22M meta-llama/Llama-Prompt-Guard-2-86M
Both come with a new release of transformers 🤗
Try the model right away 👉🏻https://github.com/huggingface/huggingface-llama-recipes/blob/main/llama_guard_4.ipynb
Read our blog to learn more and easily get started 👉🏻 https://huggingface.co/blog/llama-guard-4 🦙
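
The "before and after" pattern mentioned above can be sketched in a few lines. This is only a toy outline of the control flow: `generate` and `is_safe` are hypothetical callables standing in for your LLM/VLM and for a guard model such as Llama Guard 4, and the refusal message is made up. See the notebook and blog post for how to actually load and call the models.

```python
from typing import Callable

REFUSAL = "Sorry, I can't help with that."


def guarded_generate(
    prompt: str,
    generate: Callable[[str], str],
    is_safe: Callable[[str], bool],
) -> str:
    """Run the safety check on the input, then again on the model output."""
    if not is_safe(prompt):  # filter before the model sees the prompt
        return REFUSAL
    response = generate(prompt)
    if not is_safe(response):  # filter before the user sees the response
        return REFUSAL
    return response


# Toy stand-ins so the sketch runs without downloading any model.
def toy_is_safe(text: str) -> bool:
    return "forbidden" not in text.lower()


def toy_generate(prompt: str) -> str:
    return f"Echo: {prompt}"


print(guarded_generate("hello there", toy_generate, toy_is_safe))
print(guarded_generate("something forbidden", toy_generate, toy_is_safe))
```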