AI & ML interests
None defined yet.
Recent Activity
Post
YAML engineering is becoming more important than ever, from infra provisioning to model training (recipes).
Here, I built a simple editor first for @dstackai, and I will share the live endpoint this week. Let me know what you think of this approach.
If people find this useful, I will do the same for LLM training recipes for popular frameworks such as Hugging Face open-r1, Axolotl, and so on. Let me hear your thoughts.
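To make that concrete, here is a hedged sketch of the kind of dstack config such an editor would target; the field names follow dstack's task schema as I understand it, so treat the exact keys as an assumption:

type: task           # a dstack "task" run (assumed schema)
name: train-recipe   # hypothetical run name
python: "3.11"
commands:
  - pip install -r requirements.txt
  - python train.py
resources:
  gpu: 24GB          # ask dstack for a GPU with at least 24GB of memory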
Post
Who's going to Raise Summit in Paris tomorrow?
If you're around, I would love to meet you :-)
Post
‼️Sentence Transformers v5.0 is out! The biggest update yet introduces Sparse Embedding models, encode method improvements, a Router module for asymmetric models & much more. Sparse + Dense = 🔥 hybrid search performance! Details:
1️⃣ Sparse Encoder Models
Brand new support for sparse embedding models that generate high-dimensional embeddings (30,000+ dims) where <1% are non-zero (usage sketch after this list):
- Full SPLADE, Inference-free SPLADE, and CSR architecture support
- 4 new modules, 12 new losses, 9 new evaluators
- Integration with @elastic-co , @opensearch-project , @NAVER LABS Europe, @qdrant , @IBM , etc.
- Decode interpretable embeddings to understand token importance
- Hybrid search integration to get the best of both worlds
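A minimal usage sketch, assuming the new SparseEncoder API from the release notes (the checkpoint name is just one example SPLADE model):

from sentence_transformers import SparseEncoder

# Load a SPLADE-style sparse embedding model (example checkpoint)
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")

# High-dimensional (vocab-sized) embeddings where <1% of entries are non-zero
embeddings = model.encode(["The weather is lovely today."])

# Decode back to (token, weight) pairs to inspect token importance
print(model.decode(embeddings[0], top_k=10))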
2️⃣ Enhanced Encode Methods & Multi-Processing
- New encode_query & encode_document methods that automatically use predefined prompts (sketch after this list)
- No more manual pool management - just pass device list directly to encode()
- Much cleaner and easier to use than the old multi-process approach
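Roughly, per the points above (the model name is a placeholder dense model):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Predefined query/document prompts are applied automatically
query_emb = model.encode_query("what is hybrid search?")
doc_embs = model.encode_document(["Hybrid search combines sparse and dense retrieval."])

# Multi-process / multi-GPU encoding: just pass a device list, no pool management
embs = model.encode(["text one", "text two"], device=["cuda:0", "cuda:1"])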
3️⃣ Router Module & Advanced Training
- Router module with different processing paths for queries vs documents (sketch after this list)
- Custom learning rates for different parameter groups
- Composite loss logging - see individual loss components
- Perfect for two-tower architectures
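A hedged sketch of a two-tower setup, assuming the Router.for_query_document helper described in the release notes (module composition details may differ):

from sentence_transformers import SentenceTransformer
from sentence_transformers.models import Pooling, Router, Transformer

# Separate processing paths for queries and documents
query_encoder = Transformer("distilbert-base-uncased")
doc_encoder = Transformer("distilbert-base-uncased")
router = Router.for_query_document(
    query_modules=[query_encoder, Pooling(query_encoder.get_word_embedding_dimension())],
    document_modules=[doc_encoder, Pooling(doc_encoder.get_word_embedding_dimension())],
)
model = SentenceTransformer(modules=[router])

# encode_query / encode_document pick the matching route
q = model.encode_query("capital of France?")
d = model.encode_document(["Paris is the capital of France."])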
4️⃣ Comprehensive Documentation & Training
- New Training Overview, Loss Overview, API Reference docs
- 6 new training example documentation pages
- Full integration examples with major search engines
- Extensive blogpost on training sparse models
Read the comprehensive blogpost about training sparse embedding models: https://huggingface.co/blog/train-sparse-encoder
See the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/v5.0.0
What's next? We would love to hear from the community! What sparse encoder models would you like to see? And what new capabilities should Sentence Transformers handle - multimodal embeddings, late interaction models, or something else? Your feedback shapes our roadmap!

kargaranamir authored a paper 18 days ago
Post
It's been a bit since I took a step back and looked at xet-team's progress migrating Hugging Face from Git LFS to Xet, but every time I do, it boggles the mind.
A month ago there were 5,500 users/orgs on Xet with 150K repos and 4PB. Today?
🤗 700,000 users/orgs
📈 350,000 repos
🚀 15PB
Meanwhile, our migrations have pushed throughput to numbers that are bonkers. In June, we hit upload speeds of 577Gb/s (crossing 500Gb/s for the first time).
These are hard numbers to put into context, but let's try:
The latest run of the Common Crawl from commoncrawl was 471 TB.
We now have ~32 crawls stored in Xet. At peak upload speed we could move the latest crawl into Xet in about two hours.
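A quick back-of-the-envelope check of that claim (decimal units; a rough sanity check, not a benchmark):

crawl_tb = 471    # latest Common Crawl, in terabytes
speed_gbps = 577  # peak upload throughput, in gigabits per second
seconds = crawl_tb * 8 * 1000 / speed_gbps  # TB -> gigabits, then divide by Gb/s
print(f"{seconds / 3600:.1f} hours")  # ~1.8 hours, i.e. "about two hours"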
We're moving to a new phase in the process, so stay tuned.
This shift in gears means it's also time to roll up our sleeves and look at all the bytes we have and the value we're adding to the community.
I already have some homework from @RichardErkhov to look at the dedupe across their uploads, and I'll be doing the same for other early adopters, big models/datasets, and frequent uploaders (looking at you @bartowski 👀)
Let me know if there's anything you're interested in; happy to dig in!


multimodalart posted an update 26 days ago
Post
Self-Forcing, a real-time video distilled model from Wan 2.1 by @adobe, is out, and they open-sourced it 🐐
I've built a live real-time demo on Spaces 📹💨
multimodalart/self-forcing

davanstrien posted an update about 1 month ago
Post
Inspired by Hugging Face's official MCP server, I've developed a complementary tool that exposes my semantic search API to enhance discovery across the HF platform.
Key capabilities:
- AI-powered semantic search for models and datasets
- Parameter count analysis via safetensors metadata (sketch at the end of this post)
- Trending content discovery
- Find similar models/datasets functionality
- 11 tools total for enhanced ecosystem navigation
The semantic search goes beyond simple keyword matching, understanding context and relationships between different models and datasets.
Example query: "Find around 10 reasoning Hugging Face datasets published in 2025 focusing on topics other than maths and science. Show a link and a short summary for each dataset." (results in video!)
https://github.com/davanstrien/hub-semantic-search-mcp
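As one concrete example, the parameter-count piece can be done with huggingface_hub's safetensors metadata endpoint. A sketch of that lookup, not the MCP tool itself (the repo id is just an example, and the printed dict is illustrative):

from huggingface_hub import get_safetensors_metadata

# Read safetensors headers server-side, without downloading any weights
metadata = get_safetensors_metadata("HuggingFaceTB/SmolLM2-1.7B")
print(metadata.parameter_count)  # e.g. {"BF16": 1711376384}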

Post
🙋🏻‍♂️ Hey there folks,
So at every bio/med/chem meeting I go to, I always get the same questions: "Why are you sharing a gdrive link with me for this?" and "Do you have any plans to publish your model weights and datasets on huggingface?" Today I finally got a good answer that explains everything:
Basically there is some kind of government censorship on this (USA, but I'm sure others too): researchers are told they are not allowed, as it is considered a "data leak", which is illegal!
This is terrible! But the good news is that we can do something about it!
There is a "call for opinions and comments" from the NIH (USA), where we can make our opinion on this topic known: https://osp.od.nih.gov/comment-form-responsibly-developing-and-sharing-generative-artificial-intelligence-tools-using-nih-controlled-access-data/
Kindly consider dropping your opinion and thoughts about this censorship of science, and share this post, link, or thoughts widely.
Together maybe we can start to share data and model weights appropriately and openly in a good way 🙏🏻🚀
cc. @cyrilzakka

KaraKaraWitch posted an update about 1 month ago
Post
"What's wrong with using huggingface transformers?"
Here's a quick example. Am I supposed to be going in with the full knowledge of the inner workings of a LLM model?
Here's a quick example. Am I supposed to be going in with the full knowledge of the inner workings of a LLM model?
import pathlib

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("<ModernBERT>")

# Triton is **required**, but nowhere in the documentation does it say that triton is needed.
# Installing triton on Windows isn't super straightforward. Thankfully someone has already built wheels for it.
# - https://github.com/woct0rdho/triton-windows/releases
model = AutoModelForSequenceClassification.from_pretrained(
    "<ModernBERT>",  # reference_compile=False
)

# By default it runs on CPU, which is slow. Move it to a CUDA device.
# This will actually error out if you use "gpu" instead.
model = model.to("cuda")

with torch.no_grad():
    # Not setting `return_tensors="pt"` causes
    #   File "C:\Program Files\Python310\lib\site-packages\transformers\modeling_utils.py", line 5311, in warn_if_padding_and_no_attention_mask
    #     if self.config.pad_token_id in input_ids[:, [-1, 0]]:
    #   TypeError: list indices must be integers or slices, not tuple
    # or...
    #   File "C:\Program Files\Python310\lib\site-packages\transformers\models\modernbert\modeling_modernbert.py", line 836, in forward
    #     batch_size, seq_len = input_ids.shape[:2]
    #   AttributeError: 'list' object has no attribute 'shape'
    block = tokenizer(
        pathlib.Path("test-fic.txt").read_text("utf-8"), return_tensors="pt"
    )
    block = block.to("cuda")
    # **block is needed to fix "AttributeError: 'NoneType' object has no attribute 'unsqueeze'"
    # on attention_mask.unsqueeze(-1)
    logits = model(**block).logits

# Not moving to CPU will cause the sigmoid/softmax ops to fail.
logits = logits.to("cpu")
# print(logits)
predicted_class_ids = torch.softmax(logits, -1)[0].numpy()

kargaranamir authored a paper about 1 month ago

kargaranamir authored a paper about 2 months ago

Reality123b posted an update about 2 months ago
Post
Does merging models count as creating a new model myself?
Post
With major model families like Qwen and all of Llama from meta-llama on Xet, the time is right for new users and organizations to say goodbye to LFS on the Hub.
Xet is now the default storage for new AI builders 🚀 🚀 🚀
Just sign up for an account, create a new model or dataset, pip install huggingface_hub, and you're off to the races!
Read more here https://huggingface.co/changelog/xet-default-for-new-users
And for everyone with existing repositories, just sign up here https://huggingface.co/join/xet - we'll migrate all existing repositories to Xet and all new repos you create will be Xet-backed by default.
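A minimal sketch of that new-user flow with huggingface_hub (the repo id and filename are placeholders):

from huggingface_hub import HfApi

api = HfApi()

# New repos are Xet-backed by default for new accounts
api.create_repo("your-username/my-model", repo_type="model")
api.upload_file(
    path_or_fileobj="model.safetensors",
    path_in_repo="model.safetensors",
    repo_id="your-username/my-model",
)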



mariagrandury authored 2 papers about 2 months ago
Post
🙋🏻‍♂️ Hey there folks,
Yesterday the world's first "Learn to Vibe Code" application was released.
As vibe coding is now the mainstream paradigm, the first educational app is here to support it.
You can try it out already: https://vibe.takara.ai
And of course it's entirely open source, so I already made my issue and feature branch :-) 🚀