
Ann Huang PRO

erinys

https://huggingface.co/erinys

AI & ML interests

None yet

Recent Activity

updated a Space about 1 month ago

xet-team/cas-analysis


Articles

Organizations

erinys's activity

updated a Space about 1 month ago

Running

📉

CAS Analysis

Visualize a day of global upload traffic on the Hub.

reacted to elliesleightholm's post with 🤗 about 1 month ago

Post

2766

I made a beginners guide to Hugging Face Spaces 🤗 I hope it's useful to some of you :)

YouTube video: https://www.youtube.com/watch?v=xqdTFyRdtjQ

Blog: https://www.marqo.ai/blog/how-to-create-a-hugging-face-space

8 replies

reacted to jsulz's post with 🔥 about 1 month ago

Post

2911

When the XetHub crew joined Hugging Face this fall, @erinys and I started brainstorming how to share our work to replace Git LFS on the Hub. Uploading and downloading large models and datasets takes precious time. That’s where our chunk-based approach comes in.

Instead of versioning files (like Git and Git LFS), we version variable-sized chunks of data. For the Hugging Face community, this means:

⏩ Only upload the chunks that changed.
🚀 Download just the updates, not the whole file.
🧠 We store your file as deduplicated chunks.

In our benchmarks, we found that using CDC to store iterative model and dataset versions led to transfer speedups of ~2x, but this isn’t just a performance boost. It’s a rethinking of how we manage models and datasets on the Hub.

We're planning to bring our new storage backend to the Hub in early 2025 - check out our blog to dive deeper, and let us know: how could this improve your workflows?

https://huggingface.co/blog/from-files-to-chunks
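
To make the chunk-based idea in the post above concrete, here is a toy Python sketch of content-defined chunking (CDC) feeding a deduplicating chunk store. It is not the Xet team's implementation: the gear-style rolling hash, the size limits, and the names `chunk`, `store`, and `restore` are all assumptions chosen for illustration.

```python
import hashlib
import random
from typing import Dict, List, Tuple

# Hypothetical parameters, picked only to keep the demo small.
MIN_CHUNK = 256       # never cut a chunk shorter than this
MAX_CHUNK = 8192      # always cut once a chunk reaches this size
BOUNDARY_BITS = 11    # cut when the top 11 hash bits are zero (~1 in 2048 positions)

# Fixed table of pseudo-random 32-bit values, one per byte value.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]


def chunk(data: bytes) -> List[bytes]:
    """Split data at boundaries decided by a rolling hash of recent bytes."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Each shift pushes older bytes out of the 32-bit window, so the
        # hash depends only on the most recent 32 bytes of content.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        at_boundary = length >= MIN_CHUNK and (h >> (32 - BOUNDARY_BITS)) == 0
        if at_boundary or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def store(data: bytes, cas: Dict[str, bytes]) -> Tuple[List[str], int]:
    """Add a file's chunks to a content-addressed store.

    Returns the ordered list of chunk hashes needed to rebuild the file
    and the number of bytes that were new to the store.
    """
    recipe = []
    uploaded = 0
    for piece in chunk(data):
        key = hashlib.sha256(piece).hexdigest()
        if key not in cas:
            cas[key] = piece
            uploaded += len(piece)
        recipe.append(key)
    return recipe, uploaded


def restore(recipe: List[str], cas: Dict[str, bytes]) -> bytes:
    """Rebuild a file from its ordered chunk hashes."""
    return b"".join(cas[key] for key in recipe)
```

Storing a second version of a file into the same `cas` dictionary only adds the chunks whose content changed. Because boundaries depend on the bytes themselves rather than on fixed offsets, an insertion in the middle of a file does not shift every later chunk the way fixed-size blocks would.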

reacted to reach-vb's post with 🚀🔥 about 1 month ago

Post

4329

What a brilliant week for Open Source AI!

Qwen 2.5 Coder by Alibaba - 0.5B / 1.5B / 3B / 7B / 14B / 32B (Base + Instruct) Code generation LLMs, with 32B tackling giants like Gemini 1.5 Pro, Claude Sonnet
Qwen/qwen25-coder-66eaa22e6f99801bf65b0c2f

LLM2CLIP from Microsoft - Leverage LLMs to train ultra-powerful CLIP models! Boosts performance over the previous SOTA by ~17%
microsoft/llm2clip-672323a266173cfa40b32d4c

Athene v2 Chat & Agent by NexusFlow - SoTA general LLM fine-tuned from Qwen 2.5 72B excels at Chat + Function Calling/ JSON/ Agents
Nexusflow/athene-v2-6735b85e505981a794fb02cc

Orca Agent Instruct by Microsoft - 1 million instruct pairs covering text editing, creative writing, coding, reading comprehension, etc - permissively licensed
microsoft/orca-agentinstruct-1M-v1

Ultravox by FixieAI - 70B/ 8B model approaching GPT4o level, pick any LLM, train an adapter with Whisper as Audio Encoder
reach-vb/ultravox-audio-language-model-release-67373b602af0a52b2a88ae71

JanusFlow 1.3 by DeepSeek - Next iteration of their Unified MultiModal LLM Janus with RectifiedFlow
deepseek-ai/JanusFlow-1.3B

Common Corpus by PleIAs - 2,003,039,184,047 multilingual, commercially permissive and high quality tokens!
PleIAs/common_corpus

I'm sure I missed a lot, can't wait for the next week!

Put down in comments what I missed! 🤗

reacted to maxiw's post with 🤗❤️ about 1 month ago

Post

4620

I was curious to see what people post here on HF so I created a dataset with all HF Posts: maxiw/hf-posts

Some interesting stats:

Top 5 Authors by Total Impressions:
-----------------------------------
@merve : 171,783 impressions (68 posts)
@fdaudens : 135,253 impressions (81 posts)
@singhsidhukuldeep : 122,591 impressions (81 posts)
@akhaliq : 119,526 impressions (78 posts)
@MonsterMMORPG : 112,500 impressions (45 posts)

Top 5 Users by Number of Reactions Given:
----------------------------------------
@osanseviero : 1278 reactions
@clem : 910 reactions
@John6666 : 899 reactions
@victor : 674 reactions
@samusenps : 655 reactions

Top 5 Most Used Reactions:
-------------------------
❤️: 7048 times
🔥: 5921 times
👍: 4856 times
🚀: 2549 times
🤗: 2065 times

10 replies

liked a dataset about 2 months ago

openfoodfacts/product-database

Viewer • Updated about 11 hours ago • 3.59M • 681 • 9

liked a Space 2 months ago

Running

📈

Hub Stats

updated a Space 2 months ago

Running

🔥

README

posted an update 2 months ago

Post

2151

🌍 Super cool visualization of global PUT requests to Hugging Face over 24 hours, coded by object size, thanks to @port8080 !

We're putting this analysis to work to help us architect a more geo-distributed system for the HF storage backend.

Originally shared on LinkedIn: https://www.linkedin.com/posts/ajitbanerjee_one-of-the-joys-of-working-on-the-xethub-activity-7252688424732614656-tFGD

upvoted an article 2 months ago

Article

How to optimize your data labelling project with custom interfaces

Oct 16 • 18

reacted to jsulz's post with 🔥 3 months ago

Post

1656

The Hugging Face Hub hosts over 1.5M Model, Dataset, and Space repositories. To scale to 10M+, the XetHub team (https://huggingface.co/xet-team) is replacing Git LFS with a new technology that improves storage and transfer capabilities with some future developer experience benefits to boot.

Thanks to @yuchenglow and @port8080 (for their analysis covering LFS usage from March 2022–Sept 2024), we now have insights into what we’re storing. Check out the Gradio app to explore:
- Storage growth over time
- File types over all repositories
- Some simple optimizations we're investigating

xet-team/lfs-analysis

New activity in xet-team/lfs-analysis 3 months ago

Compressed -> Deduped column header

#4 opened 3 months ago by erinys

liked a Space 3 months ago

Running

📈

Hub LFS Analysis

An analysis of LFS files on the Hub.

upvoted an article 3 months ago

Article

Improving Parquet Dedupe on Hugging Face Hub

Oct 5 • 31

New activity in xet-team/lfs-analysis 3 months ago

Suggested text changes

#1 opened 3 months ago by erinys

replied to their post 3 months ago

This is great feedback @John6666 - and I've seen your suggestions in the other thread as well. As a non-ML engineer myself, it's been really interesting to explore HF with fresh eyes! We're doing some early exploration on HF understandability and discoverability in our team - would you be open to chatting sometime about potential approaches? We'd love to get your feedback!

liked a dataset 3 months ago

argilla/FinePersonas-v0.1

Viewer • Updated 15 days ago • 42.1M • 6.6k • 372

posted an update 3 months ago

Post

1965

We shut down XetHub today after almost 2 years. What we learned from launching our Git-scaled product from scratch:
- Don't make me change my workflow
- Data inertia is real
- ML best practices are still evolving

Closing the door on our public product lets us focus on our new goal of scaling HF Hub's storage backend to improve devX for a larger community. We'd love to hear your thoughts on what experiences we can improve!

Read the full post: https://xethub.com/blog/shutting-down-xethub-learnings-and-takeaways

6 replies


Articles

Rearchitecting Hugging Face Uploads and Downloads

From Files to Chunks: Improving Hugging Face Storage Efficiency

Share your open ML datasets on Hugging Face Hub!
