BigScience Data

non-profit

https://bigscience.huggingface.co

AI & ML interests

None defined yet.

Recent Activity

stellaathena authored a paper about 2 hours ago

Open Problems in Mechanistic Interpretability

christopher new activity 2 days ago

bigscience-data/sgpt-bloom-1b7-nli:Adding `safetensors` variant of this model

lvwerra authored a paper 13 days ago

Towards Best Practices for Open Datasets for LLM Training

bigscience-data's activity

stellaathena

authored a paper about 2 hours ago

Open Problems in Mechanistic Interpretability

Paper • 2501.16496 • Published 2 days ago • 7

christopher

in bigscience-data/sgpt-bloom-1b7-nli 2 days ago

Adding `safetensors` variant of this model

#2 opened 3 days ago by

SFconvertbot

lvwerra

authored a paper 13 days ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published 15 days ago • 51

guipenedo

authored a paper 13 days ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published 15 days ago • 51

Pclanglais

authored a paper 13 days ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published 15 days ago • 51

stellaathena

authored a paper 13 days ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published 15 days ago • 51

thomwolf

authored a paper 13 days ago

Towards Best Practices for Open Datasets for LLM Training

Paper • 2501.08365 • Published 15 days ago • 51

meg

posted an update 16 days ago

Post

2954

💫...And we're live!💫 Seasonal newsletter from ethicsy folks at Hugging Face, exploring the ethics of "AI Agents"
https://huggingface.co/blog/ethics-soc-7
Our analyses found:
- There's a spectrum of "agent"-ness
- *Safety* is a key issue, leading to many other value-based concerns
Read for details & what to do next!
With @evijit , @giadap , and @sasha

yjernite

posted an update 16 days ago

Post

2149

🤗👤 💻 Speaking of AI agents ...
...Is easier with the right words ;)

My colleagues @meg @evijit @sasha and @giadap just published a wonderful blog post outlining some of the main relevant notions with their signature blend of value-informed and risk-benefits contrasting approach. Go have a read!

https://huggingface.co/blog/ethics-soc-7

albertvillanova

posted an update 23 days ago

Post

1919

Discover all the improvements in the new version of Lighteval: https://huggingface.co/docs/lighteval/

lhoestq

authored a paper about 1 month ago

Croissant: A Metadata Format for ML-Ready Datasets

Paper • 2403.19546 • Published Mar 28, 2024 • 1

yjernite

posted an update about 2 months ago

Post

2190

🇪🇺 Policy Thoughts in the EU AI Act Implementation 🇪🇺

There is a lot to like in the first draft of the EU GPAI Code of Practice, especially as regards transparency requirements. The Systemic Risks part, on the other hand, is concerning for both smaller developers and for external stakeholders.

I wrote more on this topic ahead of the next draft. TLDR: more attention to immediate large-scale risks and to collaborative solutions supported by evidence can help everyone - as long as developers disclose sufficient information about their design choices and deployment contexts.

Full blog here, based on our submitted response with @frimelle and @brunatrevelin :

https://huggingface.co/blog/yjernite/eu-draft-cop-risks#on-the-proposed-taxonomy-of-systemic-risks

2 replies

lhoestq

posted an update about 2 months ago

Post

1772

Made an HF Dataset editor à la Google Sheets here: lhoestq/dataset-spreadsheets

With Dataset Spreadsheets:
✏️ Edit datasets in the UI
🔗 Share link with collaborators
🐍 Use locally in DuckDB or Python

Available for the 100,000+ parquet datasets on HF :)

thomwolf

posted an update about 2 months ago

Post

5059

We are proud to announce HuggingFaceFW/fineweb-2: A sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️languages.

We applied the same data-driven approach that led to SOTA English performance in🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.

We will very soon announce a big community project, and are working on a 📝 blogpost walking you through the entire dataset creation process. Stay tuned!

In the meantime, come ask us questions in our chat space: HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi

2 replies

christopher

posted an update about 2 months ago

Post

1635

The folks at Foursquare released a dataset of 104.5 million places of interest (foursquare/fsq-os-places), and here are all of them on a plot

4 replies

christopher

posted an update about 2 months ago

Post

2380

The Lichess database of games, puzzles, and engine evaluations is now on the Hub: https://huggingface.co/Lichess

Billions of chess data points to download, query, and stream and we're excited to see what you'll build with it! ♟️ 🤗

- Lichess/positions-datasets-66f50837db5cd3287d60d489
- Lichess/games-datasets-66f508df78f4b43e1bb2d353

thomwolf

posted an update about 2 months ago

Post

1401

Exponentially growing number of open-source AI models over the course of the past 30 months – from a few thousand to over 1 million

Interactive data viz: huggingface/open-source-ai-year-in-review-2024

thomwolf

posted an update about 2 months ago

Post

1475

Most liked and most downloaded open-source AI models from 2022 to 2024

Interactive viz: https://aiworld.eu/embed/model/model/treemap
Discussion: huggingface/open-source-ai-year-in-review-2024

loubnabnl

posted an update 2 months ago

Post

2071

Making SmolLM2 reproducible: open-sourcing our training & evaluation toolkit 🛠️ https://github.com/huggingface/smollm/

- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents

Apache 2.0 licensed. V2 pre-training data mix coming soon!

Which other tools should we add next?

thomwolf

posted an update 2 months ago

Post

1709

Interesting long read from @evanmiller-anthropic on taking a better-founded statistical approach to language model evaluations:
https://www.anthropic.com/research/statistical-approach-to-model-evals

Worth a read if you're into LLM evaluations!

Cc @clefourrier

1 reply

Team members 72
