BigCode Data

non-profit

BigCodeProject

bigcode-project

yjernite

authored 3 papers 17 days ago

The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources

Paper • 2406.16746 • Published Jun 24, 2024

In-House Evaluation Is Not Enough: Towards Robust Third-Party Flaw Disclosure for General-Purpose AI

Paper • 2503.16861 • Published Mar 21 • 1

A Different Approach to AI Safety: Proceedings from the Columbia Convening on Openness in Artificial Intelligence and AI Safety

Paper • 2506.22183 • Published Jun 27

yjernite

posted an update 28 days ago

First GPAI Model with EU Data Transparency Template? 🇪🇺

With the release of the EU data transparency template this week, we finally got to see one of the most meaningful artifacts to come out of the AI Act implementation so far (haven't you heard? AI's all about the data! 📊📚)

The impact of the template will depend on how effectively it establishes a minimum meaningful transparency standard for companies that don't otherwise offer any transparency into their handling of e.g. personal data or (anti?-)competitive practices in commercial licensing - we'll see how those play out as new models are released after August 2nd 👀

In the meantime, I wanted to see how the template works for a fully open-source + commercially viable model, so I filled it out for SmolLM3, which my colleagues at Hugging Face released earlier this month 🤗 ICYMI, it's fully open-source with 3B parameters and performance matching the best similar-size models (I've switched all my local apps from Qwen3 to it, you should too 💡)

Verdict: congrats to the European Commission AI Office for making it so straightforward! Fully open and transparent models remain a cornerstone of informed regulation and governance, but the different organizational needs of their developers aren't always properly accounted for in new regulation. In this case, it took me all of two hours to fill out and publish the template (including reading the guidelines) - so kudos for making it feasible for smaller and distributed organizations 🙌 Definitely a step forward for transparency 🔍

To learn more have a look at:

- The SmolLM3 model: HuggingFaceTB/SmolLM3-3B
- Its filled out Public Summary of Training Content: hfmlsoc/smollm3-eu-data-transparency
- And if you're interested, some previous remarks on regulatory minimum meaningful standards for data disclosure: https://huggingface.co/blog/yjernite/naiac-data-transparency

thomwolf

authored a paper about 2 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 69

lvwerra

authored a paper about 2 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 69

yjernite

posted an update 2 months ago

Congrats to the top trending dataset institutional/institutional-books-1.0!

This is a fantastic example of large-scale curation of public domain books with intentional governance for AI research and use - definitely recommend checking it out, experimenting with the metadata (institutional/institutional-books-1.0-metadata), and starting to build on top of it 🤗

loubnabnl

authored a paper 3 months ago

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

Paper • 2506.05209 • Published Jun 5 • 46

thomwolf

authored a paper 3 months ago

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Paper • 2506.01844 • Published Jun 2 • 128

joanrodai

authored 4 papers 3 months ago

InsightBench: Evaluating Business Analytics Agents Through Multi-Step Insight Generation

Paper • 2407.06423 • Published Jul 8, 2024

UI-Vision: A Desktop-centric GUI Benchmark for Visual Perception and Interaction

Paper • 2503.15661 • Published Mar 19 • 2

StarFlow: Generating Structured Workflow Outputs From Sketch Images

Paper • 2503.21889 • Published Mar 27 • 1

Rendering-Aware Reinforcement Learning for Vector Graphics Generation

Paper • 2505.20793 • Published May 27 • 11

loubnabnl

posted an update 3 months ago

SmolVLM is now available on PocketPal — you can run it offline on your smartphone to interpret the world around you. 🌍📱

And check out this real-time camera demo by @ngxson, powered by llama.cpp:
https://github.com/ngxson/smolvlm-realtime-webcam
https://x.com/pocketpal_ai

lewtun

authored a paper 4 months ago

Kimina-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning

Paper • 2504.11354 • Published Apr 15 • 6

joanrodai

authored a paper 4 months ago

Distilling semantically aware orders for autoregressive image generation

Paper • 2504.17069 • Published Apr 23 • 7

yjernite

posted an update 4 months ago

Today in Privacy & AI Tooling - introducing a nifty new tool to examine where data goes in open-source apps on 🤗

HF Spaces have tons (100Ks!) of cool demos leveraging or examining AI systems - and because most of them are OSS we can see exactly how they handle user data 📚🔍

That requires actually reading the code though, which isn't always easy or quick! Good news: code LMs have gotten pretty good at automatic review, so we can offload some of the work - here I'm using Qwen/Qwen2.5-Coder-32B-Instruct to generate reports and it works pretty OK 🙌

The app works in four stages:
1. Download all code files
2. Use the Code LM to generate a detailed report pointing to code where data is transferred/(AI-)processed (screen 1)
3. Summarize the app's main functionality and data journeys (screen 2)
4. Build a Privacy TLDR with those inputs
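
The steps listed above can be sketched roughly as follows. This is a minimal illustration and not the actual code of the Space: the function names, the prompts, the set of file suffixes, and the `lm` callable (any function that maps a prompt string to generated text, for example a wrapper around a hosted code LM) are all assumptions made for the example.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical type: any function that maps a prompt string to generated text.
LM = Callable[[str], str]

# Assumed set of file types to treat as code; the real app may differ.
CODE_SUFFIXES = {".py", ".js", ".ts", ".html"}


def collect_code_files(root: str) -> Dict[str, str]:
    """Step 1: read every code file under `root` into memory."""
    files = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in CODE_SUFFIXES:
            files[str(path.relative_to(root))] = path.read_text(errors="replace")
    return files


def privacy_review(files: Dict[str, str], lm: LM) -> Dict[str, str]:
    """Steps 2-4: detailed report, functionality summary, then a TLDR."""
    code = "\n\n".join(f"# file: {name}\n{text}" for name, text in files.items())
    # Step 2: ask the LM where data is transferred or processed (illustrative prompt).
    report = lm("Point to code where user data is transferred or AI-processed:\n" + code)
    # Step 3: ask for the app's main functionality and data journeys.
    summary = lm("Summarize the app's main functionality and data journeys:\n" + code)
    # Step 4: build the short privacy summary from the two previous outputs.
    tldr = lm("Write a short privacy TLDR from these inputs:\n" + report + "\n" + summary)
    return {"report": report, "summary": summary, "tldr": tldr}
```

Passing the LM in as a plain callable keeps the sketch independent of any particular inference API, so the same flow can be tried with a stub function before wiring in a real model.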

It comes with a bunch of pre-reviewed apps/Spaces, great to see how many process data locally or through (private) HF endpoints 🤗

Note that this is a POC, lots of exciting work to do to make it more robust, so:
- try it: yjernite/space-privacy
- reach out to collab: yjernite/space-privacy

thomwolf

posted an update 4 months ago

If you've followed the progress of robotics in the past 18 months, you've likely noticed how robotics is increasingly becoming the next frontier that AI will unlock.

At Hugging Face—in robotics and across all AI fields—we believe in a future where AI and robots are open-source, transparent, and affordable; community-built and safe; hackable and fun. We've had so much mutual understanding and passion working with the Pollen Robotics team over the past year that we decided to join forces!

You can already find our open-source humanoid robot platform Reachy 2 on the Pollen website and the Pollen community and people here on the hub at

pollen-robotics

We're so excited to build and share more open-source robots with the world in the coming months!

thomwolf

authored a paper 5 months ago

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7 • 197

lewtun

authored a paper 5 months ago

SmolVLM: Redefining small and efficient multimodal models

Paper • 2504.05299 • Published Apr 7 • 197

Team members 16

bigcode-data's activity