FineWeb-C: A Community-Driven Dataset for Educational Quality Annotations in 122 Languages
Last year we launched FineWeb-C, a community-driven dataset focused on developing educational quality classifiers to help create better open LLMs in more languages. In this post, we're excited to share the resulting dataset built by the community.
tl;dr: There are now over 50,000 annotations across 122 languages in the FineWeb-C dataset, focused on identifying educational quality content on the web.
You can access the dataset on Hugging Face. You can load all of the annotations using the load_dataset function from the datasets library:
from datasets import load_dataset
dataset = load_dataset("data-is-better-together/fineweb-c")
or load a specific language configuration like this:
from datasets import load_dataset
dataset = load_dataset("data-is-better-together/fineweb-c", "tatar")
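Once a configuration is loaded, a natural first step is to see how the labels are distributed. The sketch below tallies labels over an iterable of rows. It makes several assumptions that are not confirmed in this post: that each row stores its annotations as a list of strings, that the field is called educational_value_labels, and that the label values look like those in the sample. The sample rows are made up for illustration, so check the dataset card for the real schema before relying on any of these names.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def count_labels(
    rows: Iterable[Mapping[str, Any]],
    label_field: str = "educational_value_labels",  # assumed field name
) -> Counter:
    """Tally every label found in `label_field` across all rows."""
    counts: Counter = Counter()
    for row in rows:
        # Rows without the field (or with an empty value) contribute nothing.
        counts.update(row.get(label_field) or [])
    return counts


# Hypothetical rows standing in for real dataset records.
sample_rows = [
    {"text": "first document", "educational_value_labels": ["Good", "Basic"]},
    {"text": "second document", "educational_value_labels": ["Basic"]},
    {"text": "third document", "educational_value_labels": []},
]

label_counts = count_labels(sample_rows)
for label, n in label_counts.most_common():
    print(f"{label}: {n}")
```

With the real data, the same function could be given one split of the loaded dataset in place of sample_rows, passing a different label_field if the schema differs.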
By the Numbers
- 465 contributors from around the world
- 58,185 total annotations
- 122 languages covered
- 14 Diamond contributors (1000+ annotations)
- 18 Gold contributors (500-999 annotations)
- 65 Silver contributors (100-499 annotations)
- 368 Bronze contributors (1-99 annotations)
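The tier thresholds above map directly onto a small lookup. The following is a minimal sketch rather than the code used for the actual leaderboard: the function name contributor_tier and the TIER_SIZES dictionary are hypothetical, while the thresholds and counts are taken from the list above. It also checks that the four tiers add up to the 465 contributors reported.

```python
def contributor_tier(annotation_count: int) -> str:
    """Return the tier name for a contributor's annotation count."""
    if annotation_count < 1:
        raise ValueError("a contributor has at least one annotation")
    if annotation_count >= 1000:
        return "Diamond"
    if annotation_count >= 500:
        return "Gold"
    if annotation_count >= 100:
        return "Silver"
    return "Bronze"


# Contributors per tier, as listed in this post.
TIER_SIZES = {"Diamond": 14, "Gold": 18, "Silver": 65, "Bronze": 368}

# The tiers partition the contributors, so the sizes sum to the reported total.
total_contributors = sum(TIER_SIZES.values())
print(total_contributors)

# Boundary values on either side of each threshold.
for count in (1, 99, 100, 499, 500, 999, 1000):
    print(count, contributor_tier(count))
```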
Building a Multilingual Dataset Together
The dataset expands upon the multilingual FineWeb2 dataset.
From Tatar (3,015 annotations) to Vietnamese (2,869), Danish (2,573) to Tigrinya (1,837), communities worldwide stepped up to label web content in their languages. Some languages attracted dozens of annotators, while others were driven by dedicated individuals contributing thousands of labels.
Top contributors like Stefan-it (4,614), tagayin (2,094), and hannayukhymenko (1,937) led the way, but every single annotation mattered in building this resource.
A Blueprint for the Future
FineWeb-C proves something important: communities can build the language resources they need. No single company understands the nuances of 122 languages, but collectively, we do!
The model of communities annotating data to train classifiers that filter web-scale datasets could transform how we build multilingual AI. We are excited to see similar efforts for domain-specific content, cultural context, or other dimensions of data quality.
While the main annotation effort has concluded, you can explore the dataset at data-is-better-together/fineweb-c. Join the Discord to connect with others using the data and share what you build.
We're excited to see how FineWeb-C will be used to improve educational content quality across languages and to see the community develop similar efforts in the future!
Acknowledgments
We want to thank all the contributors who made FineWeb-C possible. Your efforts in annotating content in your languages have created a valuable resource for the entire community. Special thanks to the top contributors who led the way, and to everyone who participated, no matter how many annotations you made.
Below is a preview of a Space where you can see the contributors and the number of annotations they made: