FineWeb-C: A Community-Driven Dataset for Educational Quality Annotations in 122 Languages

Community Article Published July 8, 2025

Last year we launched FineWeb-C, a community-driven dataset focused on developing educational quality classifiers to help create better open LLMs in more languages. In this post, we're excited to share the resulting dataset built by the community.

tl;dr: There are now over 50,000 annotations across 122 languages in the FineWeb-C dataset, focused on identifying educational quality content on the web.

You can access the dataset on Hugging Face. You can load all of the annotations using the load_dataset function from the datasets library:

from datasets import load_dataset
dataset = load_dataset("data-is-better-together/fineweb-c")

or load a specific language configuration like this:

from datasets import load_dataset

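# "tatar" is the configuration name as written in this post; check the dataset
# card for the exact name of each language configuration.
dataset = load_dataset("data-is-better-together/fineweb-c", "tatar")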

By the Numbers

465 contributors from around the world
58,185 total annotations
122 languages covered
14 Diamond contributors (1000+ annotations)
18 Gold contributors (500-999 annotations)
65 Silver contributors (100-499 annotations)
368 Bronze contributors (1-99 annotations)
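
As a quick sanity check on the figures above, here is a minimal sketch that encodes the tier thresholds and confirms that the four tiers add up to the 465 contributors. The function name `contributor_tier` and the variable names are hypothetical, chosen for illustration; only the thresholds and counts come from the list above.

```python
def contributor_tier(n_annotations: int) -> str:
    """Map an annotation count to the tier names listed in this post."""
    if n_annotations < 1:
        raise ValueError("a contributor has at least one annotation")
    if n_annotations >= 1000:
        return "Diamond"
    if n_annotations >= 500:
        return "Gold"
    if n_annotations >= 100:
        return "Silver"
    return "Bronze"


# Contributor counts per tier, copied from the list above.
tier_counts = {"Diamond": 14, "Gold": 18, "Silver": 65, "Bronze": 368}

total_contributors = sum(tier_counts.values())
total_annotations = 58_185

# Average number of annotations per contributor.
mean_annotations = total_annotations / total_contributors

print(total_contributors)        # sums the four tiers
print(round(mean_annotations))   # rounded mean per contributor
```

The mean works out to roughly 125 annotations per contributor, yet about four in five contributors sit in the Bronze tier, which shows how much of the total comes from a small group of very active annotators.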

Building a Multilingual Dataset Together

The dataset expands upon the multilingual FineWeb2 dataset.

From Tatar (3,015 annotations) to Vietnamese (2,869), Danish (2,573) to Tigrinya (1,837), communities worldwide stepped up to label web content in their languages. Some languages attracted dozens of annotators, while others were driven by dedicated individuals contributing thousands of labels.

Top contributors like Stefan-it (4,614), tagayin (2,094), and hannayukhymenko (1,937) led the way, but every single annotation mattered in building this resource.

A Blueprint for the Future

FineWeb-C proves something important: communities can build the language resources they need. No single company understands the nuances of 122 languages, but collectively, we do!

The model of communities annotating data to train classifiers that filter web-scale datasets could transform how we build multilingual AI. We are excited to see similar efforts for domain-specific content, cultural context, or other dimensions of data quality.
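
To make the last step of that model concrete, here is a minimal sketch of filtering a stream of documents with a quality scorer. Everything in it is an assumption for illustration: `filter_by_quality`, `toy_score`, the keyword list, and the 0.5 threshold are hypothetical, and the keyword scorer is only a stand-in for a classifier trained on community annotations. It is not the FineWeb-C pipeline.

```python
from typing import Callable, Iterable, Iterator


def filter_by_quality(
    docs: Iterable[str],
    score: Callable[[str], float],
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield only the documents whose score meets the threshold."""
    for doc in docs:
        if score(doc) >= threshold:
            yield doc


def toy_score(doc: str) -> float:
    """Toy stand-in for a trained classifier: fraction of hint words present."""
    hints = ("explain", "lesson", "because")
    text = doc.lower()
    return sum(h in text for h in hints) / len(hints)


web_docs = [
    "Buy now! Limited offer!",
    "This lesson will explain why the sky is blue, because of scattering.",
]

# Keep only documents the (toy) scorer rates at or above the threshold.
kept = list(filter_by_quality(web_docs, toy_score, threshold=0.5))
print(kept)
```

Because the filter is a generator that takes the scorer as an argument, the same loop could in principle wrap any per-language classifier and process documents one at a time, without holding a whole web-scale dataset in memory.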

While the main annotation effort has concluded, you can explore the dataset at data-is-better-together/fineweb-c. Join the Discord to connect with others using the data and share what you build.

We're excited to see how FineWeb-C will be used to improve educational content quality across languages and to see the community develop similar efforts in the future!

Acknowledgments

We want to thank all the contributors who made FineWeb-C possible. Your efforts in annotating content in your languages have created a valuable resource for the entire community. Special thanks to the top contributors who led the way, and to everyone who participated, no matter how many annotations you made.

Below is a preview of a Space where you can see the contributors and the number of annotations they made:
