Cosmopedia: how to create large-scale synthetic data for pre-training Large Language Models Mar 20 • 69
Introducing IDEFICS: An Open Reproduction of State-of-the-art Visual Language Model Aug 22, 2023 • 28
Huggy Lingo: Using Machine Learning to Improve Language Metadata on the Hugging Face Hub Aug 2, 2023 • 1
data-is-better-together/fineweb-c-progress Viewer • Updated about 6 hours ago • 668 • 556 • 2
librarian-bots/dataset_cards_with_metadata Viewer • Updated about 16 hours ago • 195k • 175 • 11
Article FineWeb2-C: Help Build Better Language Models in Your Language By davanstrien • 2 days ago • 10
Post 1512 Introducing FineWeb-C, a community-built dataset for improving language models in ALL languages. Inspired by FineWeb-Edu, the community is labelling the educational quality of texts for many languages. 318 annotators, 32K+ annotations, 12 languages, and growing! data-is-better-together/fineweb-c