Join the conversation

Join the community of Machine Learners and AI enthusiasts.

Sign Up
BramVanroy 
posted an update 5 days ago
Post
423
Thanks to popular request, I've just added two subsets to the CommonCrawl-Creative Commons Corpus (C5; BramVanroy/CommonCrawl-CreativeCommons) so that you do not have to do filtering manually

- C5f ( BramVanroy/CommonCrawl-CreativeCommons-fine): only retains high-quality samples that are also present in FineWeb or FineWeb-2;
- C5r ( BramVanroy/CommonCrawl-CreativeCommons-recommended): additional strict filtering that removes samples with license disagreement, non-commercial licenses, and Wikipedia samples. The latter because you should probably get those from a more reliable source that provides better parsed content.

It goes without saying that these filters lead to a massive reduction in quantity. Doc and token counts are given on the dataset pages.
This comment has been hidden (marked as Spam)
This comment has been hidden (marked as Spam)
This comment has been hidden (marked as Spam)
·
This comment has been hidden (marked as Spam)