pietrolesci's Collections
Interesting Pre-Training Datasets
Updated 21 days ago
Zyphra/Zyda-2 • Updated Dec 12, 2024 • 1.62B rows • 97.4k downloads • 82 likes
Note: Look at the preprocessing code: https://github.com/Zyphra/Zyda_processing
HuggingFaceTB/dclm-edu • Updated Mar 7 • 1B rows • 43.3k downloads • 24 likes
HuggingFaceFW/fineweb-edu • Updated Jan 31 • 3.3B rows • 260k downloads • 663 likes
HuggingFaceTB/stack-edu • Updated 27 days ago • 167M rows • 3.4k downloads • 31 likes
HuggingFaceTB/finemath • Updated Feb 6 • 48.3M rows • 7.75k downloads • 303 likes
bigcode/the-stack • Updated Apr 13, 2023 • 546M rows • 10.7k downloads • 790 likes
bigcode/the-stack-v2 • Updated Apr 23, 2024 • 5.45B rows • 3.5k downloads • 353 likes
HuggingFaceTB/smollm-corpus • Updated Sep 6, 2024 • 237M rows • 14.5k downloads • 325 likes
mlfoundations/dclm-baseline-1.0 • Updated Jul 22, 2024 • 701k downloads • 216 likes