All resources related to Common Pile v0.1, an 8TB dataset of public domain and openly licensed text

The Common Pile
We are a group of researchers working together to collect and curate openly licensed and public domain data for training large language models. So far, we have released:
- The Common Pile v0.1, an 8TB dataset of text from over 30 diverse sources
- Our paper: The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
- Comma v0.1-1T and Comma v0.1-2T, 7B-parameter LLMs trained on text from the Common Pile v0.1
- The training data used for the Comma v0.1 models
- Our code for collecting data from each source
If you're interested in contributing, please open an issue on GitHub!
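
The datasets below can be pulled directly from the Hub. As a minimal sketch, the helper here extracts text from the first few records of any iterable dataset; the `"text"` column name is an assumption about the schema, so check each dataset's card before relying on it.

```python
# Sketch: sampling records from a Common Pile dataset.
# The "text" field name is an assumption, not confirmed by this page.
from itertools import islice

def first_texts(records, n, field="text"):
    """Return the `field` value from the first n records of any iterable of dicts."""
    return [r[field] for r in islice(records, n)]
```

With the Hugging Face `datasets` library installed (`pip install datasets`), this could be applied to one of the repos listed below in streaming mode, e.g. `first_texts(load_dataset("common-pile/uspto_filtered", split="train", streaming=True), 3)`, which avoids downloading the full dataset.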
The organization hosts 4 collections, 3 models, and 62 datasets, including:
- common-pile/youtube_filtered (986k rows)
- common-pile/youtube (1.13M rows)
- common-pile/wikiteam_filtered (10.2M rows)
- common-pile/wikiteam (552M rows)
- common-pile/wikimedia_filtered (12.9M rows)
- common-pile/wikimedia (78.1M rows)
- common-pile/uspto_filtered (14.4M rows)
- common-pile/uspto (16.2M rows)
- common-pile/usgpo_filtered (2.34M rows)
- common-pile/usgpo (3.75M rows)