All resources related to Common Pile v0.1, an 8TB dataset of public domain and openly licensed text

The Common Pile
We are a group of researchers working together to collect and curate openly licensed and public domain data for training large language models. So far, we have released:
- The Common Pile v0.1, an 8TB dataset of text from over 30 diverse sources
- Our paper: The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text
- Comma v0.1-1T and Comma v0.1-2T, 7B-parameter LLMs trained on text from the Common Pile v0.1
- The training data used for the Comma v0.1 models
- Our code for collecting data from each source
If you're interested in contributing, please open an issue on GitHub!
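
The datasets below can be pulled directly from the Hub. As a minimal sketch, the helper here extracts text from the first few records of any iterable dataset; the `"text"` column name is an assumption about the schema, so check each dataset's card before relying on it.

```python
# Sketch: sampling records from a Common Pile dataset.
# The "text" field name is an assumption, not confirmed by this page.
from itertools import islice

def first_texts(records, n, field="text"):
    """Return the `field` value from the first n records of any iterable of dicts."""
    return [r[field] for r in islice(records, n)]
```

With the Hugging Face `datasets` library installed (`pip install datasets`), this could be applied to one of the repos listed below in streaming mode, e.g. `first_texts(load_dataset("common-pile/uspto_filtered", split="train", streaming=True), 3)`, which avoids downloading the full dataset.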
The organization hosts 4 collections, 3 models, and 62 datasets, including:
- common-pile/youtube_filtered (986k rows)
- common-pile/youtube (1.13M rows)
- common-pile/wikiteam_filtered (10.2M rows)
- common-pile/wikiteam (552M rows)
- common-pile/wikimedia_filtered (12.9M rows)
- common-pile/wikimedia (78.1M rows)
- common-pile/uspto_filtered (14.4M rows)
- common-pile/uspto (16.2M rows)
- common-pile/usgpo_filtered (2.34M rows)
- common-pile/usgpo (3.75M rows)