Common Pile v0.1
All resources related to Common Pile v0.1, an 8TB dataset of public domain and openly licensed text

Common Pile v0.1 Raw Data (Collection): 8TB of public domain and openly licensed text • 30 items • Updated Jun 6 • 18
Common Pile v0.1 Filtered Data (Collection): An LLM pre-training dataset produced by filtering and deduplicating the raw text collected in the Common Pile v0.1 • 31 items • Updated Jun 6 • 17
Comma v0.1 Artifacts (Collection): A collection of artifacts related to Comma v0.1, a 7B parameter LLM trained on public domain and openly licensed text • 3 items • Updated Jun 6 • 4
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text (Paper) • 2506.05209 • Published Jun 5 • 46

Common Pile v0.1 Filtered Data
An LLM pre-training dataset produced by filtering and deduplicating the raw text collected in the Common Pile v0.1
common-pile/arxiv_abstracts_filtered • Viewer • Updated Jun 6 • 2.5M • 145 • 5
common-pile/arxiv_papers_filtered • Viewer • Updated Jun 6 • 309k • 257
common-pile/biodiversity_heritage_library_filtered • Viewer • Updated Jun 6 • 16.5M • 502 • 1
common-pile/caselaw_access_project_filtered • Viewer • Updated Jun 6 • 5.5M • 919 • 4

Common Pile v0.1 Raw Data
8TB of public domain and openly licensed text
common-pile/arxiv_abstracts • Viewer • Updated Jun 6 • 2.54M • 86 • 6
common-pile/arxiv_papers • Viewer • Updated Jun 6 • 317k • 603 • 10
common-pile/biodiversity_heritage_library • Viewer • Updated Jun 6 • 45.6M • 98 • 2
common-pile/caselaw_access_project • Viewer • Updated Jun 6 • 5.52M • 7.33k • 186

Comma v0.1 Artifacts
A collection of artifacts related to Comma v0.1, a 7B parameter LLM trained on public domain and openly licensed text
common-pile/comma-v0.1-1t • 7B • Updated Jun 6 • 1.13k • 24
common-pile/comma-v0.1-2t • 7B • Updated Jun 6 • 393 • 30
common-pile/comma_v0.1_training_dataset • Viewer • Updated Jun 6 • 784M • 3.84k • 31
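
The listing shows that each filtered dataset shares its raw counterpart's repository name with a `_filtered` suffix (for example `common-pile/arxiv_abstracts` and `common-pile/arxiv_abstracts_filtered`). A minimal sketch of how one might build these repository IDs and stream a few records with the Hugging Face `datasets` library follows. The helper names `repo_id` and `stream_records` are hypothetical, and the split name `"train"` is an assumption that is not stated on this page; check the dataset card before relying on it.

```python
from itertools import islice
from typing import Iterator

ORG = "common-pile"  # organization name as shown in the listing


def repo_id(source: str, filtered: bool = True) -> str:
    """Build a Hub repo ID from a source name (hypothetical helper).

    Follows the naming pattern visible in the listing:
    raw -> common-pile/<source>, filtered -> common-pile/<source>_filtered.
    """
    suffix = "_filtered" if filtered else ""
    return f"{ORG}/{source}{suffix}"


def stream_records(source: str, filtered: bool = True, limit: int = 3) -> Iterator[dict]:
    """Yield up to `limit` records without downloading the full dataset.

    Assumes the repository exposes a split named "train" (not confirmed here).
    """
    # Imported lazily so repo_id() works without the library installed.
    from datasets import load_dataset  # pip install datasets

    ds = load_dataset(repo_id(source, filtered), split="train", streaming=True)
    yield from islice(ds, limit)
```

Calling `stream_records("arxiv_abstracts")` would need network access to the Hub. Streaming is used here because the collections are large (the raw data is described as 8TB), so downloading an entire source just to inspect it is rarely practical.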