Common Pile v0.1
All resources related to Common Pile v0.1, an 8TB dataset of public domain and openly licensed text
Common Pile v0.1 Raw Data Collection 8TB of public domain and openly licensed text • 30 items • Updated 24 days ago • 14
Common Pile v0.1 Filtered Data Collection An LLM pre-training dataset produced by filtering and deduplicating the raw text collected in the Common Pile v0.1 • 31 items • Updated 24 days ago • 13
Comma v0.1 Artifacts Collection A collection of artifacts related to Comma v0.1—a 7B parameter LLM trained on public domain and openly licensed text • 3 items • Updated 24 days ago • 4
The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text Paper • 2506.05209 • Published 24 days ago • 42
Common Pile v0.1 Filtered Data
An LLM pre-training dataset produced by filtering and deduplicating the raw text collected in the Common Pile v0.1
common-pile/arxiv_abstracts_filtered • Viewer • Updated 24 days ago • 2.5M • 464 • 5
common-pile/arxiv_papers_filtered • Viewer • Updated 24 days ago • 309k • 882
common-pile/biodiversity_heritage_library_filtered • Viewer • Updated 24 days ago • 16.5M • 521
common-pile/caselaw_access_project_filtered • Viewer • Updated 24 days ago • 5.5M • 598 • 1
Common Pile v0.1 Raw Data
8TB of public domain and openly licensed text
common-pile/arxiv_abstracts • Viewer • Updated 24 days ago • 2.54M • 511 • 6
common-pile/arxiv_papers • Viewer • Updated 24 days ago • 317k • 455 • 3
common-pile/biodiversity_heritage_library • Viewer • Updated 24 days ago • 45.6M • 462 • 2
common-pile/caselaw_access_project • Viewer • Updated 24 days ago • 5.52M • 771 • 1
Comma v0.1 Artifacts
A collection of artifacts related to Comma v0.1, a 7B parameter LLM trained on public domain and openly licensed text
common-pile/comma-v0.1-1t • 7B • Updated 24 days ago • 2.9k • 20
common-pile/comma-v0.1-2t • 7B • Updated 24 days ago • 1.51k • 29
common-pile/comma_v0.1_training_dataset • Viewer • Updated 23 days ago • 784M • 12.8k • 31