HuggingFaceFW/ablation-exp-textext-warc_trafilatura-28BT
Text Generation • 2B • Updated • 16 • 1
We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science)
FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale