🍷 The FineWeb technical report is out, and so is 📚 FineWeb-Edu, a 1.3-trillion-token dataset that outperforms all other open web datasets, with remarkable improvements on educational benchmarks such as MMLU, ARC, and OpenBookQA.
Technical report: HuggingFaceFW/blogpost-fineweb-v1
Dataset: HuggingFaceFW/fineweb-edu
We used Llama 3 generations to train an educational quality classifier, filtering the 15 trillion tokens of FineWeb to select only those with high educational value (an approach also used in Llama 3 and Phi-3 training datasets). We're releasing both FineWeb-Edu and the classifier, along with a larger, less heavily filtered version containing 5.4 trillion tokens.
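To make the filtering idea concrete, here is a minimal sketch of score-based filtering. It is not the actual FineWeb-Edu pipeline: `toy_score` is a hypothetical keyword heuristic standing in for the trained classifier, and the thresholds are made-up values chosen only to show how a stricter cutoff yields a smaller subset and a looser one yields a larger subset.

```python
from typing import Callable, Iterable, Iterator


def filter_by_score(
    docs: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float,
) -> Iterator[str]:
    """Yield only the documents whose score is at least `threshold`."""
    for doc in docs:
        if score_fn(doc) >= threshold:
            yield doc


def toy_score(text: str) -> float:
    """Hypothetical stand-in for an educational quality classifier.

    Counts how many 'educational' keywords appear in the text. A real
    classifier would be a trained model, not a keyword list.
    """
    keywords = ("theorem", "experiment", "definition", "example")
    lowered = text.lower()
    return float(sum(word in lowered for word in keywords))


docs = [
    "Buy cheap watches now!",
    "Definition: a prime has exactly two divisors. Example: 7.",
    "The experiment measured boiling points.",
]

# Assumed thresholds, for illustration only.
strict_subset = list(filter_by_score(docs, toy_score, threshold=2.0))
loose_subset = list(filter_by_score(docs, toy_score, threshold=1.0))

print(len(strict_subset), len(loose_subset))
```

The same corpus and scorer produce both subsets; only the cutoff changes, which is the relationship between a heavily filtered release and a larger, less heavily filtered one.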
You can find more details about the dataset and the experiments we ran in the FineWeb technical report. It's a 45-minute read, but it contains all the secret sauce for building high-quality web datasets.
Enjoy!