Essential-Web v1.0: 24T tokens of organized web data
Abstract
Essential-Web v1.0, a 24-trillion-token dataset annotated with a multi-category taxonomy, outperforms or is competitive with existing curated datasets across several domains using only simple filtering techniques.
Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets makes data pipelines costly and inaccessible. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5B-parameter model that achieves annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%), and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
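To make the filtering claim concrete, here is a minimal sketch of what such a SQL-style filter over the taxonomy annotations might look like in Python, streaming the dataset from HuggingFace. The taxonomy field names (`taxonomy`, `topic`, `quality`) and the threshold value are hypothetical placeholders, since the abstract does not specify the published schema.

```python
from datasets import load_dataset

# Stream Essential-Web v1.0 rather than downloading all 24T tokens locally.
ds = load_dataset("EssentialAI/essential-web-v1.0", split="train", streaming=True)

# Hypothetical taxonomy fields: the paper describes a twelve-category taxonomy
# covering topic, format, content complexity, and quality, but the column names
# and values below are illustrative assumptions, not the published schema.
def is_high_quality_medical(doc):
    tax = doc["taxonomy"]
    return tax["topic"] == "medical" and tax["quality"] >= 4

medical_subset = ds.filter(is_high_quality_medical)

# Inspect a few matching documents.
for doc in medical_subset.take(3):
    print(doc["text"][:200])
```

The same predicate could equally be expressed as a SQL `WHERE` clause over the dataset's Parquet shards; the point of the paper's claim is that a single row-level filter, with no model-based reranking, suffices to carve out a competitive domain subset.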