Papers
arxiv:2506.14111

Essential-Web v1.0: 24T tokens of organized web data

Published on Jun 17
· Submitted by Research-EAI on Jun 17
Abstract

Essential-Web v1.0, a 24-trillion-token dataset annotated with a twelve-category taxonomy, yields subsets that outperform or are competitive with existing datasets in various domains using simple filtering techniques.

AI-generated summary

Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines. We present Essential-Web v1.0, a 24-trillion-token dataset in which every document is annotated with a twelve-category taxonomy covering topic, format, content complexity, and quality. Taxonomy labels are produced by EAI-Distill-0.5b, a fine-tuned 0.5b-parameter model that achieves an annotator agreement within 3% of Qwen2.5-32B-Instruct. With nothing more than SQL-style filters, we obtain competitive web-curated datasets in math (-8.0% relative to SOTA), web code (+14.3%), STEM (+24.5%) and medical (+8.6%). Essential-Web v1.0 is available on HuggingFace: https://huggingface.co/datasets/EssentialAI/essential-web-v1.0
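The "SQL-style filters" mentioned above can be illustrated with a minimal sketch. The column names used here (`topic`, `complexity`, `quality`) are illustrative placeholders inspired by the taxonomy categories described in the abstract, not the actual Essential-Web v1.0 schema:

```python
import sqlite3

# Build a tiny in-memory table of annotated documents. In the real dataset,
# each document carries taxonomy labels produced by EAI-Distill-0.5b; the
# records and column names below are hypothetical stand-ins.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (doc_id TEXT, topic TEXT, fmt TEXT, complexity TEXT, quality REAL)"
)
conn.executemany(
    "INSERT INTO documents VALUES (?, ?, ?, ?, ?)",
    [
        ("d1", "math", "forum", "advanced", 0.92),
        ("d2", "medical", "article", "intermediate", 0.80),
        ("d3", "math", "article", "basic", 0.40),
        ("d4", "code", "tutorial", "advanced", 0.88),
    ],
)

# A domain-specific subset is then just a WHERE clause: here, math documents
# above a simple quality threshold.
math_subset = conn.execute(
    "SELECT doc_id FROM documents WHERE topic = 'math' AND quality >= 0.5"
).fetchall()
print(math_subset)
```

The point is that, once every document carries taxonomy labels, curating a domain dataset reduces to a declarative query rather than a bespoke filtering pipeline.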

Community


Amazing work!! 🔥

Great work!

Fantastic!

Impressive!!!!!


Models citing this paper 1

Datasets citing this paper 8


Spaces citing this paper 0


Collections including this paper 3