GAIR-ProX

https://gair-nlp.github.io/ProX/

AI & ML interests

NLP Research

Organization Card

GAIR-ProX, a subsidiary of GAIR, leads the 🫐 ProX project, which aims to make pre-training more efficient by using language models to refine corpus documents at scale. 🫐 ProX applies fine-grained operations (e.g., document-level filtering and chunk-level cleaning), implemented as scalable, executable programs, to improve pre-training data quality and ultimately produce more robust and efficient language models.
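
To make the idea of refinement "as executable programs" concrete, here is a minimal sketch of how document-level filtering and chunk-level cleaning could be expressed as small operations applied to text. All names here (`RemoveLines`, `Normalize`, `filter_document`, `clean_chunk`) are hypothetical and chosen for illustration; they are not the project's actual API, and in practice the keep/drop decision and the list of operations would come from a refining language model rather than being hard-coded.

```python
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass(frozen=True)
class RemoveLines:
    """Hypothetical chunk-level op: delete lines start..end (0-indexed, inclusive)."""
    start: int
    end: int


@dataclass(frozen=True)
class Normalize:
    """Hypothetical chunk-level op: replace a noisy substring with a clean one."""
    source: str
    target: str


Operation = Union[RemoveLines, Normalize]


def filter_document(doc: str, keep: bool) -> Optional[str]:
    """Document-level filtering: keep the whole document or drop it (None)."""
    return doc if keep else None


def clean_chunk(chunk: str, ops: List[Operation]) -> str:
    """Chunk-level cleaning: apply line removals, then string normalizations.

    Assumes the RemoveLines ranges do not overlap.
    """
    lines = chunk.split("\n")
    # Apply removals from the bottom up so earlier line indices stay valid.
    removals = sorted(
        (op for op in ops if isinstance(op, RemoveLines)),
        key=lambda op: op.start,
        reverse=True,
    )
    for op in removals:
        del lines[op.start:op.end + 1]
    text = "\n".join(lines)
    for op in ops:
        if isinstance(op, Normalize):
            text = text.replace(op.source, op.target)
    return text


if __name__ == "__main__":
    chunk = (
        "Home | Login | Share\n"
        "ProX refines data.\n"
        "It works  at scale.\n"
        "Copyright 2024"
    )
    ops = [RemoveLines(0, 0), RemoveLines(3, 3), Normalize("  ", " ")]
    if filter_document(chunk, keep=True) is not None:
        print(clean_chunk(chunk, ops))
```

Representing each edit as a small data object keeps the refinement step cheap to execute and easy to audit: the model only needs to emit a short program, and a plain interpreter applies it to the document.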

Read our technical report!

Collections 4

Models 14

gair-prox/web-chunk-refining-lm

Text Generation • 0.4B • Updated Oct 10, 2024 • 550 • 7

gair-prox/math-chunk-refining-lm

Text Generation • 0.4B • Updated Oct 10, 2024 • 8 • 1

gair-prox/math-doc-refining-lm

Text Generation • 0.8B • Updated Oct 10, 2024 • 6 • 2

gair-prox/web-doc-refining-lm

Text Generation • 0.4B • Updated Oct 10, 2024 • 16 • 5

gair-prox/RedPJ-ProX-1.7B

2B • Updated Oct 10, 2024 • 4 • 2

gair-prox/RedPJ-ProX-0.3B

0.4B • Updated Oct 10, 2024 • 6 • 3

gair-prox/C4-ProX-1.7B

2B • Updated Oct 10, 2024 • 7 • 1

gair-prox/CodeLlama-7B-ProXMath

Updated Oct 10, 2024 • 3 • 1

gair-prox/TinyLlama-1.1B-ProXMath

1B • Updated Oct 10, 2024 • 4 • 2

gair-prox/Llama-2-7B-ProXMath

Text Generation • Updated Oct 10, 2024 • 3 • 1

Datasets 5

gair-prox/DCLM-pro

Viewer • Updated Feb 15, 2025 • 366M • 3.29k • 12

gair-prox/RedPajama-pro

Viewer • Updated Sep 26, 2024 • 10.2M • 219 • 4

gair-prox/c4-pro

Viewer • Updated Sep 26, 2024 • 40.1M • 265 • 7

gair-prox/open-web-math-pro

Viewer • Updated Sep 26, 2024 • 2.58M • 350 • 12

gair-prox/FineWeb-pro

Viewer • Updated Sep 26, 2024 • 63.1M • 983 • 26