FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language • Paper • 2506.20920 • Published Jun 26, 2025
How Programming Concepts and Neurons Are Shared in Code Language Models • Paper • 2506.01074 • Published Jun 1, 2025
Tracing Multilingual Factual Knowledge Acquisition in Pretraining • Paper • 2505.14824 • Published May 20, 2025
GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages • Paper • 2410.23825 • Published Oct 31, 2024
MEXA: Multilingual Evaluation of English-Centric LLMs via Cross-Lingual Alignment • Paper • 2410.05873 • Published Oct 8, 2024