Article nanoJAXGPT: A pedagogical introduction to JAX/Equinox By sachithgunasekara and 2 others • Oct 23, 2024 • 5
CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training Paper • 2504.13161 • Published Apr 17 • 92
Running 116 TxT360: Trillion Extracted Text 📖 Create a large-scale deduplicated text dataset for LLM training
Running 2.84k The Ultra-Scale Playbook 🌌 The ultimate guide to training LLM on large GPU Clusters
Running 68 Scaling FineWeb to 1000+ languages: Step 1: finding signal in 100s of evaluation tasks 📝 Evaluate multilingual models using FineTasks
Article Open-R1: a fully open reproduction of DeepSeek-R1 By eliebak and 2 others • Jan 28 • 877
Article Scaling AI-based Data Processing with Hugging Face + Dask By scj13 and 3 others • Oct 9, 2024 • 31
Running 1.01k FineWeb: decanting the web for the finest text data at scale 🍷 Generate high-quality web text data for LLM training