Data quality is the new frontier for LLM performance.
Ultra-FineWeb: a high-quality bilingual dataset released by OpenBMB
openbmb/Ultra-FineWeb
✨ MIT License
✨ 1T English + 120B Chinese tokens
✨ Efficient model-driven filtering