Introducing FineMath: the best public math pre-training dataset with 50B+ tokens! HuggingFaceTB/finemath
Math remains challenging for LLMs, and by training on FineMath we see considerable gains over other math datasets, especially on GSM8K and MATH.
We built the dataset by:
- carefully extracting math data from Common Crawl;
- iteratively filtering and recalling high-quality math pages using a classifier trained on synthetic annotations to identify mathematical reasoning and deduction.
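For illustration, here is a minimal sketch of what the classifier-based filtering stage could look like, assuming a text-classification model fine-tuned on the synthetic annotations. The model id, label name, and threshold below are hypothetical placeholders, not the actual FineMath classifier:

```python
# Sketch of classifier-based filtering; "my-org/math-reasoning-classifier" and the
# "math_reasoning" label are hypothetical placeholders.
from transformers import pipeline

classifier = pipeline("text-classification", model="my-org/math-reasoning-classifier")

def keep_page(text: str, threshold: float = 0.9) -> bool:
    """Keep a crawled page only if the classifier is confident it contains math reasoning."""
    prediction = classifier(text, truncation=True)[0]
    return prediction["label"] == "math_reasoning" and prediction["score"] >= threshold

pages = [
    "Proof: the sum of two even integers 2a and 2b is 2(a + b), which is even.",
    "Sign up for our newsletter to get the latest celebrity gossip!",
]
math_pages = [page for page in pages if keep_page(page)]
```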
We conducted a series of ablations comparing the performance of Llama-3.2-3B-Base after continued pre-training on FineMath, and observed notable gains over both the baseline model and models trained on other public math datasets.
We hope this helps advance the performance of LLMs on math and reasoning! We're also releasing all the ablation models as well as the evaluation code.
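To try the dataset yourself, streaming a few samples with the `datasets` library looks roughly like this (the `finemath-4plus` config name and the `text` column are assumptions based on FineWeb-style conventions; check the dataset card for the exact names):

```python
from datasets import load_dataset

# Stream FineMath instead of downloading the full 50B+ tokens.
ds = load_dataset("HuggingFaceTB/finemath", "finemath-4plus", split="train", streaming=True)
for i, sample in enumerate(ds):
    print(sample["text"][:200], "\n---")
    if i == 2:
        break
```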
We applied the same data-driven approach that led to SOTA English performance in FineWeb to thousands of languages.
FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.
The dataset is released under the permissive ODC-By 1.0 license, and the code to reproduce it and our evaluations is public.
We will very soon announce a big community project, and we are working on a blog post walking you through the entire dataset creation process. Stay tuned!
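A quick way to peek at FineWeb2 is to stream a single language subset. The `HuggingFaceFW/fineweb-2` repo id and the `fra_Latn` config below are assumptions; see the dataset card for the exact identifiers:

```python
from datasets import load_dataset

# Stream the (assumed) French subset of FineWeb2 instead of downloading it.
fw2 = load_dataset("HuggingFaceFW/fineweb-2", name="fra_Latn", split="train", streaming=True)
print(next(iter(fw2))["text"][:200])
```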
The latest o1 model from OpenAI still can't correctly answer whether 9.11 > 9.9.
A possible explanation? Tokenization - and our latest work investigates how it affects a model's ability to do math!
In this blog post, we discuss:
- the different ways numbers are tokenized in modern LLMs;
- our detailed approach to comparing these various methods;
- how we got a free boost in arithmetic performance by adding a few lines of code to the base Llama 3 tokenizer;
- and a definitive, best tokenization method for math in LLMs!
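As a rough illustration of the schemes being compared (not the actual change we made to the Llama 3 tokenizer), here is how a number like 12345 gets split under single-digit, left-to-right three-digit, and right-to-left three-digit tokenization:

```python
# Illustration of three ways to split the digits of a number into tokens.
import re

def single_digit(num: str) -> list[str]:
    return list(num)

def l2r_three_digit(num: str) -> list[str]:
    # group digits left-to-right in chunks of three: "12345" -> ["123", "45"]
    return re.findall(r"\d{1,3}", num)

def r2l_three_digit(num: str) -> list[str]:
    # group digits right-to-left in chunks of three: "12345" -> ["12", "345"]
    reversed_chunks = re.findall(r"\d{1,3}", num[::-1])
    return [chunk[::-1] for chunk in reversed_chunks][::-1]

for fn in (single_digit, l2r_three_digit, r2l_three_digit):
    print(fn.__name__, fn("12345"))
# single_digit     ['1', '2', '3', '4', '5']
# l2r_three_digit  ['123', '45']
# r2l_three_digit  ['12', '345']
```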
Does tokenizing numbers into single digits outperform three-digit or BPE tokenization for arithmetic tasks? We explore various tokenization methods in our upcoming blog (releasing next week)!
Bringing objectivity to comparisons
Existing comparisons of number tokenization methods often ignore differences in the models' compute budgets: larger tokenizer vocabularies naturally lead to more parameters, which makes the comparison less objective because the bigger models simply do more "learning".
We addressed this by keeping architectures consistent but adjusting the number of hidden layers to produce roughly equal parameter counts.
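The idea in sketch form: estimate the parameter count from the config, then shrink the layer count of the larger-vocabulary model until the budgets roughly match. The formula and baseline numbers below are illustrative, not our exact ablation configs:

```python
# Rough parameter-count matching across tokenizer vocabulary sizes (illustrative only).

def transformer_params(vocab: int, hidden: int, layers: int, ffn_mult: int = 4) -> int:
    embed = vocab * hidden                      # token embeddings (assumed tied with LM head)
    attn = 4 * hidden * hidden                  # Q, K, V, O projections
    ffn = 2 * hidden * (ffn_mult * hidden)      # up + down projections
    return embed + layers * (attn + ffn)

target = transformer_params(vocab=32_000, hidden=2048, layers=24)  # baseline budget

# For a bigger vocabulary, reduce the number of layers until we fit the same budget.
layers = 24
while transformer_params(vocab=128_000, hidden=2048, layers=layers) > target:
    layers -= 1

print(f"matched config: {layers} layers, "
      f"{transformer_params(128_000, 2048, layers) / 1e9:.2f}B params vs {target / 1e9:.2f}B")
```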
Key results
We trained models on the same data mix and evaluated their performance on various arithmetic tasks (digits, operations, floats vs. ints):
- When splitting evals by operator, single-digit tokenization consistently outperformed the other methods.
- Right-to-left tokenization (which I covered in a previous post) matched or exceeded left-to-right approaches on all tasks.
All in all, single-digit tokenization beats the other methods, and, in line with our previous post's finding, R2L tokenization works better than L2R, although that gap is not as significant as the one between single-digit tokenization and the rest!
The wait is almost over: the full report is coming next week - stay tuned!
- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents
Apache 2.0 licensed. V2 pre-training data mix coming soon!
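As a small usage example, here is how you might peek at the smoltalk SFT dataset (the `all` config name and the `messages` column are assumptions; check the dataset card for the exact names):

```python
from datasets import load_dataset

# Stream one SmolTalk conversation; "all" config and "messages" column are assumptions.
smoltalk = load_dataset("HuggingFaceTB/smoltalk", "all", split="train", streaming=True)
example = next(iter(smoltalk))
for turn in example["messages"]:  # chat-style list of {"role": ..., "content": ...} dicts
    print(f'{turn["role"]}: {turn["content"][:80]}')
```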
How do I test an LLM for my unique needs? If you work in finance, law, or medicine, generic benchmarks are not enough. This blog post uses Argilla, distilabel, and lighteval to generate an evaluation dataset and evaluate models.
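To make the idea concrete, here is a minimal sketch of a domain-specific evaluation loop with a simple containment check on short answers. It only illustrates the concept, not the Argilla/distilabel/lighteval pipeline from the post; the model id and the tiny in-memory dataset are placeholders, and chat-style pipeline inputs assume a recent transformers version:

```python
# Minimal domain-specific evaluation sketch (containment match on short answers).
from transformers import pipeline

generator = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct")

domain_eval = [  # in practice, generated and curated with distilabel + Argilla
    {"question": "What does 'EBITDA' stand for?",
     "answer": "earnings before interest, taxes, depreciation and amortization"},
]

correct = 0
for item in domain_eval:
    messages = [{"role": "user", "content": item["question"] + " Answer briefly."}]
    output = generator(messages, max_new_tokens=64)[0]["generated_text"]
    # Chat-style inputs return the full message list; take the assistant's reply.
    reply = output[-1]["content"] if isinstance(output, list) else output
    correct += item["answer"].lower() in reply.lower()

print(f"accuracy: {correct / len(domain_eval):.2%}")
```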