HuggingFaceFW

AI & ML interests

None defined yet.

Recent Activity

anton-l updated a dataset 5 days ago
HuggingFaceFW/fineweb-edu
guipenedo updated a Space 7 days ago
HuggingFaceFW/blogpost-fineweb-v1
guipenedo updated a dataset 7 days ago
HuggingFaceFW/fineweb

HuggingFaceFW's activity

anton-l
posted an update 6 days ago
Introducing 📐 FineMath: the best public math pre-training dataset with 50B+ tokens!
HuggingFaceTB/finemath

Math remains challenging for LLMs, and training on FineMath gives considerable gains over other math datasets, especially on GSM8K and MATH.

We build the dataset by:
🛠️ carefully extracting math data from Common Crawl;
🔎 iteratively filtering and recalling high-quality math pages using a classifier trained on synthetic annotations to identify math reasoning and deduction.
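
If you just want to poke at the data, here is a minimal sketch (not the production pipeline) that streams a FineMath subset and keeps only high-scoring pages. The "finemath-4plus" config name and the "text"/"score" column names are assumptions here, so double-check the dataset card for the exact schema.

```python
# Minimal sketch, not the production pipeline: stream a FineMath subset and
# keep only pages with a high classifier score.
# Assumptions: the "finemath-4plus" config and the "text"/"score" columns
# exist as named here -- verify against the dataset card.
from datasets import load_dataset

ds = load_dataset(
    "HuggingFaceTB/finemath",
    "finemath-4plus",   # assumed config name
    split="train",
    streaming=True,     # avoid downloading the full dump
)

for i, page in enumerate(ds):
    if page.get("score", 0) >= 4:      # assumed classifier-score column
        print(page["text"][:200])      # assumed text column
    if i >= 20:
        break
```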

We conducted a series of ablations, continuing the pre-training of Llama-3.2-3B-Base on FineMath, and observed notable gains compared to the baseline model and other public math datasets.

We hope this helps advance the performance of LLMs on math and reasoning! 🚀
We're also releasing all the ablation models as well as the evaluation code.

HuggingFaceTB/finemath-6763fb8f71b6439b653482c2
thomwolf
posted an update 16 days ago
We are proud to announce HuggingFaceFW/fineweb-2: a sparkling update to HuggingFaceFW/fineweb with 1000s of 🗣️ languages.

We applied the same data-driven approach that led to SOTA English performance in 🍷 FineWeb to thousands of languages.

🥂 FineWeb2 has 8TB of compressed text data and outperforms other multilingual datasets in our experiments.

The dataset is released under the permissive 📜 ODC-By 1.0 license, and the 💻 code to reproduce it and our evaluations is public.
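
A minimal sketch of pulling a single language subset instead of the full 8TB dump; the per-language config name ("fra_Latn") follows the usual <lang>_<script> convention and is an assumption here, so see the dataset card for the exact list of configs.

```python
# Minimal sketch: stream one language subset of FineWeb 2.
# The "fra_Latn" config name (French, Latin script) is assumed -- check the
# dataset card for the exact list of language configs.
from datasets import load_dataset

fw2 = load_dataset(
    "HuggingFaceFW/fineweb-2",
    name="fra_Latn",
    split="train",
    streaming=True,
)

for i, doc in enumerate(fw2):
    print(doc["text"][:120].replace("\n", " "))   # "text" column assumed
    if i >= 4:
        break
```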

We will very soon announce a big community project, and we are working on a 📝 blog post walking you through the entire dataset creation process. Stay tuned!

In the meantime, come ask us questions in our chat space: HuggingFaceFW/discussion

H/t @guipenedo @hynky @lvwerra as well as @vsabolcec Bettina Messmer @negar-foroutan and @mjaggi
garrethlee
posted an update 19 days ago
The latest o1 model from OpenAI still can't correctly answer whether 9.11 > 9.9 🤔

A possible explanation? Tokenization - and our latest work investigates how it affects a model's ability to do math!

In this blog post, we discuss:
🔢 The different ways numbers are tokenized in modern LLMs
🧪 Our detailed approach to comparing these methods
🥪 How we got a free boost in arithmetic performance by adding a few lines of code to the base Llama 3 tokenizer
👑 ...and the definitive best tokenization method for math in LLMs!

Check out our work here: huggingface/number-tokenization-blog
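
To see the issue for yourself, here is a tiny illustration (not taken from the blog post) of how a plain BPE tokenizer splits decimals; GPT-2's tokenizer is used only because it is small and ungated.

```python
# Quick illustration (not from the blog post): a BPE tokenizer can split
# "9.11" and "9.9" into differently shaped token sequences, so the model
# never sees aligned digit strings when comparing them.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # small, ungated BPE tokenizer

for s in ["9.9", "9.11", "1234567"]:
    print(f"{s!r:>11} -> {tok.tokenize(s)}")
# "9.11" may come out as ['9', '.', '11'] while "9.9" is ['9', '.', '9'],
# so "compare the fractional parts" is not a digit-by-digit operation
# from the model's point of view.
```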
thomwolf
posted an update 19 days ago
thomwolf
posted an update 21 days ago
garrethlee
posted an update 27 days ago
Does tokenizing numbers into single digits outperform three-digit or BPE tokenization for arithmetic tasks? We explore various tokenization methods in our upcoming blog (releasing next week 👀)!

🔹 Bringing objectivity to comparisons

Existing comparisons of number tokenization methods often ignore differences in models' compute budgets: a larger tokenizer vocabulary naturally means more parameters, which makes comparisons less objective because the bigger models simply do more "learning".

We addressed this by keeping architectures consistent but adjusting the number of hidden layers to produce roughly equal parameter counts.
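
As a back-of-the-envelope illustration (not the exact setup from the blog), a rough decoder-only parameter count shows how a bigger vocabulary can be offset by dropping a few layers:

```python
# Rough parameter count for a decoder-only transformer with tied embeddings
# (back-of-the-envelope, not the blog's exact architecture):
#   params ~= vocab_size * d_model           (embedding / LM head, tied)
#           + n_layers * 12 * d_model**2     (~4*d^2 attention + ~8*d^2 MLP)

def approx_params(vocab_size: int, d_model: int, n_layers: int) -> int:
    return vocab_size * d_model + n_layers * 12 * d_model ** 2

d = 2048  # hypothetical hidden size
print(approx_params(vocab_size=32_000, d_model=d, n_layers=24))   # ~1.27B
print(approx_params(vocab_size=128_000, d_model=d, n_layers=20))  # ~1.27B
# The larger-vocabulary model roughly matches the smaller one in total
# parameters once a few hidden layers are removed, which is the spirit of
# the compute-matched comparison described above.
```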

🔹 Key results

We trained models on the same data mix and evaluated their performance on various arithmetic tasks (digits, operations, floats vs. ints):

- When splitting evals based on operators, single-digit tokenization consistently outperformed other methods.
- Right-to-left tokenization (which I covered in a previous post) matched or exceeded left-to-right approaches in all tasks.

All in all, single-digit tokenization beats the other methods, and, echoing our previous post's finding, R2L works better than L2R tokenization, although the gap is not as significant as the one between single-digit and the rest!
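
For reference, a small sketch of the two chunking directions (illustrative only, not the blog's implementation):

```python
# Illustrative only (not the blog's code): group a digit string into 3-digit
# chunks left-to-right vs. right-to-left before handing it to the tokenizer.

def chunk_l2r(digits: str, size: int = 3) -> list[str]:
    return [digits[i:i + size] for i in range(0, len(digits), size)]

def chunk_r2l(digits: str, size: int = 3) -> list[str]:
    # Walk from the right so any short leftover chunk lands at the front,
    # matching how humans group numbers: 1,234,567 -> 1 | 234 | 567.
    chunks = []
    for end in range(len(digits), 0, -size):
        chunks.append(digits[max(end - size, 0):end])
    return chunks[::-1]

print(chunk_l2r("1234567"))  # ['123', '456', '7']
print(chunk_r2l("1234567"))  # ['1', '234', '567']
```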

The wait is almost over 🤗, the full report is coming next week - stay tuned!
loubnabnl
posted an update about 1 month ago
Making SmolLM2 reproducible: open-sourcing our training & evaluation toolkit 🛠️ https://github.com/huggingface/smollm/

- Pre-training code with nanotron
- Evaluation suite with lighteval
- Synthetic data generation using distilabel (powers our new SFT dataset HuggingFaceTB/smoltalk)
- Post-training scripts with TRL & the alignment handbook
- On-device tools with llama.cpp for summarization, rewriting & agents

Apache 2.0 licensed. V2 pre-training data mix coming soon!

Which other tools should we add next?
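
In the meantime, here is a minimal sketch of pulling the SFT dataset mentioned above; the "all" config and the "messages" column are assumptions here, so check the HuggingFaceTB/smoltalk dataset card for the exact schema.

```python
# Minimal sketch: load the SmolTalk SFT mix.
# The "all" config name and the "messages" chat column are assumptions --
# verify them against the HuggingFaceTB/smoltalk dataset card.
from datasets import load_dataset

smoltalk = load_dataset("HuggingFaceTB/smoltalk", "all", split="train")
print(smoltalk)                       # dataset size and columns
print(smoltalk[0]["messages"][:2])    # first two turns of the first example
```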
thomwolf
posted an update about 1 month ago
SaylorTwift
posted an update about 1 month ago
thomwolf
posted an update about 1 month ago
thomwolf
posted an update 2 months ago
Parents in the 1990s: Teach the kids to code
Parents now: Teach the kids to fix the code when it starts walking around 🤖✨