datasets

#7
by lucyknada - opened

Hi there, is there a complete list of the datasets used? Tool calling, for example, doesn't seem to be part of any of the linked datasets, and I wonder if anything else was omitted too. Thanks!

Hugging Face Smol Models Research org

Hi, we will release all post-training datasets, including tool calling, in a few days. For pretraining, you can find the datasets in this collection and the configs with weights here.

Hugging Face Smol Models Research org

We released all the datasets in SmolTalk2. You can find the full dataset list in the dataset card.

Hi, could you clarify where I can find these datasets from stage1_8T.yaml?
- /scratch/smollm3-data-part1/pull-requests
- /scratch/smollm3-data-part1/kaggle
- /scratch/smollm3-data-part1/jupyter-scripts
- /scratch/smollm3-data-part1/github-issues

I think two of them are from HuggingFaceTB/issues-kaggle-notebooks, but what about pull-requests and jupyter-scripts?
