datasets
Hi there, is there a complete list of the datasets used? Tool calling, for example, does not seem to be part of any of the linked datasets, and I wonder whether anything else was omitted too. Thanks!
Hi, we will release all post-training datasets, including tool calling, in a few days. For pretraining, you can find the datasets in this collection and the configs with weights here.
Thanks!
Hi, could you clarify where I can find these datasets from stage1_8T.yaml?
- /scratch/smollm3-data-part1/pull-requests
- /scratch/smollm3-data-part1/kaggle
- /scratch/smollm3-data-part1/jupyter-scripts
- /scratch/smollm3-data-part1/github-issues
I think two of them (kaggle and github-issues) are from HuggingFaceTB/issues-kaggle-notebooks, but what about pull-requests and jupyter-scripts?