Useless information
Why do so many papers, model cards etc. specify the number of GPUs used for training but fail to mention the duration of the training process? For instance, if 256 H100 GPUs were utilized, was the training completed in 2 seconds, 2 days, or 2 months?
Including the training duration alongside GPU details would provide essential context, particularly for individuals or organizations with fewer resources. It allows them to estimate how long training might take with their setup. If the required time is astronomical, training becomes impractical. Move on, nothing to see here. However, if the duration is reasonable, tinkering, reproducing results, and experimenting become possible. Specifying the training duration is valuable information.
Hey, here are the approximate GPU hours for the different model sizes:
- 1.7B (trained for 11T tokens): 140k
- 360M (trained for 4T tokens): 20k
- 135M (trained for 2T tokens): 5k
Note that you should be able to get better throughput per GPU by using fewer nodes, since those models are smol.
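
For anyone who wants to plug in their own GPU count, here is a minimal back-of-the-envelope sketch (Python, not from the thread; the `efficiency` knob is a hypothetical fudge factor) that converts the GPU-hour totals quoted above into rough wall-clock days:

```python
# Approximate GPU hours quoted above (model size -> total GPU hours).
GPU_HOURS = {
    "1.7B (11T tokens)": 140_000,
    "360M (4T tokens)": 20_000,
    "135M (2T tokens)": 5_000,
}

def wall_clock_days(total_gpu_hours: float, num_gpus: int, efficiency: float = 1.0) -> float:
    """Rough wall-clock days = GPU hours / (number of GPUs * relative efficiency).

    `efficiency` is an assumed fudge factor: < 1.0 if your per-GPU throughput is
    worse than the original run, > 1.0 if it is better (e.g. fewer nodes and
    less inter-node communication, as noted above).
    """
    return total_gpu_hours / (num_gpus * efficiency) / 24

if __name__ == "__main__":
    # Example: the 256 H100s mentioned in the question (~23 days for the 1.7B).
    for model, hours in GPU_HOURS.items():
        print(f"{model}: ~{wall_clock_days(hours, num_gpus=256):.1f} days on 256 GPUs")
```

This is only an estimate: real wall-clock time also depends on interconnect, batch size, and restarts, but it answers the "2 days or 2 months?" question from the original post to within an order of magnitude.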