Useless information
Why do so many papers, model cards etc. specify the number of GPUs used for training but fail to mention the duration of the training process? For instance, if 256 H100 GPUs were utilized, was the training completed in 2 seconds, 2 days, or 2 months?
Including the training duration alongside GPU details would provide essential context, particularly for individuals or organizations with fewer resources. It allows them to estimate how long training might take with their setup. If the required time is astronomical, training becomes impractical. Move on, nothing to see here. However, if the duration is reasonable, tinkering, reproducing results, and experimenting become possible. Specifying the training duration is valuable information.
Hey, here are the approximate GPU hours for the different model sizes:
- 1.7B (trained for 11T tokens): 140k
- 360M (trained for 4T tokens): 20k
- 135M (trained for 2T tokens): 5k
Note that you should be able to get better throughput per GPU by using fewer nodes, since those models are smol.
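
For anyone who wants to plug in their own GPU count, here is a minimal back-of-the-envelope sketch (Python, not from the thread; the `efficiency` knob is a hypothetical fudge factor) that converts the GPU-hour totals quoted above into rough wall-clock days:

```python
# Approximate GPU hours quoted above (model size -> total GPU hours).
GPU_HOURS = {
    "1.7B (11T tokens)": 140_000,
    "360M (4T tokens)": 20_000,
    "135M (2T tokens)": 5_000,
}

def wall_clock_days(total_gpu_hours: float, num_gpus: int, efficiency: float = 1.0) -> float:
    """Rough wall-clock days = GPU hours / (number of GPUs * relative efficiency).

    `efficiency` is an assumed fudge factor: < 1.0 if your per-GPU throughput is
    worse than the original run, > 1.0 if it is better (e.g. fewer nodes and
    less inter-node communication, as noted above).
    """
    return total_gpu_hours / (num_gpus * efficiency) / 24

if __name__ == "__main__":
    # Example: the 256 H100s mentioned in the question (~23 days for the 1.7B).
    for model, hours in GPU_HOURS.items():
        print(f"{model}: ~{wall_clock_days(hours, num_gpus=256):.1f} days on 256 GPUs")
```

This is only an estimate: real wall-clock time also depends on interconnect, batch size, and restarts, but it answers the "2 days or 2 months?" question from the original post to within an order of magnitude.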