Loss exploding to NaN
Hello,
I tried to use these hyperparameters to replicate the ImageNet results. Before this I tested the hparams for vit_wee, and the result matched almost exactly. However, when I use these hparams for the vit_little model, the training loss quickly stalls (around epoch 10-15), then rises and explodes to NaN (around epoch 20-25).
Is this a problem with the hyperparameters, an internal problem within the timm library, or could it even be a problem with my (CUDA) installation?
Best regards
Tony
I cloned the main branch on commit https://github.com/huggingface/pytorch-image-models/commit/8d41071da68055f128a01070a998712e637c4963
I just now see that timm uses release tags instead of dev branches, so I will try again on release 1.0.19. If this issue was already known at that time, I would still appreciate a link to a GitHub issue or PR.
@tony0278611 little things can make training unstable, and when you're on the edge, different runs on different systems or different PyTorch versions can blow up or stay okay.
Could try:
- revise LR down a bit
- lower grad clipping to 1.0
- use bfloat16 as the low precision dtype if you're on Ampere or later (`--amp-dtype bfloat16`)
- change optimizer from `nadamw` -> `adamw` ... nadamw does show a bit more variability in stability
- change your seed
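Roughly, those suggestions map onto the `train.py` flags like this. The values and the model name below are just illustrative placeholders to show where each tweak goes, not the released hparams; adjust them to your own run:

```bash
# Illustrative sketch only: combines the stability tweaks above into one run.
# Model name, data path, and exact values are placeholders, not the published hparams.
python train.py /data/imagenet \
  --model vit_little_patch16_reg4_gap_256 \
  --amp --amp-dtype bfloat16 \
  --opt adamw \
  --lr 6e-4 \
  --clip-grad 1.0 \
  --seed 123
```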
Thanks a lot for your advice, I will try the amp-dtype first. Am I understanding correctly that the hparams are chosen to be on the edge of stability, and that NaN losses aren't necessarily a sign of something wrong with my setup?
Changing to bfloat16 did the trick! I guess the LR was not the problem, but rather vanishing gradients in float16.
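For anyone else hitting this, the numeric ranges of the two autocast dtypes show why float16 is so much more fragile here (plain PyTorch, nothing timm-specific): bfloat16 keeps float32's exponent range, so values that under- or overflow in float16 stay finite:

```bash
# Compare the representable ranges of the two low-precision dtypes.
python -c "import torch; print(torch.finfo(torch.float16)); print(torch.finfo(torch.bfloat16))"
# float16:  max ~65504,    smallest normal ~6.1e-05
# bfloat16: max ~3.39e+38, smallest normal ~1.18e-38 (same exponent range as float32)
```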
Yeah, I find that being close(ish) to unstable at the max LR for the schedule yields good results. But that means you may have to tweak things a bit across varying setups...
Thanks for your help. I ended up tweaking all the parameters you mentioned (lowered the LR from 8e-4 to 6e-4) and it seems to be stable. I will report the end result (training mediumd now instead of little, because the cost effectiveness seemed way better).
Is there some information on how you performed the hyperparameter search? In the standard training script, the in1k validation set is used for validation during training. Did you split off a separate validation set for the hyperparameter search to avoid overfitting on the official validation set?
Specifically, your advice to
- change your seed
confused me a little, since I am looking for modifications that perform well across various seeds.
Best regards, and thank you once again.
Tony