NaN values when used as embedding
I'm using this as an image embedding for a vision language model, but when I train it I get NaN values returned from this. I know it's not a problem with my hyperparameters because it occurs before the first backpropagation.
Transforms:
Compose([
    Resize(size=(384, 384)),
    ToImage(),
    ToDtype(torch.float32),
    Normalize(mean=[1., 1., 1.], std=[1., 1., 1.]),
])
I really don't know what other information to provide, just let me know if you need anything in specific.
Have you tried the default normalization values? Normalize(mean=[1., 1., 1.], std=[1., 1., 1.]) seems off:
https://huggingface.co/timm/ViT-SO400M-14-SigLIP-384/blob/main/open_clip_config.json#L32
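If I'm reading that config right, it specifies a per-channel mean and std of 0.5 for this checkpoint. A minimal sketch of the same pipeline with those values, assuming torchvision's v2 transforms:

```python
import torch
from torchvision.transforms import v2

# Sketch only: same pipeline as above, but with SigLIP-style normalization.
# Note that ToDtype needs scale=True to map uint8 [0, 255] down to [0, 1]
# before Normalize is applied.
transforms = v2.Compose([
    v2.Resize(size=(384, 384)),
    v2.ToImage(),
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.5, 0.5, 0.5], std=[0.5, 0.5, 0.5]),
])
```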
Ideally, you should use the preprocessor provided with the model. Check the preprocess object in the model card.
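That is, something along these lines, following the model card (assuming open_clip is installed; the image path is a placeholder):

```python
from PIL import Image
from open_clip import create_model_from_pretrained

# The returned `preprocess` transform already carries the resize, rescale,
# and normalization that match this checkpoint.
model, preprocess = create_model_from_pretrained('hf-hub:timm/ViT-SO400M-14-SigLIP-384')

image = Image.open('example.jpg')               # placeholder path
pixel_values = preprocess(image).unsqueeze(0)   # shape: (1, 3, 384, 384)
```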
I have tried multiple normalize values for the preprocessor, including the default, but all return NaN values. I took the same preprocessor config that moondream (which uses this model) uses, and it works there, so I don't know why it wouldn't work for me.
Did you mean ToTensor instead of ToImage in your code?
I’ll try doing that today, but the transforms are already turning the image into a tensor, so I believe it might have to do with something else. I’ll put that on my to-do list. For now I’ll try to find other possible fixes.
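As an aside, a minimal sketch of the distinction, assuming torchvision's v2 API: ToTensor() converts to a tensor and rescales to [0, 1] in one step, while ToImage() only converts, so the rescaling has to be requested explicitly.

```python
import torch
from torchvision.transforms import v2

# Rough equivalent of the old ToTensor(): ToImage() converts to a tensor
# without changing the value range, and ToDtype(..., scale=True) adds the
# uint8 [0, 255] -> float [0, 1] rescaling that ToTensor() did implicitly.
to_tensor_equivalent = v2.Compose([
    v2.ToImage(),
    v2.ToDtype(torch.float32, scale=True),
])
```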
Can you provide a reproducible example (notebook preferred)? Shouldn't be too hard to debug
@Locutusque If it's just forward as you say, a pickle of an input batch that generates the NaN would be helpful, or a concise code only repro with random tensors, etc. I have not seen this issue and haven't heard about it from anyone else. The model appears stable w/ inputs of much higher magnitude than a normalized image.
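Something along these lines would do as a forward-only repro, assuming the timm name 'vit_so400m_patch14_siglip_384' corresponds to this checkpoint:

```python
import torch
import timm

# Sketch of a minimal repro with random tensors: run the unmodified vision
# tower on a random batch and check for NaNs, with no training loop involved.
model = timm.create_model('vit_so400m_patch14_siglip_384', pretrained=True, num_classes=0)
model.eval()

x = torch.randn(4, 3, 384, 384)  # stand-in for a batch of normalized images
with torch.no_grad():
    features = model(x)

print(torch.isnan(features).any())  # tensor(False) if the tower itself is fine
```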
Most images from HaoyeZhang/RLHF-V-Dataset cause the NaN values, except a few where I get a lucky shuffle and the training goes fine; 49 times out of 50 it returns NaN values. I can’t really show much code here because I’m not home and don’t have the source, but I can try to describe what it does. The model class extends a causal LM model class and adds a vision embedding, which consists of this model and an MLP. The NaNs come directly from this model (before the MLP): I printed the tensors, and all of the elements were NaN. Since you mentioned the model is stable with inputs of much higher magnitude than a normalized image, might the values be too small? I’ll try getting rid of the normalization.
@Locutusque unless you can isolate the vision model, unmodified, and show that a valid input causes a NaN in forward, there is very little likelihood that it's this model (irrespective of training hparams and training details) causing the NaN.
You said it's 'not training', but if the NaN only happens while training, then training is likely making the weights unstable as they are being updated (causing large increases in magnitude, etc.). So it does seem that it is hparams, data, etc.; otherwise it'd be possible to isolate a valid input to forward that would cause the unmodified vision tower to NaN without involving any training process.
It turns out the weights were being reset when I was loading the language model with this model under the same module. All I had to do was load them separately to fix it. Thanks for the help!
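For anyone hitting the same thing, a rough sketch of that kind of fix; the class, model names, and MLP shape here are placeholders, not the original code:

```python
import timm
import torch.nn as nn
from transformers import AutoModelForCausalLM

class VisionLanguageModel(nn.Module):
    """Hypothetical sketch: each pretrained component is loaded separately,
    so loading the language model's weights does not reset the vision tower's."""

    def __init__(self, lm_name: str = 'gpt2'):  # placeholder LM checkpoint
        super().__init__()
        # Language model loaded with its own pretrained weights.
        self.lm = AutoModelForCausalLM.from_pretrained(lm_name)
        # Vision tower loaded separately with its own pretrained weights
        # (num_classes=0 keeps only the pooled image features).
        self.vision_tower = timm.create_model(
            'vit_so400m_patch14_siglip_384', pretrained=True, num_classes=0
        )
        # Projection MLP from vision features into the LM's embedding space.
        hidden = self.lm.config.hidden_size
        self.mlp = nn.Sequential(
            nn.Linear(self.vision_tower.num_features, hidden),
            nn.GELU(),
            nn.Linear(hidden, hidden),
        )
```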