Masking the image tokens during training
#68, opened by jchiu1234
Should the image tokens in the decoder output be masked during training? I'm trying to wrap my head around this. One argument against masking is that the model might need to learn that the preceding inputs are images. Could someone advise whether the model should be trained with the image tokens masked or unmasked?
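To make the question concrete, here is a minimal sketch of one common reading of "masking": excluding image-token positions from the loss by setting their labels to an ignore value. This is an assumption about what is being asked, not this model's actual training code. `IMAGE_TOKEN_ID` and both helper functions are hypothetical names chosen for illustration; `-100` is used because it is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
# Sketch only: loss masking for image-token positions, in plain Python.
IGNORE_INDEX = -100      # default ignore_index of PyTorch's CrossEntropyLoss
IMAGE_TOKEN_ID = 32000   # hypothetical placeholder id, not taken from any real tokenizer


def mask_image_labels(input_ids, image_token_id=IMAGE_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Build labels from input_ids, replacing image-token positions with ignore_index."""
    return [ignore_index if tok == image_token_id else tok for tok in input_ids]


def masked_mean_loss(per_token_losses, labels, ignore_index=IGNORE_INDEX):
    """Average per-token losses over the positions whose label is not ignored."""
    kept = [loss for loss, lab in zip(per_token_losses, labels) if lab != ignore_index]
    return sum(kept) / len(kept) if kept else 0.0


input_ids = [1, 32000, 32000, 32000, 450, 6635, 2]
labels = mask_image_labels(input_ids)
print(labels)  # [1, -100, -100, -100, 450, 6635, 2]
```

Under this reading, only `labels` changes; `input_ids` still contains the image tokens, so the model still receives them as input and is simply not penalized for failing to predict them. Whether that is the right choice for this particular model is the open question here.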