Interesting work! I want to use the alignment between images and text in the encoder of this model for downstream tasks. How should I use it?
+1. Is it possible to use only the visual encoder for downstream tasks, such as classification?