Just read the ViTMAE paper, sharing some highlights 🧶

ViTMAE is a simple yet effective self-supervised pre-training technique, where the authors combined a vision transformer with a masked autoencoder. The images are first masked (75 percent of the image!) and then the model learns features by trying to reconstruct the original image!

![image_1](image_1.jpg)

The image isn't fed to the encoder with the masked patches blanked out; rather, only the visible patches are fed to the encoder (and that is the only thing the encoder sees!). Next, a mask token is added at the position of each masked patch (a bit like BERT, if you will), and the mask tokens and encoded patches are fed to the decoder. The decoder then tries to reconstruct the original image.

![image_2](image_2.jpg)

The authors found that a high masking ratio works well for both fine-tuning on downstream tasks and linear probing 🤯🤯

![image_3](image_3.jpg)

If you want to try the model or fine-tune it, all the pre-trained ViTMAE models released by Meta are available on [Hugging Face](https://t.co/didvTL9Zkm). We've built a [demo](https://t.co/PkuACJiKrB) for you to see the intermediate outputs and the reconstruction by ViTMAE. There's also a nice [notebook](https://github.com/NielsRogge/Transformers-Tutorials/blob/master/ViTMAE/ViT_MAE_visualization_demo.ipynb) by [@NielsRogge](https://twitter.com/NielsRogge).

![image_4](image_4.jpg)

> [!TIP]
> Resources:
> [Masked Autoencoders Are Scalable Vision Learners](https://arxiv.org/abs/2111.06377v3) by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick (2021)
> [GitHub](https://github.com/facebookresearch/mae)
> [Hugging Face documentation](https://huggingface.co/docs/transformers/model_doc/vit_mae)

> [!NOTE]
> [Original tweet](https://twitter.com/mervenoyann/status/1740688304784183664) (December 29, 2023)
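
To make the masking and decoder steps described above a bit more concrete, here is a minimal pure-Python sketch of the bookkeeping only. It is not the authors' implementation: the function names `random_masking` and `build_decoder_input`, the 224x224 image with 16x16 patches, and the string stand-ins for the encoder output and the mask token are all assumptions made for illustration.

```python
import random


def random_masking(num_patches, mask_ratio=0.75, seed=None):
    """Randomly split patch indices into a visible set and a masked set."""
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_keep = int(num_patches * (1 - mask_ratio))
    visible = sorted(order[:num_keep])
    masked = sorted(order[num_keep:])
    return visible, masked


def build_decoder_input(encoded_visible, visible, num_patches, mask_token="[MASK]"):
    """Put encoded patches back at their positions; fill the rest with the mask token."""
    sequence = [mask_token] * num_patches
    for position, token in zip(visible, encoded_visible):
        sequence[position] = token
    return sequence


# Assumed sizes: a 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
num_patches = (224 // 16) ** 2
visible, masked = random_masking(num_patches, mask_ratio=0.75, seed=0)

# Stand-in for the encoder: it only ever receives the visible patches.
encoded_visible = [f"enc({i})" for i in visible]

# The decoder receives the full-length sequence: encoded patches plus mask tokens.
decoder_input = build_decoder_input(encoded_visible, visible, num_patches)

print(len(visible), len(masked), len(decoder_input))  # 49 147 196
```

With a 75 percent masking ratio the encoder processes only 49 of the 196 patches, while the decoder sees all 196 positions.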