BEiT: Image Classification

BEiT (Bidirectional Encoder representation from Image Transformers), proposed by Microsoft, is a vision pretraining model that learns image representations via Masked Image Modeling (MIM). It splits an image into patches, randomly masks a subset of them, and trains a Transformer to predict the visual tokens of the masked patches, mirroring BERT's masked language modeling on text. A visual tokenizer encodes patches into discrete tokens that serve as prediction targets, which encourages global context understanding. BEiT outperforms supervised pretraining baselines on ImageNet classification and ADE20K semantic segmentation, and its self-supervised pretraining reduces reliance on labeled data. Well suited to image understanding, medical imaging, and multimodal extensions, it offers a versatile framework for efficient visual representation learning.
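The patch-and-mask step described above can be sketched as follows. This is an illustrative NumPy sketch, not the model's actual implementation: the 16x16 patch size matches the common BEiT-base configuration, but the 40% mask ratio and random (rather than blockwise) masking are simplifying assumptions.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split a (C, H, W) image into flattened (num_patches, C*patch*patch) patches."""
    c, h, w = image.shape
    assert h % patch == 0 and w % patch == 0
    # (C, H/p, p, W/p, p) -> (H/p, W/p, C, p, p) -> (N, C*p*p)
    x = image.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 0, 2, 4)
    return x.reshape(-1, c * patch * patch)

def random_mask(num_patches: int, ratio: float = 0.4, seed: int = 0) -> np.ndarray:
    """Boolean mask marking which patches are hidden from the encoder.

    The 0.4 ratio is an illustrative assumption, not BEiT's exact setting.
    """
    rng = np.random.default_rng(seed)
    n_mask = int(num_patches * ratio)
    idx = rng.choice(num_patches, size=n_mask, replace=False)
    mask = np.zeros(num_patches, dtype=bool)
    mask[idx] = True
    return mask

image = np.zeros((3, 224, 224), dtype=np.float32)  # matches the 1x3x224x224 input
patches = patchify(image)          # (196, 768): a 14x14 grid of 16x16x3 patches
mask = random_mask(len(patches))   # True where a patch is masked
```

During pretraining, the Transformer sees the unmasked patches plus mask embeddings and is trained to predict the visual token of each masked position.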

Source model

  • Input shape: 1x3x224x224
  • Number of parameters: 82.52M
  • Model size: 360.02MB
  • Output shape: 1x1000

The source model can be found here

Performance Reference

Please search for the model by name in Model Farm

Inference & Model Conversion

Please search for the model by name in Model Farm

License
