BEiT: Image Classification
BEiT (Bidirectional Encoder representation from Image Transformers), proposed by Microsoft, is a vision pretraining model that learns image representations via Masked Image Modeling (MIM). It divides an image into patches, randomly masks a subset of them, and trains a Transformer to predict the visual tokens of the masked patches, mirroring BERT's masked language modeling for text. A visual tokenizer encodes patches into discrete tokens, which helps the model capture global context. BEiT outperforms comparable supervised models on ImageNet classification and ADE20K semantic segmentation, and its self-supervised pretraining reduces dependence on labeled data. It is well suited to image understanding, medical imaging, and multimodal extensions, offering a versatile framework for efficient visual representation learning.
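A minimal classification-inference sketch using the Hugging Face transformers library is shown below. The checkpoint name microsoft/beit-base-patch16-224 and the example image URL are assumptions for illustration; this model card does not name a specific checkpoint.

```python
from transformers import BeitImageProcessor, BeitForImageClassification
from PIL import Image
import requests

# Assumed checkpoint; the card does not specify which BEiT variant is used.
CHECKPOINT = "microsoft/beit-base-patch16-224"

# Example image (a COCO validation sample), used here only for illustration.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

processor = BeitImageProcessor.from_pretrained(CHECKPOINT)
model = BeitForImageClassification.from_pretrained(CHECKPOINT)

# Preprocessing resizes and normalizes to a 1x3x224x224 tensor.
inputs = processor(images=image, return_tensors="pt")
outputs = model(**inputs)
logits = outputs.logits  # 1x1000 ImageNet class scores

predicted_class = logits.argmax(-1).item()
print(model.config.id2label[predicted_class])
```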
Source model
- Input shape: 1x3x224x224
- Number of parameters: 82.52M
- Model size: 360.02MB
- Output shape: 1x1000
The source model can be found here. A quick check of the shapes and parameter count against the specifications above is sketched below.
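The following sanity-check sketch runs a dummy tensor through the model and prints the output shape and parameter count. It again assumes the hypothetical microsoft/beit-base-patch16-224 checkpoint, whose parameter count may differ slightly from the 82.52M listed here depending on the exact source variant.

```python
import torch
from transformers import BeitForImageClassification

# Assumed checkpoint; the actual source model variant is not named in this card.
model = BeitForImageClassification.from_pretrained("microsoft/beit-base-patch16-224")
model.eval()

# Dummy input matching the documented 1x3x224x224 input shape.
dummy = torch.randn(1, 3, 224, 224)
with torch.no_grad():
    logits = model(pixel_values=dummy).logits

print(logits.shape)  # torch.Size([1, 1000]), matching the documented output shape
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.2f}M parameters")
```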
Performance Reference
Please search for the model by name in Model Farm
Inference & Model Conversion
Please search for the model by name in Model Farm
License
Source Model: BSD-3-Clause
Deployable Model: APLUX-MODEL-FARM-LICENSE