BEiT: Image Classification

BEiT (Bidirectional Encoder representation from Image Transformers), proposed by Microsoft, is a vision pretraining model that learns image representations via Masked Image Modeling (MIM). It splits an image into patches, randomly masks a subset of them, and trains a Transformer to predict the visual tokens of the masked patches, mirroring BERT's masked language modeling on text. A visual tokenizer encodes patches into discrete tokens that serve as prediction targets, which encourages global context understanding. BEiT outperforms supervised pretraining baselines on ImageNet classification and ADE20K semantic segmentation, and its self-supervised pretraining reduces reliance on labeled data. Well suited to image understanding, medical imaging, and multimodal extensions, it offers a versatile framework for efficient visual representation learning.
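The patch-and-mask step described above can be sketched as follows. This is an illustrative NumPy sketch, not the model's actual implementation: the 16x16 patch size matches the common BEiT-base configuration, but the 40% mask ratio and random (rather than blockwise) masking are simplifying assumptions.

```python
import numpy as np

def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split a (C, H, W) image into flattened (num_patches, C*patch*patch) patches."""
    c, h, w = image.shape
    assert h % patch == 0 and w % patch == 0
    # (C, H/p, p, W/p, p) -> (H/p, W/p, C, p, p) -> (N, C*p*p)
    x = image.reshape(c, h // patch, patch, w // patch, patch)
    x = x.transpose(1, 3, 0, 2, 4)
    return x.reshape(-1, c * patch * patch)

def random_mask(num_patches: int, ratio: float = 0.4, seed: int = 0) -> np.ndarray:
    """Boolean mask marking which patches are hidden from the encoder.

    The 0.4 ratio is an illustrative assumption, not BEiT's exact setting.
    """
    rng = np.random.default_rng(seed)
    n_mask = int(num_patches * ratio)
    idx = rng.choice(num_patches, size=n_mask, replace=False)
    mask = np.zeros(num_patches, dtype=bool)
    mask[idx] = True
    return mask

image = np.zeros((3, 224, 224), dtype=np.float32)  # matches the 1x3x224x224 input
patches = patchify(image)          # (196, 768): a 14x14 grid of 16x16x3 patches
mask = random_mask(len(patches))   # True where a patch is masked
```

During pretraining, the Transformer sees the unmasked patches plus mask embeddings and is trained to predict the visual token of each masked position.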

Source model

  • Input shape: 1x3x224x224
  • Number of parameters: 82.52M
  • Model size: 360.02MB
  • Output shape: 1x1000

The source model can be found here

Performance Reference

Please search for the model by name in Model Farm

Inference & Model Conversion

Please search for the model by name in Model Farm

License
