MobileClip-S2: Image Captioning
MobileClip-S2 is a lightweight image-text matching model designed for mobile and other resource-constrained devices. It is an optimized variant of CLIP (Contrastive Language-Image Pre-training): trained with a contrastive objective, it maps image and text semantics into a shared feature space, enabling efficient cross-modal retrieval and understanding. Compared to the standard CLIP model, MobileClip-S2 is smaller and requires less compute, making it well suited to fast on-device inference. The model is widely used in image search, image-text matching, and multimodal AI applications, supporting joint processing of images and text for tasks such as zero-shot image classification and image caption generation.
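As an illustration of how the shared embedding space is used, the sketch below runs zero-shot classification with the `mobileclip` Python package from the source repository; the checkpoint path, image file, and candidate labels are placeholder assumptions, not part of this card.

```python
import torch
from PIL import Image
import mobileclip  # package from the MobileCLIP source repository

# Load model, preprocessing transform, and tokenizer (checkpoint path is a placeholder).
model, _, preprocess = mobileclip.create_model_and_transforms(
    "mobileclip_s2", pretrained="checkpoints/mobileclip_s2.pt"
)
tokenizer = mobileclip.get_tokenizer("mobileclip_s2")
model.eval()

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)  # 1x3x256x256
text = tokenizer(["a photo of a cat", "a photo of a dog", "a diagram"])    # Nx77 token IDs

with torch.no_grad():
    image_features = model.encode_image(image)  # 1x512
    text_features = model.encode_text(text)     # Nx512
    # Normalize so cosine similarity reduces to a dot product.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Softmax over scaled similarities yields per-label probabilities.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # highest probability on the label that best matches the image
```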
Source model
- Input shape: 1x3x256x256 (image encoder), 1x77 (text encoder)
- Number of parameters: 35.7M (image encoder), 63.4M (text encoder)
- Model size: 141 MB (image encoder), 243.77 MB (text encoder)
- Output shape: 1x512 (image embedding), 1x512 (text embedding)
Source model repository: MobileClip-S2
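As a quick sanity check of the shapes listed above, dummy tensors can be passed through the two encoders; this sketch again assumes the `mobileclip` package and a placeholder checkpoint path.

```python
import torch
import mobileclip  # package from the MobileCLIP source repository

model, _, _ = mobileclip.create_model_and_transforms(
    "mobileclip_s2", pretrained="checkpoints/mobileclip_s2.pt"
)
model.eval()

dummy_image = torch.randn(1, 3, 256, 256)            # image encoder input: 1x3x256x256
dummy_tokens = torch.zeros(1, 77, dtype=torch.long)  # text encoder input: 1x77 token IDs

with torch.no_grad():
    img_emb = model.encode_image(dummy_image)
    txt_emb = model.encode_text(dummy_tokens)

print(img_emb.shape, txt_emb.shape)  # expected: torch.Size([1, 512]) for both
```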
Performance Reference
Please search for the model by name in Model Farm.
Inference & Model Conversion
Please search for the model by name in Model Farm.
License
Source Model: MIT
Deployable Model: APLUX-MODEL-FARM-LICENSE