MobileClip-S2: Image Captioning

MobileClip-S2 is a lightweight image-text matching model designed for mobile and other resource-constrained devices. It is an optimized variant of CLIP (Contrastive Language-Image Pre-training) that uses contrastive learning to map images and text into a shared feature space, enabling efficient cross-modal retrieval and understanding. Compared with the standard CLIP model, MobileClip-S2 is smaller and less computationally demanding, making it well suited to fast on-device inference. The model is widely used in image search, image-text matching, and multimodal AI applications, where its joint processing of images and text supports tasks such as image classification and image caption generation.
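
For illustration, the snippet below shows how the shared embedding space is typically used for zero-shot classification. It is a minimal sketch based on the Python API of Apple's ml-mobileclip package; the checkpoint path, image file, and candidate labels are placeholder assumptions.

```python
import torch
from PIL import Image
import mobileclip  # installable from https://github.com/apple/ml-mobileclip

# Load MobileCLIP-S2 and its preprocessing transform (checkpoint path is a placeholder).
model, _, preprocess = mobileclip.create_model_and_transforms(
    "mobileclip_s2", pretrained="checkpoints/mobileclip_s2.pt"
)
tokenizer = mobileclip.get_tokenizer("mobileclip_s2")
model.eval()

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)  # 1x3x256x256
text = tokenizer(["a photo of a dog", "a photo of a cat", "a diagram"])    # 3x77

with torch.no_grad():
    image_features = model.encode_image(image)  # 1x512
    text_features = model.encode_text(text)     # 3x512
    # Normalize so the dot product equals cosine similarity in the shared space.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probabilities:", probs)
```

The label with the highest probability is the image's zero-shot classification; the same normalized embeddings can be reused directly for image search and image-text retrieval.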

Source model

  • Input shape: 1x3x256x256 (image), 1x77 (text tokens)
  • Number of parameters: 35.7M (image encoder), 63.4M (text encoder)
  • Model size: 141 MB (image encoder), 243.77 MB (text encoder)
  • Output shape: 1x512 (image embedding), 1x512 (text embedding); see the shape check below
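
As a quick sanity check of these shapes, the sketch below runs dummy inputs through ONNX exports of the two encoders. The file names and the token dtype are assumptions for illustration, not artifacts shipped with this card.

```python
import numpy as np
import onnxruntime as ort

# Hypothetical file names for locally exported image and text encoders.
image_sess = ort.InferenceSession("mobileclip_s2_image.onnx")
text_sess = ort.InferenceSession("mobileclip_s2_text.onnx")

dummy_image = np.zeros((1, 3, 256, 256), dtype=np.float32)  # 1x3x256x256
dummy_tokens = np.zeros((1, 77), dtype=np.int64)            # 1x77; dtype may differ per export

image_out = image_sess.run(None, {image_sess.get_inputs()[0].name: dummy_image})[0]
text_out = text_sess.run(None, {text_sess.get_inputs()[0].name: dummy_tokens})[0]

# Both encoders project into the same 512-dimensional embedding space.
assert image_out.shape == (1, 512)
assert text_out.shape == (1, 512)
```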

Source model repository: MobileClip-S2 (https://github.com/apple/ml-mobileclip)

Performance Reference

Please search for the model by name in Model Farm.

Inference & Model Conversion

Please search for the model by name in Model Farm.
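
For reference, a common first step when converting the source model for on-device deployment is an ONNX export of each encoder. The sketch below exports the image encoder with the 1x3x256x256 input shape listed above; it is an assumed workflow, not the Model Farm conversion pipeline, and the checkpoint and output paths are placeholders.

```python
import torch
import mobileclip

model, _, _ = mobileclip.create_model_and_transforms(
    "mobileclip_s2", pretrained="checkpoints/mobileclip_s2.pt"
)
model.eval()

class ImageEncoder(torch.nn.Module):
    """Wraps the CLIP model so only the image branch is traced for export."""
    def __init__(self, clip_model):
        super().__init__()
        self.clip_model = clip_model

    def forward(self, x):
        return self.clip_model.encode_image(x)

dummy = torch.zeros(1, 3, 256, 256)
torch.onnx.export(
    ImageEncoder(model),
    dummy,
    "mobileclip_s2_image.onnx",
    input_names=["image"],
    output_names=["image_features"],
    opset_version=17,
)
```

The text encoder can be exported the same way with a 1x77 integer token input; the resulting ONNX graphs can then be fed to a device-specific converter.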

License
