MobileClip-S2: Image Captioning

MobileClip-S2 is a lightweight image-text matching model designed for mobile and other resource-constrained devices. It is an optimized variant of CLIP (Contrastive Language-Image Pre-training) that uses contrastive learning to map images and text into a shared feature space, enabling efficient cross-modal retrieval and understanding. Compared with the standard CLIP model, MobileClip-S2 is smaller and less computationally demanding, making it well suited to fast on-device inference. The model is widely used in image search, image-text matching, and multimodal AI applications, where its joint processing of images and text supports tasks such as image classification and image caption generation.
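
For illustration, the snippet below shows how the shared embedding space is typically used for zero-shot classification. It is a minimal sketch based on the Python API of Apple's ml-mobileclip package; the checkpoint path, image file, and candidate labels are placeholder assumptions.

```python
import torch
from PIL import Image
import mobileclip  # installable from https://github.com/apple/ml-mobileclip

# Load MobileCLIP-S2 and its preprocessing transform (checkpoint path is a placeholder).
model, _, preprocess = mobileclip.create_model_and_transforms(
    "mobileclip_s2", pretrained="checkpoints/mobileclip_s2.pt"
)
tokenizer = mobileclip.get_tokenizer("mobileclip_s2")
model.eval()

image = preprocess(Image.open("example.jpg").convert("RGB")).unsqueeze(0)  # 1x3x256x256
text = tokenizer(["a photo of a dog", "a photo of a cat", "a diagram"])    # 3x77

with torch.no_grad():
    image_features = model.encode_image(image)  # 1x512
    text_features = model.encode_text(text)     # 3x512
    # Normalize so the dot product equals cosine similarity in the shared space.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probabilities:", probs)
```

The label with the highest probability is the image's zero-shot classification; the same normalized embeddings can be reused directly for image search and image-text retrieval.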

Source model

  • Input shape: 1x3x256x256 (image), 1x77 (text tokens)
  • Number of parameters: 35.7M (image encoder), 63.4M (text encoder)
  • Model size: 141 MB (image encoder), 243.77 MB (text encoder)
  • Output shape: 1x512 (image embedding), 1x512 (text embedding); see the shape check below
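
As a quick sanity check of these shapes, the sketch below runs dummy inputs through ONNX exports of the two encoders. The file names and the token dtype are assumptions for illustration, not artifacts shipped with this card.

```python
import numpy as np
import onnxruntime as ort

# Hypothetical file names for locally exported image and text encoders.
image_sess = ort.InferenceSession("mobileclip_s2_image.onnx")
text_sess = ort.InferenceSession("mobileclip_s2_text.onnx")

dummy_image = np.zeros((1, 3, 256, 256), dtype=np.float32)  # 1x3x256x256
dummy_tokens = np.zeros((1, 77), dtype=np.int64)            # 1x77; dtype may differ per export

image_out = image_sess.run(None, {image_sess.get_inputs()[0].name: dummy_image})[0]
text_out = text_sess.run(None, {text_sess.get_inputs()[0].name: dummy_tokens})[0]

# Both encoders project into the same 512-dimensional embedding space.
assert image_out.shape == (1, 512)
assert text_out.shape == (1, 512)
```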

Source model repository: MobileClip-S2 (https://github.com/apple/ml-mobileclip)

Performance Reference

Please search for the model by name in Model Farm.

Inference & Model Conversion

Please search for the model by name in Model Farm.
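
For reference, a common first step when converting the source model for on-device deployment is an ONNX export of each encoder. The sketch below exports the image encoder with the 1x3x256x256 input shape listed above; it is an assumed workflow, not the Model Farm conversion pipeline, and the checkpoint and output paths are placeholders.

```python
import torch
import mobileclip

model, _, _ = mobileclip.create_model_and_transforms(
    "mobileclip_s2", pretrained="checkpoints/mobileclip_s2.pt"
)
model.eval()

class ImageEncoder(torch.nn.Module):
    """Wraps the CLIP model so only the image branch is traced for export."""
    def __init__(self, clip_model):
        super().__init__()
        self.clip_model = clip_model

    def forward(self, x):
        return self.clip_model.encode_image(x)

dummy = torch.zeros(1, 3, 256, 256)
torch.onnx.export(
    ImageEncoder(model),
    dummy,
    "mobileclip_s2_image.onnx",
    input_names=["image"],
    output_names=["image_features"],
    opset_version=17,
)
```

The text encoder can be exported the same way with a 1x77 integer token input; the resulting ONNX graphs can then be fed to a device-specific converter.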

License
