# MobileCLIP TFLite Models (anton96vice/mobileclip2_tflite)
This repository hosts TFLite-quantized versions of MobileCLIP models, designed for efficient on-device inference. MobileCLIP is a family of CLIP (Contrastive Language-Image Pre-training) models optimized for mobile and edge devices.
## Available Models
The following TFLite models are included:
- mobileclip_b_datacompdr_first.tflite
- mobileclip_b_datacompdr_last.tflite
- mobileclip_b_datacompdr_lt_first.tflite
- mobileclip_b_datacompdr_lt_last.tflite
- mobileclip_s1_datacompdr_first.tflite
- mobileclip_s1_datacompdr_last.tflite
- mobileclip_s2_datacompdr_first.tflite
- mobileclip_s2_datacompdr_last.tflite
These models can be used for tasks like:
- Zero-shot image classification
- Image-text retrieval
- Semantic search
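For example, the semantic-search case reduces to ranking stored image embeddings by cosine similarity against a text query embedding once both have been produced with these models (see the inference guide below). A minimal NumPy sketch of that ranking step:

```python
import numpy as np

def rank_by_similarity(query_embedding, image_embeddings):
    """Rank gallery image embeddings against a single query embedding.

    query_embedding: shape (embed_dim,); image_embeddings: shape (num_images, embed_dim).
    Both are L2-normalized here so the dot product equals cosine similarity.
    """
    q = query_embedding / np.linalg.norm(query_embedding)
    g = image_embeddings / np.linalg.norm(image_embeddings, axis=1, keepdims=True)
    scores = g @ q                       # cosine similarity per stored image
    return np.argsort(-scores), scores   # best-matching indices first, plus raw scores
```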
## Performance Visualization (Featured)
The chart below illustrates the size vs. performance trade-offs for some of these models:

![Combined size vs. performance](images/combined_size_vs_performance_xkcd.png)
Additional visualizations and images can be found in the `images/` folder of this repository.
## How to Use (TFLite Inference)
Using these TFLite models involves loading them with a TensorFlow Lite interpreter and then performing inference to get image and text embeddings. Below is a conceptual guide. You'll need to adapt it to your specific platform (Python, Android, iOS, etc.) and the exact input/output signatures of these models.
Key Steps:
**1. Download a Model**

Choose a `.tflite` model from this repository. You can download it using `curl` or directly from the "Files and versions" tab.

```bash
curl -L -O https://huggingface.co/anton96vice/mobileclip2_tflite/resolve/main/mobileclip_s2_datacompdr_last.tflite
```
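Alternatively, from Python, the `huggingface_hub` library can fetch the same file into the local cache (the filename below is just one of the models listed above):

```python
from huggingface_hub import hf_hub_download

# Downloads the file into the local Hugging Face cache and returns its path.
model_path = hf_hub_download(
    repo_id="anton96vice/mobileclip2_tflite",
    filename="mobileclip_s2_datacompdr_last.tflite",
)
print(model_path)
```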
**2. Initialize the TFLite Interpreter**

Load the chosen `.tflite` model file.

```python
import tensorflow as tf
import numpy as np
# from PIL import Image  # For image preprocessing

# Load the TFLite model and allocate tensors.
interpreter = tf.lite.Interpreter(model_path="mobileclip_s2_datacompdr_last.tflite")  # Or your chosen model file
interpreter.allocate_tensors()

# Get input and output tensor details.
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
print("Input Details:", input_details)
print("Output Details:", output_details)
```
- Inspect `input_details` and `output_details` carefully! They tell you the expected shape, data type (e.g., `float32`, `int32`), and names of the input and output tensors. MobileCLIP models typically have separate inputs/outputs for the image tower and the text tower.
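If you are deploying somewhere that installing full TensorFlow is impractical, the lighter `tflite-runtime` package exposes the same `Interpreter` API; a minimal sketch (package availability depends on your platform):

```python
# Same Interpreter API without a full TensorFlow install (pip install tflite-runtime).
from tflite_runtime.interpreter import Interpreter

interpreter = Interpreter(model_path="mobileclip_s2_datacompdr_last.tflite")
interpreter.allocate_tensors()
input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()
```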
**3. Preprocess Inputs**

**Image Input:**
- Load and Resize: Load your image (e.g., using PIL/Pillow in Python). Resize it to the dimensions expected by the model's image tower (e.g., 224x224 or 256x256; check the model documentation or `input_details`).
- Normalize: Normalize pixel values. Common normalization for CLIP models is to scale to `[0, 1]` and then normalize with the ImageNet mean and standard deviation.
- Convert to Tensor: Convert the preprocessed image to a NumPy array with the correct shape (e.g., `[1, height, width, 3]`) and data type (`float32`).
```python
# Example image preprocessing (conceptual - adapt to your model's specifics)
# from PIL import Image
#
# def preprocess_image(image_path, input_shape):
#     img = Image.open(image_path).convert('RGB')
#     # Assuming NHWC input shape [1, H, W, 3]; PIL's resize expects (width, height).
#     img = img.resize((input_shape[2], input_shape[1]))
#     img_array = np.array(img, dtype=np.float32) / 255.0  # Scale to [0, 1]
#
#     # Example normalization (adjust if different for your model)
#     mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
#     std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
#     img_array = (img_array - mean) / std
#
#     return np.expand_dims(img_array, axis=0)  # Add batch dimension

# Assuming image_input_index is the index of the image tensor in input_details
# image_input_tensor_index = input_details[image_input_index]['index']
# image_input_shape = input_details[image_input_index]['shape']
# preprocessed_image = preprocess_image("your_image.jpg", image_input_shape)
# interpreter.set_tensor(image_input_tensor_index, preprocessed_image)
```
**Text Input:**
- Tokenize: Convert your text prompts into token IDs using the specific tokenizer associated with the MobileCLIP variant you are using. This is a critical step: the tokenizer vocabulary and tokenization process must match what the model was trained with. You may need to obtain the original model's tokenizer (e.g., from Hugging Face Transformers or the original MobileCLIP repository).
- Pad/Truncate: Ensure the sequence of token IDs has the fixed length expected by the model's text tower (check `input_details`). Pad shorter sequences or truncate longer ones.
- Convert to Tensor: Convert the token IDs to a NumPy array with the correct shape (e.g., `[1, sequence_length]`) and data type (`int32`).
```python
# Example text preprocessing (conceptual - you'll need the correct tokenizer)
# def tokenize_text(text, tokenizer, max_length):
#     # This is highly dependent on the actual tokenizer used by MobileCLIP.
#     # For example, with a Hugging Face tokenizer:
#     # inputs = tokenizer(text, return_tensors="np", padding="max_length", truncation=True, max_length=max_length)
#     # return inputs['input_ids']  # Or other relevant tokenizer output
#     # For a simple BPE or SentencePiece tokenizer, it would be different.
#     # Placeholder for conceptual demonstration:
#     token_ids = np.random.randint(0, 30000, size=(1, max_length), dtype=np.int32)  # Replace with actual tokenization
#     return token_ids

# Assuming text_input_index is the index of the text tensor in input_details
# text_input_tensor_index = input_details[text_input_index]['index']
# text_input_shape = input_details[text_input_index]['shape']
# max_seq_len = text_input_shape[1]  # Assuming shape is [batch, seq_len]

# texts = ["a photo of a cat", "a drawing of a dog"]
# your_tokenizer = None  # Load/initialize your specific tokenizer here
#
# for i, text_prompt in enumerate(texts):
#     tokenized_prompt = tokenize_text(text_prompt, your_tokenizer, max_seq_len)
#     # If the model processes one text at a time, set it here; otherwise batch texts (adapt accordingly).
#     interpreter.set_tensor(text_input_tensor_index, tokenized_prompt)
#     # Run inference for this text (or batch later)
```
**Important:** Identifying and correctly using the tokenizer is often the most challenging part of working with pre-trained multimodal models.
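For reference, the upstream MobileCLIP models are typically used with `open_clip`'s standard CLIP BPE tokenizer and a 77-token context. Assuming that also holds for these TFLite exports (verify the sequence length and dtype against `input_details`), tokenization could look like this:

```python
# Assumes these exports use the standard CLIP BPE tokenizer with a 77-token context,
# as in open_clip / the original MobileCLIP code. Verify against input_details.
import numpy as np
import open_clip  # pip install open_clip_torch

texts = ["a photo of a cat", "a drawing of a dog"]
token_ids = open_clip.tokenize(texts)           # torch tensor of shape [2, 77]
token_ids = token_ids.numpy().astype(np.int32)  # cast to the dtype reported by input_details

# Then feed one prompt at a time if the text tower expects a batch size of 1:
# interpreter.set_tensor(text_input_tensor_index, token_ids[0:1])
```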
**4. Run Inference**

Execute the model.

```python
interpreter.invoke()
```
**5. Get Embeddings**

Extract the output embeddings (vectors). There will typically be an image embedding and a text embedding.

```python
# Assuming image_output_index and text_output_index from output_details
# image_output_tensor_index = output_details[image_output_index]['index']
# text_output_tensor_index = output_details[text_output_index]['index']
#
# image_embedding = interpreter.get_tensor(image_output_tensor_index)
# text_embedding = interpreter.get_tensor(text_output_tensor_index)
#
# print("Image Embedding Shape:", image_embedding.shape)
# print("Text Embedding Shape:", text_embedding.shape)
```
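To build the `(num_labels, embed_dim)` matrix of label embeddings used in the next step, one approach is to run the text tower once per label prompt and stack the results. This sketch reuses the hypothetical variable names from the snippets above:

```python
# Hypothetical: build a (num_labels, embed_dim) matrix by running the text tower once per prompt.
# label_prompts = [f"a photo of a {name}" for name in ["cat", "dog", "car"]]
# label_embeddings = []
# for prompt in label_prompts:
#     tokens = tokenize_text(prompt, your_tokenizer, max_seq_len)
#     interpreter.set_tensor(text_input_tensor_index, tokens)
#     interpreter.invoke()
#     label_embeddings.append(interpreter.get_tensor(text_output_tensor_index)[0])
# text_embeddings_for_labels = np.stack(label_embeddings)  # Shape: (num_labels, embed_dim)
```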
**6. Compute Similarity (Example: Zero-Shot Classification)**

Calculate the cosine similarity between the image embedding and the text embeddings of your candidate labels.

```python
# from sklearn.metrics.pairwise import cosine_similarity
#
# # Example: image_embedding from step 5, and multiple text_embeddings for labels
# text_embeddings_for_labels = np.array([...])  # Shape: (num_labels, embed_dim)
# similarities = cosine_similarity(image_embedding, text_embeddings_for_labels)
# predicted_label_index = np.argmax(similarities)
# your_labels = ["label1", "label2", ...]
# predicted_label = your_labels[predicted_label_index]
# print(f"Predicted label: {predicted_label}")
```
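CLIP-style similarity assumes unit-length vectors, so L2-normalizing the embeddings first is a good idea. A plain NumPy alternative to scikit-learn, using the conventional CLIP logit scale of 100 (an assumption here, not a value read from these files):

```python
# L2-normalize, then turn cosine similarities into a probability distribution over labels.
# def l2_normalize(x, axis=-1):
#     return x / np.linalg.norm(x, axis=axis, keepdims=True)
#
# img = l2_normalize(image_embedding)             # Shape: (1, embed_dim)
# txt = l2_normalize(text_embeddings_for_labels)  # Shape: (num_labels, embed_dim)
# logits = 100.0 * img @ txt.T                    # Assumed CLIP-style logit scale
# probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
# print("Label probabilities:", probs)
```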
**Note on Model Signatures:** If your TFLite model has named signatures (e.g., one for getting image features, one for text features), you'd use them like this:

```python
# Get the runner for a specific signature.
# Ensure you know the correct signature names for your model.
# image_runner = interpreter.get_signature_runner('serving_default_image_tower_signature_name')  # Replace with actual name
# text_runner = interpreter.get_signature_runner('serving_default_text_tower_signature_name')    # Replace with actual name

# For image feature extraction
# output_image = image_runner(name_of_input_image_tensor=preprocessed_image)
# image_embedding = output_image['name_of_output_image_feature_tensor']

# For text feature extraction
# output_text = text_runner(name_of_input_text_tensor=tokenized_prompt)
# text_embedding = output_text['name_of_output_text_feature_tensor']
```
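You can list whatever named signatures (if any) were embedded at conversion time directly from the interpreter:

```python
# Returns a dict mapping signature names to their input/output tensor names (empty if none were embedded).
print(interpreter.get_signature_list())
```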
Check `input_details`, `output_details`, or any metadata associated with the TFLite model (e.g., using the Netron app) to understand its specific input/output structure and signature names.
Refer to the original MobileCLIP project documentation for more precise details on preprocessing, tokenization, and model architecture if available.
## Files in this Repository
- *.tflite: The TFLite model files.
- images/: Visualization images, including the size vs. performance chart shown above.
- README.md: This file, providing information about the models.
## Citation
If you use these models, please cite the original MobileCLIP paper: *MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training*, Pavan Kumar Anasosalu Vasu et al. (CVPR 2024).