anton96vice committed (verified)
Commit 285952a · 1 Parent(s): 045915a

Update TFLite models, all images, and README.md with detailed usage

.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ images/combined_size_vs_performance_xkcd.png filter=lfs diff=lfs merge=lfs -text
+ images/enhanced_vs_simple_params_vs_perf_xkcd.png filter=lfs diff=lfs merge=lfs -text
+ images/mobileclip_size_vs_performance_xkcd.png filter=lfs diff=lfs merge=lfs -text
+ images/photo_2025-03-25_16-05-36.jpg filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,202 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ tags:
+ - mobileclip
+ - tflite
+ - computer-vision
+ - multimodal
+ - image-text
+ - zero-shot-image-classification
+ - image-retrieval
+ ---
+
+ # MobileCLIP TFLite Models (anton96vice/mobileclip2_tflite)
+
+ This repository hosts TFLite conversions of MobileCLIP models for efficient on-device inference.
+ MobileCLIP is a family of CLIP (Contrastive Language-Image Pre-training) models optimized for mobile and edge devices.
+
+ ## Available Models
+
+ The following TFLite models are included:
+ - mobileclip_b_datacompdr_first.tflite
+ - mobileclip_b_datacompdr_last.tflite
+ - mobileclip_b_datacompdr_lt_first.tflite
+ - mobileclip_b_datacompdr_lt_last.tflite
+ - mobileclip_s1_datacompdr_first.tflite
+ - mobileclip_s1_datacompdr_last.tflite
+ - mobileclip_s2_datacompdr_first.tflite
+ - mobileclip_s2_datacompdr_last.tflite
+
+ These models can be used for tasks such as:
+ * Zero-shot image classification
+ * Image-text retrieval
+ * Semantic search
+
+ ## Performance Visualization (Featured)
+
+ The chart below illustrates the size vs. performance trade-offs for some of these models:
+
+ ![Size vs Performance](images/combined_size_vs_performance_xkcd.png)
+
+ Additional visualizations and images can be found in the `images/` folder of this repository.
+
+ ## How to Use (TFLite Inference)
+
+ Using these TFLite models involves loading them with a TensorFlow Lite interpreter and running inference to obtain image and text embeddings. Below is a conceptual guide; adapt it to your platform (Python, Android, iOS, etc.) and to the exact input/output signatures of these models.
+
+ **Key Steps:**
+
+ 1. **Download a Model:**
+    Choose a `.tflite` model from this repository. You can download it using `curl` or directly from the "Files and versions" tab.
+ ```bash
+ curl -L -O https://huggingface.co/anton96vice/mobileclip2_tflite/resolve/main/mobileclip_s2_datacompdr_last.tflite
+ ```
+
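+    Alternatively, a minimal Python sketch, assuming the `huggingface_hub` package is installed (`pip install huggingface_hub`):
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ # Downloads the file into the local Hugging Face cache and returns its path.
+ model_path = hf_hub_download(
+     repo_id="anton96vice/mobileclip2_tflite",
+     filename="mobileclip_s2_datacompdr_last.tflite",
+ )
+ print(model_path)
+ ```
+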
+ 2. **Initialize TFLite Interpreter:**
+    Load the chosen `.tflite` model file.
+ ```python
+ import tensorflow as tf
+ import numpy as np
+ # from PIL import Image  # For image preprocessing
+
+ # Load the TFLite model and allocate tensors.
+ interpreter = tf.lite.Interpreter(model_path="mobileclip_s2_datacompdr_last.tflite")  # Or your chosen model file
+ interpreter.allocate_tensors()
+
+ # Get input and output tensor details.
+ input_details = interpreter.get_input_details()
+ output_details = interpreter.get_output_details()
+
+ print("Input Details:", input_details)
+ print("Output Details:", output_details)
+ ```
+    * **Inspect `input_details` and `output_details` carefully!** They tell you the expected shape, data type (e.g., `float32`, `int32`), and names of the input and output tensors. MobileCLIP models typically have separate inputs/outputs for the image tower and the text tower.
+
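+    One hedged heuristic for telling the two inputs apart (an assumption to verify against the printed `input_details`, not a guarantee): the image input is usually the 4-D `float32` tensor and the text input the 2-D integer tensor.
+ ```python
+ # Heuristic only: verify against your model before relying on it.
+ image_input = next(d for d in input_details if len(d['shape']) == 4)
+ text_input = next(d for d in input_details if len(d['shape']) == 2)
+ print("Assumed image input:", image_input['name'], image_input['shape'], image_input['dtype'])
+ print("Assumed text input:", text_input['name'], text_input['shape'], text_input['dtype'])
+ ```
+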
+ 3. **Preprocess Inputs:**
+
+    * **Image Input:**
+      * **Load and Resize:** Load your image (e.g., using PIL/Pillow in Python). Resize it to the dimensions expected by the model's image tower (e.g., 224x224 or 256x256 – check the model documentation or `input_details`).
+      * **Normalize:** Normalize pixel values. A common choice for CLIP-style models is to scale to `[0, 1]` and then normalize with the ImageNet mean and standard deviation.
+      * **Convert to Tensor:** Convert the preprocessed image to a NumPy array with the correct shape (e.g., `[1, height, width, 3]`) and data type (`float32`).
+ ```python
+ # Example image preprocessing (conceptual - adapt to your model's specifics)
+ from PIL import Image
+
+ def preprocess_image(image_path, input_shape):
+     img = Image.open(image_path).convert('RGB')
+     # input_shape is assumed to be [1, height, width, 3]; PIL's resize expects (width, height).
+     img = img.resize((input_shape[2], input_shape[1]))
+     img_array = np.array(img, dtype=np.float32) / 255.0  # Scale to [0, 1]
+
+     # Example normalization (adjust if different for your model)
+     mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
+     std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
+     img_array = (img_array - mean) / std
+
+     return np.expand_dims(img_array, axis=0)  # Add batch dimension
+
+ # Assuming image_input_index is the position of the image tensor in input_details
+ # image_input_tensor_index = input_details[image_input_index]['index']
+ # image_input_shape = input_details[image_input_index]['shape']
+ # preprocessed_image = preprocess_image("your_image.jpg", image_input_shape)
+ # interpreter.set_tensor(image_input_tensor_index, preprocessed_image)
+ ```
+
+    * **Text Input:**
+      * **Tokenize:** Convert your text prompts into token IDs using the specific tokenizer associated with the MobileCLIP variant you are using. This is a **critical step**: the tokenizer vocabulary and tokenization process must match what the model was trained with. You may need the original model's tokenizer (e.g., from Hugging Face Transformers or the original MobileCLIP repository).
+      * **Pad/Truncate:** Ensure the sequence of token IDs has the fixed length expected by the model's text tower (check `input_details`). Pad shorter sequences or truncate longer ones.
+      * **Convert to Tensor:** Convert the token IDs to a NumPy array with the correct shape (e.g., `[1, sequence_length]`) and data type (`int32`).
+ ```python
+ # Example text preprocessing (conceptual - you'll need the correct tokenizer)
+
+ def tokenize_text(text, tokenizer, max_length):
+     # This is highly dependent on the tokenizer actually used by MobileCLIP.
+     # With a Hugging Face tokenizer, for example:
+     #   inputs = tokenizer(text, return_tensors="np", padding="max_length", truncation=True, max_length=max_length)
+     #   return inputs['input_ids'].astype(np.int32)
+     # A simple BPE or SentencePiece tokenizer would look different.
+     # Placeholder for conceptual demonstration - replace with actual tokenization:
+     return np.random.randint(0, 30000, size=(1, max_length), dtype=np.int32)
+
+ # Assuming text_input_index is the position of the text tensor in input_details
+ # text_input_tensor_index = input_details[text_input_index]['index']
+ # text_input_shape = input_details[text_input_index]['shape']
+ # max_seq_len = text_input_shape[1]  # Assuming shape is [batch, seq_len]
+ # texts = ["a photo of a cat", "a drawing of a dog"]
+ # your_tokenizer = None  # Load/initialize your specific tokenizer here
+ #
+ # for text_prompt in texts:
+ #     tokenized_prompt = tokenize_text(text_prompt, your_tokenizer, max_seq_len)
+ #     interpreter.set_tensor(text_input_tensor_index, tokenized_prompt)
+ #     # Run inference for this text here (or batch the texts if the model supports it)
+ ```
+    **Important:** Identifying and correctly using the tokenizer is often the most challenging part for pre-trained multimodal models.
+
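+    As one concrete, hedged option: the sketch below assumes these text towers expect standard CLIP BPE token IDs with a 77-token context (the tokenizer used by the upstream MobileCLIP/open_clip code). Verify the expected sequence length and vocabulary against `input_details` before relying on it.
+ ```python
+ # pip install open_clip_torch  (assumption: standard CLIP BPE tokenizer, 77-token context)
+ import numpy as np
+ import open_clip
+
+ clip_tokenizer = open_clip.get_tokenizer("ViT-B-16")  # Standard CLIP BPE tokenizer
+ token_ids = clip_tokenizer(["a photo of a cat"]).numpy().astype(np.int32)  # Shape: (1, 77)
+ print(token_ids.shape)
+ ```
+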
+ 4. **Run Inference:**
+    Execute the model.
+ ```python
+ interpreter.invoke()
+ ```
+
+ 5. **Get Embeddings:**
+    Extract the output embeddings (vectors). There will typically be an image embedding and a text embedding.
+ ```python
+ # Assuming image_output_index and text_output_index from output_details
+ # image_output_tensor_index = output_details[image_output_index]['index']
+ # text_output_tensor_index = output_details[text_output_index]['index']
+
+ # image_embedding = interpreter.get_tensor(image_output_tensor_index)
+ # text_embedding = interpreter.get_tensor(text_output_tensor_index)
+
+ # print("Image Embedding Shape:", image_embedding.shape)
+ # print("Text Embedding Shape:", text_embedding.shape)
+ ```
+
+ 6. **Compute Similarity (Example: Zero-Shot Classification):**
+    Calculate the cosine similarity between the image embedding and the text embeddings of your candidate labels.
+ ```python
+ # from sklearn.metrics.pairwise import cosine_similarity
+
+ # Example: image_embedding from step 5, and text embeddings for the candidate labels
+ # text_embeddings_for_labels = np.array([...])  # Shape: (num_labels, embed_dim)
+ # similarities = cosine_similarity(image_embedding, text_embeddings_for_labels)
+ # predicted_label_index = np.argmax(similarities)
+ # your_labels = ["label1", "label2", ...]
+ # predicted_label = your_labels[predicted_label_index]
+ # print(f"Predicted label: {predicted_label}")
+ ```
+
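+    If you prefer to avoid the scikit-learn dependency, here is a minimal NumPy-only sketch, assuming `image_embedding` has shape `(1, embed_dim)` and `text_embeddings_for_labels` has shape `(num_labels, embed_dim)`:
+ ```python
+ import numpy as np
+
+ def cosine_scores(image_embedding, text_embeddings):
+     # L2-normalize both sides, then take dot products (equivalent to cosine similarity).
+     img = image_embedding / np.linalg.norm(image_embedding, axis=-1, keepdims=True)
+     txt = text_embeddings / np.linalg.norm(text_embeddings, axis=-1, keepdims=True)
+     return img @ txt.T  # Shape: (1, num_labels)
+
+ # labels = ["a photo of a cat", "a drawing of a dog"]
+ # scores = cosine_scores(image_embedding, text_embeddings_for_labels)
+ # print("Predicted label:", labels[int(np.argmax(scores))])
+ ```
+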
+ **Note on Model Signatures:**
+ If your TFLite model has named signatures (e.g., one for image features, one for text features), you would use them like this:
+ ```python
+ # Get the runner for a specific signature.
+ # Make sure you know the correct signature names for your model.
+ # image_runner = interpreter.get_signature_runner('serving_default_image_tower_signature_name')  # Replace with the actual name
+ # text_runner = interpreter.get_signature_runner('serving_default_text_tower_signature_name')  # Replace with the actual name
+
+ # For image feature extraction
+ # output_image = image_runner(name_of_input_image_tensor=preprocessed_image)
+ # image_embedding = output_image['name_of_output_image_feature_tensor']
+
+ # For text feature extraction
+ # output_text = text_runner(name_of_input_text_tensor=tokenized_prompt)
+ # text_embedding = output_text['name_of_output_text_feature_tensor']
+ ```
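+
+ You can list whatever signatures (if any) a given `.tflite` file actually exposes using the standard TF Lite Python API:
+ ```python
+ # Prints a dict mapping signature names to their input/output tensor names
+ # (an empty dict means the model exposes no named signatures).
+ print(interpreter.get_signature_list())
+ ```
+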
+ Check `input_details`, `output_details`, or any metadata associated with the TFLite model (e.g., using the Netron app) to understand its specific input/output structure and signature names.
+ Refer to the original MobileCLIP project documentation, if available, for more precise details on preprocessing, tokenization, and model architecture.
+
+ ## Files in this Repository
+
+ * **\*.tflite**: The TFLite model files.
+ * **images/**: Visualization images, including:
+   - combined_size_vs_performance_xkcd.png
+   - enhanced_vs_simple_params_vs_perf_xkcd.png
+   - mobileclip_size_vs_performance_xkcd.png
+   - photo_2025-03-25_16-05-36.jpg
+ * **README.md**: This file, providing information about the models.
+
+ ## Citation
+
+ If you use these models, please cite the original MobileCLIP paper:
+
+ [MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training](https://arxiv.org/abs/2311.17045) by Pavan Kumar Anasosalu Vasu et al.
+
images/combined_size_vs_performance_xkcd.png ADDED

Git LFS Details

  • SHA256: 78d5b8aad41135712565633e7b7f4888f626fc8e5d031adebb59db1265c4e851
  • Pointer size: 131 Bytes
  • Size of remote file: 294 kB
images/enhanced_vs_simple_params_vs_perf_xkcd.png ADDED

Git LFS Details

  • SHA256: 7942c9a9df06f7ab7c452c8c4f91cf0cbaa900467d1956153a0f6b933c081776
  • Pointer size: 131 Bytes
  • Size of remote file: 287 kB
images/mobileclip_size_vs_performance_xkcd.png ADDED

Git LFS Details

  • SHA256: 1576bd896d73a544660af5300d1e8ee999933a6003179dacad5c0d8946683688
  • Pointer size: 131 Bytes
  • Size of remote file: 296 kB
images/photo_2025-03-25_16-05-36.jpg ADDED

Git LFS Details

  • SHA256: d37db60cfd67a1fed0d5554a358c55130e8e09b3fca0fe39f7dc8938627bbc1d
  • Pointer size: 131 Bytes
  • Size of remote file: 127 kB
mobileclip_b_datacompdr_first.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d15add41dc7edf4f7266b8446b12bfb2c9af0e820e1c5bd835b7c8d7fc9cd7b5
+ size 599415024
mobileclip_b_datacompdr_last.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5a85629df5c1c0c2809a00ec242eabd43e00b0a66018311973ea786b547ce8f5
+ size 599415120
mobileclip_b_datacompdr_lt_first.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1183d2d4044578ccf5cf55db3269fd5ef180016b8c184850de7a94b7927b0ddc
+ size 599415024
mobileclip_b_datacompdr_lt_last.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:51b011600061d4ed59f33bcd5f0d016991be01eaaa849e48c62b6c6ae406cf48
+ size 599415120
mobileclip_s1_datacompdr_first.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bd37c62e6721e10354102e29c69f0930919eaa4a372856652fd9891396a58b4f
+ size 339921976
mobileclip_s1_datacompdr_last.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:957e3fb2c031a360f90e5c987c23b5cadce1e0abf0093fd62e97e2cf1c9462bd
+ size 339921976
mobileclip_s2_datacompdr_first.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eba25e2cb57ae4fac7ddb5436d1a53245981f11055ee80edfbbd41a2eae539e9
+ size 396881784
mobileclip_s2_datacompdr_last.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d9dc97573d4c190e722e3bc95795a6ca866b36d8ea045be5aea3db586728baa2
+ size 396881784