anton96vice committed (verified)
Commit 285952a · 1 Parent(s): 045915a

Update TFLite models, all images, and README.md with detailed usage

.gitattributes CHANGED
@@ -33,3 +33,7 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ images/combined_size_vs_performance_xkcd.png filter=lfs diff=lfs merge=lfs -text
+ images/enhanced_vs_simple_params_vs_perf_xkcd.png filter=lfs diff=lfs merge=lfs -text
+ images/mobileclip_size_vs_performance_xkcd.png filter=lfs diff=lfs merge=lfs -text
+ images/photo_2025-03-25_16-05-36.jpg filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,202 @@
- ---
- license: apache-2.0
- ---
+ ---
+ license: apache-2.0
+ tags:
+ - mobileclip
+ - tflite
+ - computer-vision
+ - multimodal
+ - image-text
+ - zero-shot-image-classification
+ - image-retrieval
+ ---
+
+ # MobileCLIP TFLite Models (anton96vice/mobileclip2_tflite)
+
+ This repository hosts TFLite conversions of MobileCLIP models for efficient on-device inference.
+ MobileCLIP is a family of CLIP (Contrastive Language-Image Pre-training) models optimized for mobile and edge devices.
+
+ ## Available Models
+
+ The following TFLite models are included:
+ - mobileclip_b_datacompdr_first.tflite
+ - mobileclip_b_datacompdr_last.tflite
+ - mobileclip_b_datacompdr_lt_first.tflite
+ - mobileclip_b_datacompdr_lt_last.tflite
+ - mobileclip_s1_datacompdr_first.tflite
+ - mobileclip_s1_datacompdr_last.tflite
+ - mobileclip_s2_datacompdr_first.tflite
+ - mobileclip_s2_datacompdr_last.tflite
+
+ These models can be used for tasks such as:
+ * Zero-shot image classification
+ * Image-text retrieval
+ * Semantic search
+
+ ## Performance Visualization (Featured)
+
+ The chart below illustrates the size vs. performance trade-offs for some of these models:
+
+ ![Size vs Performance](images/combined_size_vs_performance_xkcd.png)
+
+ Additional visualizations and images can be found in the `images/` folder of this repository.
+
+ ## How to Use (TFLite Inference)
+
+ Using these TFLite models involves loading them with a TensorFlow Lite interpreter and running inference to obtain image and text embeddings. Below is a conceptual guide; adapt it to your platform (Python, Android, iOS, etc.) and to the exact input/output signatures of these models.
+
+ **Key Steps:**
+
+ 1. **Download a Model:**
+    Choose a `.tflite` model from this repository. You can download it using `curl` or directly from the "Files and versions" tab.
+ ```bash
+ curl -L -O https://huggingface.co/anton96vice/mobileclip2_tflite/resolve/main/mobileclip_s2_datacompdr_last.tflite
+ ```
+
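+    Alternatively, a minimal Python sketch, assuming the `huggingface_hub` package is installed (`pip install huggingface_hub`):
+ ```python
+ from huggingface_hub import hf_hub_download
+
+ # Downloads the file into the local Hugging Face cache and returns its path.
+ model_path = hf_hub_download(
+     repo_id="anton96vice/mobileclip2_tflite",
+     filename="mobileclip_s2_datacompdr_last.tflite",
+ )
+ print(model_path)
+ ```
+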
+ 2. **Initialize TFLite Interpreter:**
+    Load the chosen `.tflite` model file.
+ ```python
+ import tensorflow as tf
+ import numpy as np
+ # from PIL import Image  # For image preprocessing
+
+ # Load the TFLite model and allocate tensors.
+ interpreter = tf.lite.Interpreter(model_path="mobileclip_s2_datacompdr_last.tflite")  # Or your chosen model file
+ interpreter.allocate_tensors()
+
+ # Get input and output tensor details.
+ input_details = interpreter.get_input_details()
+ output_details = interpreter.get_output_details()
+
+ print("Input Details:", input_details)
+ print("Output Details:", output_details)
+ ```
+    * **Inspect `input_details` and `output_details` carefully!** They tell you the expected shape, data type (e.g., `float32`, `int32`), and names of the input and output tensors. MobileCLIP models typically have separate inputs/outputs for the image tower and the text tower.
+
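+    One hedged heuristic for telling the two inputs apart (an assumption to verify against the printed `input_details`, not a guarantee): the image input is usually the 4-D `float32` tensor and the text input the 2-D integer tensor.
+ ```python
+ # Heuristic only: verify against your model before relying on it.
+ image_input = next(d for d in input_details if len(d['shape']) == 4)
+ text_input = next(d for d in input_details if len(d['shape']) == 2)
+ print("Assumed image input:", image_input['name'], image_input['shape'], image_input['dtype'])
+ print("Assumed text input:", text_input['name'], text_input['shape'], text_input['dtype'])
+ ```
+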
+ 3. **Preprocess Inputs:**
+
+    * **Image Input:**
+      * **Load and Resize:** Load your image (e.g., using PIL/Pillow in Python). Resize it to the dimensions expected by the model's image tower (e.g., 224x224 or 256x256 – check the model documentation or `input_details`).
+      * **Normalize:** Normalize pixel values. A common choice for CLIP-style models is to scale to `[0, 1]` and then normalize with the ImageNet mean and standard deviation.
+      * **Convert to Tensor:** Convert the preprocessed image to a NumPy array with the correct shape (e.g., `[1, height, width, 3]`) and data type (`float32`).
+ ```python
+ # Example image preprocessing (conceptual - adapt to your model's specifics)
+ from PIL import Image
+
+ def preprocess_image(image_path, input_shape):
+     img = Image.open(image_path).convert('RGB')
+     # input_shape is assumed to be [1, height, width, 3]; PIL's resize expects (width, height).
+     img = img.resize((input_shape[2], input_shape[1]))
+     img_array = np.array(img, dtype=np.float32) / 255.0  # Scale to [0, 1]
+
+     # Example normalization (adjust if different for your model)
+     mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
+     std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
+     img_array = (img_array - mean) / std
+
+     return np.expand_dims(img_array, axis=0)  # Add batch dimension
+
+ # Assuming image_input_index is the position of the image tensor in input_details
+ # image_input_tensor_index = input_details[image_input_index]['index']
+ # image_input_shape = input_details[image_input_index]['shape']
+ # preprocessed_image = preprocess_image("your_image.jpg", image_input_shape)
+ # interpreter.set_tensor(image_input_tensor_index, preprocessed_image)
+ ```
+
+    * **Text Input:**
+      * **Tokenize:** Convert your text prompts into token IDs using the specific tokenizer associated with the MobileCLIP variant you are using. This is a **critical step**: the tokenizer vocabulary and tokenization process must match what the model was trained with. You may need the original model's tokenizer (e.g., from Hugging Face Transformers or the original MobileCLIP repository).
+      * **Pad/Truncate:** Ensure the sequence of token IDs has the fixed length expected by the model's text tower (check `input_details`). Pad shorter sequences or truncate longer ones.
+      * **Convert to Tensor:** Convert the token IDs to a NumPy array with the correct shape (e.g., `[1, sequence_length]`) and data type (`int32`).
+ ```python
+ # Example text preprocessing (conceptual - you'll need the correct tokenizer)
+
+ def tokenize_text(text, tokenizer, max_length):
+     # This is highly dependent on the tokenizer actually used by MobileCLIP.
+     # With a Hugging Face tokenizer, for example:
+     #   inputs = tokenizer(text, return_tensors="np", padding="max_length", truncation=True, max_length=max_length)
+     #   return inputs['input_ids'].astype(np.int32)
+     # A simple BPE or SentencePiece tokenizer would look different.
+     # Placeholder for conceptual demonstration - replace with actual tokenization:
+     return np.random.randint(0, 30000, size=(1, max_length), dtype=np.int32)
+
+ # Assuming text_input_index is the position of the text tensor in input_details
+ # text_input_tensor_index = input_details[text_input_index]['index']
+ # text_input_shape = input_details[text_input_index]['shape']
+ # max_seq_len = text_input_shape[1]  # Assuming shape is [batch, seq_len]
+ # texts = ["a photo of a cat", "a drawing of a dog"]
+ # your_tokenizer = None  # Load/initialize your specific tokenizer here
+ #
+ # for text_prompt in texts:
+ #     tokenized_prompt = tokenize_text(text_prompt, your_tokenizer, max_seq_len)
+ #     interpreter.set_tensor(text_input_tensor_index, tokenized_prompt)
+ #     # Run inference for this text here (or batch the texts if the model supports it)
+ ```
+    **Important:** Identifying and correctly using the tokenizer is often the most challenging part for pre-trained multimodal models.
+
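+    As one concrete, hedged option: the sketch below assumes these text towers expect standard CLIP BPE token IDs with a 77-token context (the tokenizer used by the upstream MobileCLIP/open_clip code). Verify the expected sequence length and vocabulary against `input_details` before relying on it.
+ ```python
+ # pip install open_clip_torch  (assumption: standard CLIP BPE tokenizer, 77-token context)
+ import numpy as np
+ import open_clip
+
+ clip_tokenizer = open_clip.get_tokenizer("ViT-B-16")  # Standard CLIP BPE tokenizer
+ token_ids = clip_tokenizer(["a photo of a cat"]).numpy().astype(np.int32)  # Shape: (1, 77)
+ print(token_ids.shape)
+ ```
+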
+ 4. **Run Inference:**
+    Execute the model.
+ ```python
+ interpreter.invoke()
+ ```
+
+ 5. **Get Embeddings:**
+    Extract the output embeddings (vectors). There will typically be an image embedding and a text embedding.
+ ```python
+ # Assuming image_output_index and text_output_index from output_details
+ # image_output_tensor_index = output_details[image_output_index]['index']
+ # text_output_tensor_index = output_details[text_output_index]['index']
+
+ # image_embedding = interpreter.get_tensor(image_output_tensor_index)
+ # text_embedding = interpreter.get_tensor(text_output_tensor_index)
+
+ # print("Image Embedding Shape:", image_embedding.shape)
+ # print("Text Embedding Shape:", text_embedding.shape)
+ ```
+
+ 6. **Compute Similarity (Example: Zero-Shot Classification):**
+    Calculate the cosine similarity between the image embedding and the text embeddings of your candidate labels.
+ ```python
+ # from sklearn.metrics.pairwise import cosine_similarity
+
+ # Example: image_embedding from step 5, and text embeddings for the candidate labels
+ # text_embeddings_for_labels = np.array([...])  # Shape: (num_labels, embed_dim)
+ # similarities = cosine_similarity(image_embedding, text_embeddings_for_labels)
+ # predicted_label_index = np.argmax(similarities)
+ # your_labels = ["label1", "label2", ...]
+ # predicted_label = your_labels[predicted_label_index]
+ # print(f"Predicted label: {predicted_label}")
+ ```
+
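+    If you prefer to avoid the scikit-learn dependency, here is a minimal NumPy-only sketch, assuming `image_embedding` has shape `(1, embed_dim)` and `text_embeddings_for_labels` has shape `(num_labels, embed_dim)`:
+ ```python
+ import numpy as np
+
+ def cosine_scores(image_embedding, text_embeddings):
+     # L2-normalize both sides, then take dot products (equivalent to cosine similarity).
+     img = image_embedding / np.linalg.norm(image_embedding, axis=-1, keepdims=True)
+     txt = text_embeddings / np.linalg.norm(text_embeddings, axis=-1, keepdims=True)
+     return img @ txt.T  # Shape: (1, num_labels)
+
+ # labels = ["a photo of a cat", "a drawing of a dog"]
+ # scores = cosine_scores(image_embedding, text_embeddings_for_labels)
+ # print("Predicted label:", labels[int(np.argmax(scores))])
+ ```
+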
+ **Note on Model Signatures:**
+ If your TFLite model has named signatures (e.g., one for image features, one for text features), you would use them like this:
+ ```python
+ # Get the runner for a specific signature.
+ # Make sure you know the correct signature names for your model.
+ # image_runner = interpreter.get_signature_runner('serving_default_image_tower_signature_name')  # Replace with the actual name
+ # text_runner = interpreter.get_signature_runner('serving_default_text_tower_signature_name')  # Replace with the actual name
+
+ # For image feature extraction
+ # output_image = image_runner(name_of_input_image_tensor=preprocessed_image)
+ # image_embedding = output_image['name_of_output_image_feature_tensor']
+
+ # For text feature extraction
+ # output_text = text_runner(name_of_input_text_tensor=tokenized_prompt)
+ # text_embedding = output_text['name_of_output_text_feature_tensor']
+ ```
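+
+ You can list whatever signatures (if any) a given `.tflite` file actually exposes using the standard TF Lite Python API:
+ ```python
+ # Prints a dict mapping signature names to their input/output tensor names
+ # (an empty dict means the model exposes no named signatures).
+ print(interpreter.get_signature_list())
+ ```
+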
+ Check `input_details`, `output_details`, or any metadata associated with the TFLite model (e.g., using the Netron app) to understand its specific input/output structure and signature names.
+ Refer to the original MobileCLIP project documentation, if available, for more precise details on preprocessing, tokenization, and model architecture.
+
+ ## Files in this Repository
+
+ * **\*.tflite**: The TFLite model files.
+ * **images/**: Visualization images, including:
+   - combined_size_vs_performance_xkcd.png
+   - enhanced_vs_simple_params_vs_perf_xkcd.png
+   - mobileclip_size_vs_performance_xkcd.png
+   - photo_2025-03-25_16-05-36.jpg
+ * **README.md**: This file, providing information about the models.
+
+ ## Citation
+
+ If you use these models, please cite the original MobileCLIP paper:
+
+ [MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training](https://arxiv.org/abs/2311.17045) by Pavan Kumar Anasosalu Vasu et al.
+
images/combined_size_vs_performance_xkcd.png ADDED

Git LFS Details

  • SHA256: 78d5b8aad41135712565633e7b7f4888f626fc8e5d031adebb59db1265c4e851
  • Pointer size: 131 Bytes
  • Size of remote file: 294 kB
images/enhanced_vs_simple_params_vs_perf_xkcd.png ADDED

Git LFS Details

  • SHA256: 7942c9a9df06f7ab7c452c8c4f91cf0cbaa900467d1956153a0f6b933c081776
  • Pointer size: 131 Bytes
  • Size of remote file: 287 kB
images/mobileclip_size_vs_performance_xkcd.png ADDED

Git LFS Details

  • SHA256: 1576bd896d73a544660af5300d1e8ee999933a6003179dacad5c0d8946683688
  • Pointer size: 131 Bytes
  • Size of remote file: 296 kB
images/photo_2025-03-25_16-05-36.jpg ADDED

Git LFS Details

  • SHA256: d37db60cfd67a1fed0d5554a358c55130e8e09b3fca0fe39f7dc8938627bbc1d
  • Pointer size: 131 Bytes
  • Size of remote file: 127 kB
mobileclip_b_datacompdr_first.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d15add41dc7edf4f7266b8446b12bfb2c9af0e820e1c5bd835b7c8d7fc9cd7b5
+ size 599415024
mobileclip_b_datacompdr_last.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:5a85629df5c1c0c2809a00ec242eabd43e00b0a66018311973ea786b547ce8f5
+ size 599415120
mobileclip_b_datacompdr_lt_first.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:1183d2d4044578ccf5cf55db3269fd5ef180016b8c184850de7a94b7927b0ddc
+ size 599415024
mobileclip_b_datacompdr_lt_last.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:51b011600061d4ed59f33bcd5f0d016991be01eaaa849e48c62b6c6ae406cf48
+ size 599415120
mobileclip_s1_datacompdr_first.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:bd37c62e6721e10354102e29c69f0930919eaa4a372856652fd9891396a58b4f
+ size 339921976
mobileclip_s1_datacompdr_last.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:957e3fb2c031a360f90e5c987c23b5cadce1e0abf0093fd62e97e2cf1c9462bd
+ size 339921976
mobileclip_s2_datacompdr_first.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:eba25e2cb57ae4fac7ddb5436d1a53245981f11055ee80edfbbd41a2eae539e9
+ size 396881784
mobileclip_s2_datacompdr_last.tflite ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:d9dc97573d4c190e722e3bc95795a6ca866b36d8ea045be5aea3db586728baa2
+ size 396881784