ashvardanian committed on
Commit
5582c4e
1 Parent(s): aa6eae0

Update README.md

Files changed (1): README.md +40 -94
README.md CHANGED
@@ -30,55 +30,51 @@ datasets:
  - sbu_captions
  - visual_genome
  - ChristophSchuhmann/MS_COCO_2017_URL_TEXT
  ---

  <h1 align="center">UForm</h1>
  <h3 align="center">
- Multi-Modal Inference Library<br/>
- For Semantic Search Applications<br/>
  </h3>

  ---

- UForm is a Multi-Modal Inference package, designed to encode Multi-Lingual Texts, Images, and, soon, Audio, Video, and Documents, into a shared vector space!

- This is the model card of the __Multilingual model__ (21 languages) with:

- * 12-layer BERT (8 layers for unimodal encoding and the rest for multimodal encoding)
- * ViT-B/16 (image resolution is 224x224)
-
- The model was trained on a balanced multilingual dataset.
-
- If you need an English model, check [this](https://huggingface.co/unum-cloud/uform-vl-english).

  ## Evaluation

  For all evaluations, the multimodal part was used unless otherwise stated.

- **Monolingual**

  | Dataset | Recall@1 | Recall@5 | Recall@10 |
  | :-------- | ------: | --------: | --------: |
  | Zero-Shot Flickr | 0.558 | 0.813 | 0.874 |
- | MS-COCO (train split was in training data) | 0.401 | 0.680 | 0.781 |
-
- **Multilingual**

- [XTD-10](https://github.com/adobe-research/Cross-lingual-Test-Dataset-XTD10)

- Metric is recall@10

  | English | German | Spanish | French | Italian | Russian | Japanese | Korean | Turkish | Chinese | Polish |
  | -------: | -------: | -------: | -------: | -------: | -------: | -------: | -------: | -------: | -------: | ------:|
  | 96.1 | 93.5 | 95.7 | 94.1 | 94.4 | 90.4 | 90.2 | 91.3 | 95.2 | 93.8 | 95.8 |

-
- [COCO-SM](https://github.com/kimihailv/coco-sm/tree/main)
-
- For this evaluation only the unimodal part was used.
-
- Recall

  | Target Language | OpenCLIP @ 1 | UForm @ 1 | OpenCLIP @ 5 | UForm @ 5 | OpenCLIP @ 10 | UForm @ 10 | Speakers |
  | :-------------------- | -----------: | ------------: | -----------: | -------------:| ------------: | --------------:| -------: |
@@ -109,17 +105,18 @@ Recall
  | Microsoft Translator | 27.2±6.4 | **31.4±3.6** | 50.8±9.8 | **57.7±4.7** | 61.4±10.6 | **68.9±4.6** | - |
  | Meta NLLB | 24.9±6.7 | **32.4±3.5** | 47.5±10.3 | **58.9±4.5** | 58.2±11.2 | **70.2±4.3** | - |

- NDCG@20

  | | Arabic | Armenian | Chinese | French | German | Hebrew | Hindi | Indonesian | Italian | Japanese | Korean | Persian | Polish | Portuguese | Russian | Spanish | Thai | Turkish | Ukrainian | Vietnamese | Mean (all) | Mean (Google Translate) | Mean (Microsoft Translator) | Mean (NLLB) |
  | :------------ | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: |
  | OpenCLIP NDCG | 0.639 | 0.204 | 0.731 | 0.823 | 0.806 | 0.657 | 0.616 | 0.733 | 0.811 | 0.737 | 0.686 | 0.667 | 0.764 | 0.832 | 0.777 | 0.849 | 0.606 | 0.701 | 0.704 | 0.697 | 0.716 ± 0.149 | 0.732 ± 0.145 | 0.730 ± 0.149 | 0.686 ± 0.158 |
  | UForm NDCG | 0.868 | 0.691 | 0.880 | 0.932 | 0.927 | 0.791 | 0.879 | 0.870 | 0.930 | 0.885 | 0.869 | 0.831 | 0.897 | 0.897 | 0.906 | 0.939 | 0.822 | 0.898 | 0.851 | 0.818 | 0.875 ± 0.064 | 0.869 ± 0.063 | 0.869 ± 0.066 | 0.888 ± 0.064 |

  ## Installation

  ```bash
- pip install uform[torch]
  ```

  ## Usage
@@ -127,82 +124,31 @@ pip install uform[torch]
  To load the model:

  ```python
- import uform
-
- model, processor = uform.get_model('unum-cloud/uform-vl-multilingual-v2')
- ```
-
- To encode data:

- ```python
  from PIL import Image

- text = 'a small red panda in a zoo'
- image = Image.open('red_panda.jpg')
-
- image_data = processor.preprocess_image(image)
- text_data = processor.preprocess_text(text)

- image_features, image_embedding = model.encode_image(image_data, return_features=True)
- text_features, text_embedding = model.encode_text(text_data, return_features=True)
- joint_embedding = model.encode_multimodal(image=image_data, text=text_data)
  ```

- To get features:

  ```python
- image_features, image_embedding = model.encode_image(image_data, return_features=True)
- text_features, text_embedding = model.encode_text(text_data, return_features=True)
  ```
-
- These features can later be used to produce joint multimodal encodings faster, as the first layers of the transformer can be skipped:
-
- ```python
- joint_embedding = model.encode_multimodal(
-     image_features=image_features,
-     text_features=text_features,
-     attention_mask=text_data['attention_mask']
- )
- ```
-
- There are two options to calculate semantic compatibility between an image and a text: [Cosine Similarity](#cosine-similarity) and [Matching Score](#matching-score).
-
- ### Cosine Similarity
-
- ```python
- import torch.nn.functional as F
-
- similarity = F.cosine_similarity(image_embedding, text_embedding)
- ```
-
- The `similarity` will belong to the `[-1, 1]` range, `1` meaning an absolute match.
-
- __Pros__:
-
- - Computationally cheap.
- - Only unimodal embeddings are required, and unimodal encoding is faster than joint encoding.
- - Suitable for retrieval in large collections.
-
- __Cons__:
-
- - Takes into account only coarse-grained features.
-
- ### Matching Score
-
- Unlike cosine similarity, unimodal embeddings are not enough.
- A joint embedding is needed, and the resulting `score` will belong to the `[0, 1]` range, `1` meaning an absolute match.
-
- ```python
- score = model.get_matching_scores(joint_embedding)
- ```
-
- __Pros__:
-
- - Joint embedding captures fine-grained features.
- - Suitable for re-ranking, i.e. sorting retrieval results.
-
- __Cons__:
-
- - Resource-intensive.
- - Not suitable for retrieval in large collections.
  - sbu_captions
  - visual_genome
  - ChristophSchuhmann/MS_COCO_2017_URL_TEXT
+ - Ziyang/yfcc15m
  ---

  <h1 align="center">UForm</h1>
  <h3 align="center">
+ Pocket-Sized Multimodal AI<br/>
+ For Content Understanding and Generation<br/>
+ In Python, JavaScript, and Swift<br/>
  </h3>

  ---

+ The `uform3-image-text-multilingual-base` UForm model is a tiny vision and multilingual language encoder, covering __21 languages__, mapping them into a shared vector space.
+ This model produces up to __256-dimensional embeddings__ and is made of:

+ * Text encoder: 12-layer BERT for up to 50 input tokens.
+ * Visual encoder: ViT-B/16 for images of 224 x 224 resolution.

+ Unlike most CLIP-like multimodal models, this model shares 4 layers between the text and visual encoders to allow for more data- and parameter-efficient training.
+ Also unlike most models, UForm provides checkpoints compatible with PyTorch, ONNX, and CoreML, covering the absolute majority of AI-capable devices, with pre-quantized weights and inference code.
+ If you need a larger or more accurate model, check our [HuggingFace Hub](https://huggingface.co/unum-cloud/).
+ For more details on running the model, check out the [UForm GitHub repository](https://github.com/unum-cloud/uform/).

  ## Evaluation

  For all evaluations, the multimodal part was used unless otherwise stated.

+ ### Monolingual

  | Dataset | Recall@1 | Recall@5 | Recall@10 |
  | :-------- | ------: | --------: | --------: |
  | Zero-Shot Flickr | 0.558 | 0.813 | 0.874 |
+ | MS-COCO ¹ | 0.401 | 0.680 | 0.781 |

+ > ¹ It's important to note that the MS-COCO train split was present in the training data.

+ ### Multilingual

+ Recall@10 on the [XTD-10](https://github.com/adobe-research/Cross-lingual-Test-Dataset-XTD10) dataset:

  | English | German | Spanish | French | Italian | Russian | Japanese | Korean | Turkish | Chinese | Polish |
  | -------: | -------: | -------: | -------: | -------: | -------: | -------: | -------: | -------: | -------: | ------:|
  | 96.1 | 93.5 | 95.7 | 94.1 | 94.4 | 90.4 | 90.2 | 91.3 | 95.2 | 93.8 | 95.8 |

+ Recall@1, Recall@5, and Recall@10 on the [COCO-SM](https://github.com/kimihailv/coco-sm/tree/main) dataset:

  | Target Language | OpenCLIP @ 1 | UForm @ 1 | OpenCLIP @ 5 | UForm @ 5 | OpenCLIP @ 10 | UForm @ 10 | Speakers |
  | :-------------------- | -----------: | ------------: | -----------: | -------------:| ------------: | --------------:| -------: |
  ...
  | Microsoft Translator | 27.2±6.4 | **31.4±3.6** | 50.8±9.8 | **57.7±4.7** | 61.4±10.6 | **68.9±4.6** | - |
  | Meta NLLB | 24.9±6.7 | **32.4±3.5** | 47.5±10.3 | **58.9±4.5** | 58.2±11.2 | **70.2±4.3** | - |

+ For a deeper comparison of output rankings, check the following table of Normalized Discounted Cumulative Gains for the first 20 results (NDCG@20):

  | | Arabic | Armenian | Chinese | French | German | Hebrew | Hindi | Indonesian | Italian | Japanese | Korean | Persian | Polish | Portuguese | Russian | Spanish | Thai | Turkish | Ukrainian | Vietnamese | Mean (all) | Mean (Google Translate) | Mean (Microsoft Translator) | Mean (NLLB) |
  | :------------ | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: | ----: |
  | OpenCLIP NDCG | 0.639 | 0.204 | 0.731 | 0.823 | 0.806 | 0.657 | 0.616 | 0.733 | 0.811 | 0.737 | 0.686 | 0.667 | 0.764 | 0.832 | 0.777 | 0.849 | 0.606 | 0.701 | 0.704 | 0.697 | 0.716 ± 0.149 | 0.732 ± 0.145 | 0.730 ± 0.149 | 0.686 ± 0.158 |
  | UForm NDCG | 0.868 | 0.691 | 0.880 | 0.932 | 0.927 | 0.791 | 0.879 | 0.870 | 0.930 | 0.885 | 0.869 | 0.831 | 0.897 | 0.897 | 0.906 | 0.939 | 0.822 | 0.898 | 0.851 | 0.818 | 0.875 ± 0.064 | 0.869 ± 0.063 | 0.869 ± 0.066 | 0.888 ± 0.064 |

+
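The Recall@K numbers above follow the standard cross-modal retrieval definition: the fraction of queries whose paired item appears among the K nearest neighbors by embedding similarity. A minimal sketch of that computation, assuming unit-normalized PyTorch embedding matrices with matching row order (illustrative only, not the evaluation harness behind these tables):

```python
import torch

def recall_at_k(text_embs: torch.Tensor, image_embs: torch.Tensor, k: int) -> float:
    # Rows are paired: text_embs[i] describes image_embs[i].
    scores = text_embs @ image_embs.T                    # cosine similarity for unit vectors
    top_k = scores.topk(k, dim=1).indices                # k best image indices per text query
    targets = torch.arange(len(text_embs)).unsqueeze(1)  # correct image index per row
    return (top_k == targets).any(dim=1).float().mean().item()
```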
  ## Installation

  ```bash
+ pip install "uform[torch,onnx]"
  ```

  ## Usage
 
  To load the model:

  ```python
+ from uform import get_model, Modality

+ import requests
+ from io import BytesIO
  from PIL import Image

+ model_name = 'unum-cloud/uform3-image-text-multilingual-base'
+ modalities = [Modality.TEXT_ENCODER, Modality.IMAGE_ENCODER]
+ processors, models = get_model(model_name, modalities=modalities)

+ model_text = models[Modality.TEXT_ENCODER]
+ model_image = models[Modality.IMAGE_ENCODER]
+ processor_text = processors[Modality.TEXT_ENCODER]
+ processor_image = processors[Modality.IMAGE_ENCODER]
  ```
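If only one modality is needed, the same `get_model` call can request a single encoder; a minimal sketch under that assumption (e.g., embedding search queries on a text-only server):

```python
# Request only the text tower; whether this also skips fetching the
# image-encoder weights is an assumption, not confirmed by this card.
processors_t, models_t = get_model(model_name, modalities=[Modality.TEXT_ENCODER])
model_text = models_t[Modality.TEXT_ENCODER]
processor_text = processors_t[Modality.TEXT_ENCODER]
```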

+ To encode the content:

  ```python
+ text = 'a cityscape bathed in the warm glow of the sun, with varied architecture and a towering, snow-capped mountain rising majestically in the background'
+ image_url = 'https://media-cdn.tripadvisor.com/media/photo-s/1b/28/6b/53/lovely-armenia.jpg'
+ image = Image.open(BytesIO(requests.get(image_url).content))
+
+ image_data = processor_image(image)
+ text_data = processor_text(text)
+ image_features, image_embedding = model_image.encode(image_data, return_features=True)
+ text_features, text_embedding = model_text.encode(text_data, return_features=True)
  ```
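Both embeddings land in the same vector space, so their semantic compatibility can be scored with cosine similarity, as in the previous revision of this card; a short sketch, assuming the PyTorch backend where `encode` returns `torch.Tensor` values:

```python
import torch.nn.functional as F

# Cosine similarity in [-1, 1]; higher means the caption matches the image better.
similarity = F.cosine_similarity(image_embedding, text_embedding)
```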