nan committed on
Commit
40aa643
·
1 Parent(s): a5838bd

docs: cherry-pick README from pr/17 e3e8a244

Files changed (1)
  1. README.md +259 -60
README.md CHANGED
@@ -1,92 +1,291 @@
1
- # Jina Embeddings V4
2
 
 
 
 
3
 
4
- ## Examples
5
 
6
- Encode functions:
 
 
7
 
8
- ```python
9
- import torch
10
- from transformers import AutoModel
11
- from PIL import Image
12
 
13
- device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
14
 
15
- # Load model
16
- model = AutoModel.from_pretrained('jinaai/jina-embeddings-v4', trust_remote_code=True)
17
- model = model.to(device)
18
 
19
- # Sample data
20
- texts = ["Here is some sample code", "This is a matching text"]
21
- image_paths = ['/<path_to_image>']
22
- images = [Image.open(path) for path in image_paths]
23
 
24
- # Example 1: Text matching task with single vector embeddings
25
- # Generate embeddings with dimension truncation (256), decrease max_pixels
26
- img_embeddings = model.encode_images(images=images, truncate_dim=256, max_pixels=602112, task='text-matching')
27
- text_embeddings = model.encode_texts(texts=texts, truncate_dim=256, max_length=512, task='text-matching')
28
 
29
- # Example 2: Retrieval task with multi-vector embeddings
30
- model.set_task(task='retrieval')
 
 
31
 
32
- # Generate multi-vector embeddings
33
- img_embeddings = model.encode_images(images=images, vector_type='multi_vector')
34
- text_embeddings = model.encode_texts(texts=texts, vector_type='multi_vector', prompt_name='passage')
35
 
36
- # Example 3: Code task with single vector embeddings
37
- code = ["def hello_world():\n print('Hello, World!')"]
38
- code_embeddings = model.encode_texts(texts=code, task='code')
 
 
39
 
40
- ```
41
 
42
- Using the model forward:
43
 
44
- ```python
45
- import torch
46
- from transformers import AutoModel, AutoProcessor
47
- from PIL import Image
48
 
49
- device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
50
 
51
- # Load model and processor
52
- model = AutoModel.from_pretrained('jinaai/jina-embeddings-v4', trust_remote_code=True)
53
- model = model.to(device)
54
- processor = AutoProcessor.from_pretrained('jinaai/jina-embeddings-v4', trust_remote_code=True)
55
 
56
 
57
- # Sample data
58
- texts = ["Here is some sample code", "This is a matching text"]
59
- image_paths = ['/<path_to_image>']
60
 
61
- # Process text and images
62
- text_batch = processor.process_texts(texts=texts, prefix="Query", max_length=512)
63
- images = [Image.open(path) for path in image_paths]
64
- image_batch = processor.process_images(images=images)
65
 
66
- # Forward pass
67
- model.eval()
68
- with torch.no_grad():
69
- text_batch = {k: v.to(device) for k, v in text_batch.items()}
70
- image_batch = {k: v.to(device) for k, v in image_batch.items()}
71
-
72
- with torch.autocast(device_type='cuda' if torch.cuda.is_available() else 'cpu'):
73
- # Get embeddings
74
- text_embeddings = model.model(**text_batch, task_label='retrieval').single_vec_emb
75
- img_embeddings = model.model(**image_batch, task_label='retrieval').single_vec_emb
76
 
 
77

78
  ```
79

80
 
81
- Inference via the `SentenceTransformer` library:
82
 
 
 
 
83
  ```python
84
  from sentence_transformers import SentenceTransformer
85
 
86
- model = SentenceTransformer(
87
- 'jinaai/jina-embeddings-v4', trust_remote_code=True
88
  )
89
 
90
- emb = model.encode(['Khinkali is the best'], task='retrieval', prompt_name='query')
91
 
92
- ```
 
1
+ <br><br>
2
 
3
+ <p align="center">
4
+ <img src="https://huggingface.co/datasets/jinaai/documentation-images/resolve/main/logo.webp" alt="Jina AI: Your Search Foundation, Supercharged!" width="150px">
5
+ </p>
6
 
 
7
 
8
+ <p align="center">
9
+ <b>The embedding model trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
10
+ </p>
11
 
12
+ <p align="center">
13
+ <b>Jina Embeddings v4: Multilingual Multimodal Embeddings</b>
14
+ </p>
 
15
 
 
16
 
17
+ ## Quick Start
 
 
18
 
19
+ [Blog](https://alwaysjudgeabookbyitscover.com/) | [Technical Report](https://puginarug.com) | [API](https://jina.ai/embeddings)
 
 
 
20
 
 
 
 
 
21
 
22
+ ## Intended Usage & Model Info
23
+ `jina-embeddings-v4` is a multilingual, multimodal embedding model designed for unified representation of text and images.
24
+ The model is specialized for complex document retrieval, including visually rich documents with charts, tables, and illustrations.
25
+ Embeddings produced by `jina-embeddings-v4` serve as the backbone for neural information retrieval and multimodal GenAI applications.
26
 
 
 
 
27
 
28
+ Built on [Qwen/Qwen2.5-VL-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct), `jina-embeddings-v4` has the following features:
29
+ - **Unified embeddings** for text, images, and documents, supporting both dense (single-vector) and late-interaction (multi-vector) retrieval.
30
+ - **Multilingual support** (20+ languages) and compatibility with a wide range of domains, including technical and visually complex documents.
31
+ - **Task-specific adapters** for retrieval, text matching, and code-related tasks, which can be selected at inference time.
32
+ - **Flexible embedding size**: dense embeddings are 2048 dimensions by default but can be truncated to as low as 128 with minimal performance loss.
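
The flexible embedding size described above can be illustrated with a small sketch. It assumes the usual Matryoshka convention (keep the leading components, then rescale to unit length); the README does not confirm that the model's own `truncate_dim` option works exactly this way, and `truncate_embedding` is a hypothetical helper, not part of the model's API.

```python
import math
import random


def truncate_embedding(vec, dim):
    """Keep the first `dim` components of `vec` and rescale them to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        # An all-zero prefix cannot be normalized; return it unchanged.
        return head
    return [x / norm for x in head]


# Stand-in for a 2048-dimensional dense embedding (random values, illustration only).
random.seed(0)
full = [random.gauss(0.0, 1.0) for _ in range(2048)]

small = truncate_embedding(full, 128)
print(len(small))  # 128
```

Renormalizing after truncation keeps cosine and dot-product scores comparable across the truncated sizes.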
33
 
 
34
 
35
+ Summary of features:
36
+ | Feature | Jina Embeddings V4 |
37
+ |------------|------------|
38
+ | Base Model | Qwen2.5-VL-3B-Instruct |
39
+ | Supported Tasks | `retrieval`, `text-matching`, `code` |
40
+ | Model DType | BFloat16 |
41
+ | Max Sequence Length | 32768 |
42
+ | Single-Vector Dimension | 2048 |
43
+ | Multi-Vector Dimension | 128 |
44
+ | Matryoshka dimensions | 128, 256, 512, 1024, 2048 |
45
+ | Attention Mechanism | FlashAttention2 |
46
+ | Pooling Strategy | Mean pooling |
47
+
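
The table lists a multi-vector output with 128 dimensions per vector, intended for late-interaction retrieval. The sketch below shows one common way such outputs are compared, a ColBERT-style MaxSim score. This is an assumption about how the multi-vector embeddings would typically be used; the README does not state which scoring function is intended, and `maxsim_score` is a hypothetical helper.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query vector take its best match
    among the document vectors, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy 2-dimensional token vectors (real multi-vector outputs would be 128-dimensional).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]  # covers both query vectors
doc_b = [[1.0, 0.0]]              # covers only the first query vector

print(maxsim_score(query, doc_a))  # 2.0
print(maxsim_score(query, doc_b))  # 1.0
```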
48
 
 
 
 
 
49
 
50
+ ## Training, Data, Parameters
51
 
52
+ Please refer to our [technical report of jina-embeddings-v4](https://puginarug.com) for the model and training details.
 
 
 
53
 
54
 
55
+ ## Usage
 
 
56
 
57
+ <details>
58
+ <summary>Requirements</summary>
59
+
60
+ The following Python packages are required:
61
+ - `transformers>=4.52.0`
62
+ - `torch>=2.6.0`
63
+ - `peft>=0.15.2`
64
+ - `torchvision`
65
+ - `pillow`
66
+
67
+ ### Optional / Recommended
68
+ - **flash-attention**: Installing [flash-attention](https://github.com/Dao-AILab/flash-attention) is recommended for improved inference speed and efficiency, but not mandatory.
69
+ - **sentence-transformers**: If you want to use the model via the `sentence-transformers` interface, install this package as well.
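
A quick way to see which of the required packages are present is to query the installed distribution metadata. This is a minimal sketch using only the standard library; it reports versions and leaves the comparison against the minimums listed above to the reader. The distribution names are assumed to match the package names in the list.

```python
from importlib import metadata

# Distribution names taken from the requirements list above.
REQUIRED = ["transformers", "torch", "peft", "torchvision", "pillow"]


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'not installed'}")
```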
70
 
 
 
 
 
 
 
 
 
 
 
71
 
72
+ </details>
73
 
74
+
75
+ <details>
76
+ <summary>via Jina AI <a href="https://jina.ai/embeddings/">Embedding API</a></summary>
77
+
78
+ Note: this example still needs to be adjusted for V4.
79
+ ```bash
80
+ curl https://api.jina.ai/v1/embeddings \
81
+ -H "Content-Type: application/json" \
82
+ -H "Authorization: Bearer [JINA_AI_API_TOKEN]" \
83
+ -d @- <<EOFEOF
84
+ {
85
+ "model": "jina-embeddings-v4",
86
+ "dimensions": 1024,
87
+ "task": "retrieval.query",
88
+ "normalized": true,
89
+ "embedding_type": "float",
90
+ "input": [
91
+ {
92
+ "text": "غروب جميل على الشاطئ"
93
+ },
94
+ {
95
+ "text": "海滩上美丽的日落"
96
+ },
97
+ {
98
+ "text": "A beautiful sunset over the beach"
99
+ },
100
+ {
101
+ "text": "Un beau coucher de soleil sur la plage"
102
+ },
103
+ {
104
+ "text": "Ein wunderschöner Sonnenuntergang am Strand"
105
+ },
106
+ {
107
+ "text": "Ένα όμορφο ηλιοβασίλεμα πάνω από την παραλία"
108
+ },
109
+ {
110
+ "text": "समुद्र तट पर एक खूबसूरत सूर्यास्त"
111
+ },
112
+ {
113
+ "text": "Un bellissimo tramonto sulla spiaggia"
114
+ },
115
+ {
116
+ "text": "浜辺に沈む美しい夕日"
117
+ },
118
+ {
119
+ "text": "해변 위로 아름다운 일몰"
120
+ },
121
+ {
122
+ "image": "https://i.ibb.co/nQNGqL0/beach1.jpg"
123
+ },
124
+ {
125
+ "image": "https://i.ibb.co/r5w8hG8/beach2.jpg"
126
+ }
127
+ ]
128
+ }
129
+ EOFEOF
130
  ```
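
For Python callers, the same request can be assembled with the standard library. This sketch only mirrors the fields of the curl command above, which the README itself flags as not yet adjusted for V4, so treat the payload as provisional. The `build_request` helper and the `JINA_AI_API_TOKEN` environment variable are assumptions made for illustration.

```python
import json
import os
import urllib.request

API_URL = "https://api.jina.ai/v1/embeddings"


def build_request(inputs, token, task="retrieval.query", dimensions=1024):
    """Build a POST request carrying the same JSON fields as the curl example."""
    payload = {
        "model": "jina-embeddings-v4",
        "dimensions": dimensions,
        "task": task,
        "normalized": True,
        "embedding_type": "float",
        "input": inputs,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    # Only send the request when a token is available in the environment.
    token = os.environ.get("JINA_AI_API_TOKEN")
    if token:
        request = build_request([{"text": "A beautiful sunset over the beach"}], token)
        with urllib.request.urlopen(request) as response:
            print(response.status)
```

Text and image inputs use the same `input` list, with `{"text": ...}` or `{"image": ...}` entries as in the curl body.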
131
 
132
+ </details>
133
+
134
+ <details>
135
+ <summary>via <a href="https://huggingface.co/docs/transformers/en/index">transformers</a></summary>
136
+
137
+ ```python
138
+ # !pip install "transformers>=4.52.0" "torch>=2.6.0" "peft>=0.15.2" torchvision pillow
139
+ # !pip install
140
+ from transformers import AutoModel
141
+
142
+ # Initialize the model
143
+ model = AutoModel.from_pretrained("jinaai/jina-embeddings-v4", trust_remote_code=True)
144
+ # ========================
145
+ # 1. Retrieval Task
146
+ # ========================
147
+ # Configure truncate_dim, max_length (for texts), max_pixels (for images), vector_type, batch_size in the encode function if needed
148
+
149
+ # Encode query
150
+ query_embedding = model.encode_texts(
151
+ texts=["Overview of climate change impacts on coastal cities"],
152
+ task="retrieval",
153
+ prompt_name="query",
154
+ )[0]
155
+
156
+ # Encode passage (text)
157
+ passage_embedding = model.encode_texts(
158
+ texts=[
159
+ "Climate change has led to rising sea levels, increased frequency of extreme weather events..."
160
+ ],
161
+ task="retrieval",
162
+ prompt_name="passage",
163
+ )[0]
164
+
165
+ # Encode image/document
166
+ image_embedding = model.encode_images(
167
+ images=["https://i.ibb.co/nQNGqL0/beach1.jpg"],
168
+ task="retrieval",
169
+ )[0]
170
+
171
+ # ========================
172
+ # 2. Text Matching Task
173
+ # ========================
174
+ texts = [
175
+ "غروب جميل على الشاطئ", # Arabic
176
+ "海滩上美丽的日落", # Chinese
177
+ "Un beau coucher de soleil sur la plage", # French
178
+ "Ein wunderschöner Sonnenuntergang am Strand", # German
179
+ "Ένα όμορφο ηλιοβασίλεμα πάνω από την παραλία", # Greek
180
+ "समुद्र तट पर एक खूबसूरत सूर्यास्त", # Hindi
181
+ "Un bellissimo tramonto sulla spiaggia", # Italian
182
+ "浜辺に沈む美しい夕日", # Japanese
183
+ "해변 위로 아름다운 일몰", # Korean
184
+ ]
185
+
186
+ text_embeddings = model.encode_texts(texts=texts, task="text-matching")
187
+
188
+ # ========================
189
+ # 3. Code Understanding Task
190
+ # ========================
191
+
192
+ # Encode query
193
+ query_embedding = model.encode_texts(
194
+ texts=["Find a function that prints a greeting message to the console"],
195
+ task="code",
196
+ prompt_name="query",
197
+ )
198
 
199
+ # Encode code
200
+ code_embeddings = model.encode_texts(
201
+ texts=["def hello_world():\n print('Hello, World!')"],
202
+ task="code",
203
+ prompt_name="passage",
204
+ )
205
+ ```
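
Once query and passage embeddings are available, retrieval comes down to comparing vectors. The sketch below ranks passages by cosine similarity on plain Python lists. It assumes single-vector embeddings have already been converted to lists of floats (the README does not specify the return type of `encode_texts`), and `cosine_similarity` and `rank_passages` are hypothetical helpers.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_passages(query_vec, passage_vecs):
    """Return passage indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query_vec, p) for p in passage_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


# Toy 2-dimensional vectors standing in for real embeddings.
query_vec = [1.0, 0.0]
passage_vecs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(rank_passages(query_vec, passage_vecs))  # [1, 2, 0]
```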
206
+ </details>
207
 
208
+ <details>
209
+ <summary>via <a href="https://sbert.net/">sentence-transformers</a></summary>
210
+
211
  ```python
212
  from sentence_transformers import SentenceTransformer
213
 
214
+ # Initialize the model
215
+ model = SentenceTransformer("jinaai/jina-embeddings-v4", trust_remote_code=True)
216
+ # ========================
217
+ # 1. Retrieval Task
218
+ # ========================
219
+ # Encode query
220
+ query_embedding = model.encode(
221
+ sentences=["Overview of climate change impacts on coastal cities"],
222
+ task="retrieval",
223
+ prompt_name="query",
224
+ )[0]
225
+
226
+ # Encode passage (text)
227
+ passage_embedding = model.encode(
228
+ sentences=[
229
+ "Climate change has led to rising sea levels, increased frequency of extreme weather events..."
230
+ ],
231
+ task="retrieval",
232
+ prompt_name="passage",
233
+ )[0]
234
+
235
+ # Encode image/document
236
+ image_embedding = model.encode(
237
+ sentences=["https://i.ibb.co/nQNGqL0/beach1.jpg"],
238
+ task="retrieval",
239
+ )[0]
240
+
241
+ # ========================
242
+ # 2. Text Matching Task
243
+ # ========================
244
+ texts = [
245
+ "غروب جميل على الشاطئ", # Arabic
246
+ "海滩上美丽的日落", # Chinese
247
+ "Un beau coucher de soleil sur la plage", # French
248
+ "Ein wunderschöner Sonnenuntergang am Strand", # German
249
+ "Ένα όμορφο ηλιοβασίλεμα πάνω από την παραλία", # Greek
250
+ "समुद्र तट पर एक खूबसूरत सूर्यास्त", # Hindi
251
+ "Un bellissimo tramonto sulla spiaggia", # Italian
252
+ "浜辺に沈む美しい夕日", # Japanese
253
+ "해변 위로 아름다운 일몰", # Korean
254
+ ]
255
+
256
+ text_embeddings = model.encode(sentences=texts, task="text-matching")
257
+
258
+ # ========================
259
+ # 3. Code Understanding Task
260
+ # ========================
261
+
262
+ # Encode query
263
+ query_embedding = model.encode(
264
+ sentences=["Find a function that prints a greeting message to the console"],
265
+ task="code",
266
+ prompt_name="query",
267
  )
268
 
269
+ # Encode code
270
+ code_embeddings = model.encode(
271
+ sentences=["def hello_world():\n print('Hello, World!')"],
272
+ task="code",
273
+ prompt_name="passage",
274
+ )
275
+ ```
276
+ </details>
277
+
278
+
279
+ ## License
280
+
281
+ This model is licensed to download and run under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/deed.en). It is available for commercial use via the [Jina Embeddings API](https://jina.ai/embeddings/), [AWS](https://longdogechallenge.com/), [Azure](https://longdogechallenge.com/), and [GCP](https://longdogechallenge.com/). To download for commercial use, please [contact us](https://jina.ai/contact-sales).
282
+
283
+
284
+ ## Contact
285
+
286
+ Join our [Discord community](https://discord.jina.ai) and chat with other community members about ideas.
287
+
288
+
289
+ ## Citation
290
 
291
+ If you find `jina-embeddings-v4` useful in your research, please cite the following paper: