Update README.md
README.md
CHANGED
@@ -25,7 +25,44 @@ The multilingual version is available [here](https://huggingface.co/llamaindex/v

 # Usage

-
+The model uses bf16 tensors and allocates ~4.4GB of VRAM when loaded. You can easily run inference and generate embeddings with up to 768 image patches and a batch size of 16 even on a cheap NVIDIA T4 GPU. The table below reports the memory footprint (GB) at different batch sizes with HuggingFace Transformers and a maximum of 768 image patches.
+
+| Batch Size | GPU Memory (GB) |
+|------------|-----------------|
+| 4          | 6.9             |
+| 8          | 8.8             |
+| 16         | 11.5            |
+| 32         | 19.7            |
+
+You can generate embeddings with this model in many different ways:
+
+<details open>
+<summary>
+via LlamaIndex
+</summary>
+
+```bash
+pip install -U llama-index-embeddings-huggingface
+```
+
+```python
+from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+
+model = HuggingFaceEmbedding(
+    model_name_or_path="llamaindex/vdr-2b-v1",
+    device="mps",
+    trust_remote_code=True,
+)
+
+embeddings = model.get_image_embedding("image.png")
+```
+
+</details>
+
+<details>
+<summary>
+via HuggingFace Transformers
+</summary>
+
 ```python
 from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
@@ -41,6 +78,7 @@ min_pixels = 1 * 28 * 28
 # Load the embedding model and processor
 model = Qwen2VLForConditionalGeneration.from_pretrained(
     'llamaindex/vdr-2b-v1',
+    # These are the recommended kwargs for the model, but change them as needed
     attn_implementation="flash_attention_2",
     torch_dtype=torch.bfloat16,
     device_map="cuda:0"
@@ -100,6 +138,7 @@ def encode_queries(queries: list[str], dimension: int) -> torch.Tensor:
 ```

 **Encode documents**
+
 ```python
 def round_by_factor(number: float, factor: int) -> int:
     return round(number / factor) * factor
@@ -162,6 +201,34 @@ def encode_documents(documents: list[Image.Image], dimension: int):
     return torch.nn.functional.normalize(embeddings[:, :dimension], p=2, dim=-1)
 ```

+</details>
+
+
+<details>
+<summary>
+via SentenceTransformers
+</summary>
+
+```python
+from sentence_transformers import SentenceTransformer
+
+model = SentenceTransformer(
+    model_name_or_path="llamaindex/vdr-2b-v1",
+    device="mps",
+    trust_remote_code=True,
+    # These are the recommended kwargs for the model, but change them as needed
+    model_kwargs={
+        "torch_dtype": torch.bfloat16,
+        "device_map": "cuda:0",
+        "attn_implementation": "flash_attention_2"
+    },
+)
+
+embeddings = model.encode("image.png")
+```
+
+</details>
+
 # Training

 The model is based on [MrLight/dse-qwen2-2b-mrl-v1](https://huggingface.co/MrLight/dse-qwen2-2b-mrl-v1) and was trained on the English subset of the new [vdr-multilingual-train](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train) dataset, which consists of 100k high-quality samples. It was trained for 1 epoch using the [DSE approach](https://arxiv.org/abs/2406.11251), with a batch size of 128 and hard-mined negatives.
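For readers unfamiliar with the [DSE approach](https://arxiv.org/abs/2406.11251), the objective is a contrastive (InfoNCE-style) loss between query embeddings and page embeddings: each query's positive is its matching page, and the negatives are the other pages in the batch plus the hard-mined ones. The snippet below is only a minimal sketch of that objective; the temperature and the exact way negatives are pooled are assumptions, not details taken from this model card.

```python
import torch
import torch.nn.functional as F

def dse_style_contrastive_loss(
    query_emb: torch.Tensor,     # (B, D) L2-normalized query embeddings
    doc_emb: torch.Tensor,       # (B, D) L2-normalized embeddings of the matching pages
    hard_neg_emb: torch.Tensor,  # (B, D) L2-normalized embeddings of hard-mined negative pages
    temperature: float = 0.02,   # assumed value, not taken from the model card
) -> torch.Tensor:
    # Candidate pool: every positive page in the batch (serving as in-batch negatives
    # for the other queries) plus the hard-mined negatives.
    candidates = torch.cat([doc_emb, hard_neg_emb], dim=0)   # (2B, D)
    logits = query_emb @ candidates.T / temperature          # (B, 2B)
    # The matching page for query i sits at column i of the logits matrix.
    labels = torch.arange(query_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```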
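A note on the 768-image-patch figure from the Usage section above: each Qwen2-VL image patch covers 28x28 pixels, so the patch budget is normally enforced through the processor's `min_pixels`/`max_pixels` arguments (the `min_pixels = 1 * 28 * 28` line visible in one of the hunk headers is part of the README's Transformers example). The snippet below is a sketch of how that cap might be set; the `max_pixels` value is an assumption derived from the 768-patch figure, not a line taken from the README.

```python
from transformers import AutoProcessor

# Each Qwen2-VL image patch covers 28x28 pixels, so capping the pixel budget caps the
# number of patches (and therefore the memory footprint reported in the table above).
min_pixels = 1 * 28 * 28     # matches the line shown in the hunk header above
max_pixels = 768 * 28 * 28   # assumed mapping of the "768 image patches" figure

processor = AutoProcessor.from_pretrained(
    'llamaindex/vdr-2b-v1',
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```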
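Finally, a hypothetical end-to-end retrieval sketch using the `encode_queries` and `encode_documents` helpers whose signatures appear in the hunk headers above. Both return L2-normalized embeddings, so the matrix product below is a cosine-similarity matrix; the query string, the page file names, and `dimension=1536` are illustrative placeholders rather than values from the README.

```python
from PIL import Image

# `pages` would be PIL screenshots of the documents to index.
pages = [Image.open("page_1.png"), Image.open("page_2.png")]

query_emb = encode_queries(["total revenue for 2023"], dimension=1536)  # (1, 1536)
doc_emb = encode_documents(pages, dimension=1536)                       # (2, 1536)

# Both helpers return L2-normalized embeddings, so the dot product is cosine similarity.
scores = query_emb @ doc_emb.T        # (num_queries, num_pages)
best_page = scores.argmax(dim=-1)     # index of the best-matching page per query
```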