Update README.md
README.md
CHANGED
@@ -25,7 +25,44 @@ The multilingual version is available [here](https://huggingface.co/llamaindex/v

 # Usage

-
+The model uses bf16 tensors and allocates ~4.4GB of VRAM when loaded. You can easily run inference and generate embeddings with up to 768 image patches and a batch size of 16 even on a cheap NVIDIA T4 GPU. The table below reports the memory footprint (GB) at different batch sizes with HuggingFace Transformers and a maximum of 768 image patches.
+
+| Batch Size | GPU Memory (GB) |
+|------------|-----------------|
+| 4          | 6.9             |
+| 8          | 8.8             |
+| 16         | 11.5            |
+| 32         | 19.7            |
+
+You can generate embeddings with this model in many different ways:
+
+<details open>
+<summary>
+via LlamaIndex
+</summary>
+
+```bash
+pip install -U llama-index-embeddings-huggingface
+```
+
+```python
+from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+
+model = HuggingFaceEmbedding(
+    model_name_or_path="llamaindex/vdr-2b-v1",
+    device="mps",
+    trust_remote_code=True,
+)
+
+embeddings = model.get_image_embedding("image.png")
+```
+
+</details>
+
+<details>
+<summary>
+via HuggingFace Transformers
+</summary>
+
 ```python
 from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
@@ -41,6 +78,7 @@ min_pixels = 1 * 28 * 28
 # Load the embedding model and processor
 model = Qwen2VLForConditionalGeneration.from_pretrained(
     'llamaindex/vdr-2b-v1',
+    # These are the recommended kwargs for the model, but change them as needed
     attn_implementation="flash_attention_2",
     torch_dtype=torch.bfloat16,
     device_map="cuda:0"
@@ -100,6 +138,7 @@ def encode_queries(queries: list[str], dimension: int) -> torch.Tensor:
 ```

 **Encode documents**
+
 ```python
 def round_by_factor(number: float, factor: int) -> int:
     return round(number / factor) * factor
@@ -162,6 +201,34 @@ def encode_documents(documents: list[Image.Image], dimension: int):
     return torch.nn.functional.normalize(embeddings[:, :dimension], p=2, dim=-1)
 ```

+</details>
+
+
+<details>
+<summary>
+via SentenceTransformers
+</summary>
+
+```python
+from sentence_transformers import SentenceTransformer
+
+model = SentenceTransformer(
+    model_name_or_path="llamaindex/vdr-2b-v1",
+    device="mps",
+    trust_remote_code=True,
+    # These are the recommended kwargs for the model, but change them as needed
+    model_kwargs={
+        "torch_dtype": torch.bfloat16,
+        "device_map": "cuda:0",
+        "attn_implementation": "flash_attention_2"
+    },
+)
+
+embeddings = model.encode("image.png")
+```
+
+</details>
+
 # Training

 The model is based on [MrLight/dse-qwen2-2b-mrl-v1](https://huggingface.co/MrLight/dse-qwen2-2b-mrl-v1) and was trained on the English subset of the new [vdr-multilingual-train](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train) dataset, which consists of 100k high-quality samples. It was trained for 1 epoch using the [DSE approach](https://arxiv.org/abs/2406.11251), with a batch size of 128 and hard-mined negatives.
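For readers unfamiliar with the [DSE approach](https://arxiv.org/abs/2406.11251), the objective is a contrastive (InfoNCE-style) loss between query embeddings and page embeddings: each query's positive is its matching page, and the negatives are the other pages in the batch plus the hard-mined ones. The snippet below is only a minimal sketch of that objective; the temperature and the exact way negatives are pooled are assumptions, not details taken from this model card.

```python
import torch
import torch.nn.functional as F

def dse_style_contrastive_loss(
    query_emb: torch.Tensor,     # (B, D) L2-normalized query embeddings
    doc_emb: torch.Tensor,       # (B, D) L2-normalized embeddings of the matching pages
    hard_neg_emb: torch.Tensor,  # (B, D) L2-normalized embeddings of hard-mined negative pages
    temperature: float = 0.02,   # assumed value, not taken from the model card
) -> torch.Tensor:
    # Candidate pool: every positive page in the batch (serving as in-batch negatives
    # for the other queries) plus the hard-mined negatives.
    candidates = torch.cat([doc_emb, hard_neg_emb], dim=0)   # (2B, D)
    logits = query_emb @ candidates.T / temperature          # (B, 2B)
    # The matching page for query i sits at column i of the logits matrix.
    labels = torch.arange(query_emb.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)
```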
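A note on the 768-image-patch figure from the Usage section above: each Qwen2-VL image patch covers 28x28 pixels, so the patch budget is normally enforced through the processor's `min_pixels`/`max_pixels` arguments (the `min_pixels = 1 * 28 * 28` line visible in one of the hunk headers is part of the README's Transformers example). The snippet below is a sketch of how that cap might be set; the `max_pixels` value is an assumption derived from the 768-patch figure, not a line taken from the README.

```python
from transformers import AutoProcessor

# Each Qwen2-VL image patch covers 28x28 pixels, so capping the pixel budget caps the
# number of patches (and therefore the memory footprint reported in the table above).
min_pixels = 1 * 28 * 28     # matches the line shown in the hunk header above
max_pixels = 768 * 28 * 28   # assumed mapping of the "768 image patches" figure

processor = AutoProcessor.from_pretrained(
    'llamaindex/vdr-2b-v1',
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```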
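Finally, a hypothetical end-to-end retrieval sketch using the `encode_queries` and `encode_documents` helpers whose signatures appear in the hunk headers above. Both return L2-normalized embeddings, so the matrix product below is a cosine-similarity matrix; the query string, the page file names, and `dimension=1536` are illustrative placeholders rather than values from the README.

```python
from PIL import Image

# `pages` would be PIL screenshots of the documents to index.
pages = [Image.open("page_1.png"), Image.open("page_2.png")]

query_emb = encode_queries(["total revenue for 2023"], dimension=1536)  # (1, 1536)
doc_emb = encode_documents(pages, dimension=1536)                       # (2, 1536)

# Both helpers return L2-normalized embeddings, so the dot product is cosine similarity.
scores = query_emb @ doc_emb.T        # (num_queries, num_pages)
best_page = scores.argmax(dim=-1)     # index of the best-matching page per query
```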