marco committed
Commit 68a719e · verified · 1 Parent(s): bbfaa75

Update README.md

Files changed (1)
  1. README.md +68 -1
README.md CHANGED

@@ -25,7 +25,44 @@ The multilingual version is available [here](https://huggingface.co/llamaindex/v

# Usage

- **Initialize model and processor**
+ The model uses bf16 tensors and allocates ~4.4 GB of VRAM when loaded. You can easily run inference and generate embeddings with 768 image patches and a batch size of 16, even on a cheap NVIDIA T4 GPU. The table below reports the memory footprint (GB) at different batch sizes with HuggingFace Transformers and a maximum of 768 image patches; a short batching sketch follows the table.
+
+ | Batch Size | GPU Memory (GB) |
+ |------------|-----------------|
+ | 4          | 6.9             |
+ | 8          | 8.8             |
+ | 16         | 11.5            |
+ | 32         | 19.7            |
+
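As a rough illustration of those figures, here is a minimal sketch of batched encoding at batch size 16. It assumes the `encode_documents` helper defined later in this README; the folder path and the embedding dimension are illustrative placeholders, not part of the model card.

```python
# Minimal batching sketch (illustrative, not from the model card).
# Assumes the encode_documents() helper shown later in this README.
from pathlib import Path

import torch
from PIL import Image

def embed_folder(folder: str, batch_size: int = 16, dimension: int = 1536) -> torch.Tensor:
    paths = sorted(Path(folder).glob("*.png"))
    batches = []
    for i in range(0, len(paths), batch_size):
        images = [Image.open(p).convert("RGB") for p in paths[i : i + batch_size]]
        batches.append(encode_documents(images, dimension))  # helper defined later in this README
    return torch.cat(batches, dim=0)

# page_embeddings = embed_folder("document_pages/")
```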
+ You can generate embeddings with this model in many different ways:
+
+ <details open>
+ <summary>
+ via LlamaIndex
+ </summary>
+
+ ```bash
+ pip install -U llama-index-embeddings-huggingface
+ ```
+
+ ```python
+ from llama_index.embeddings.huggingface import HuggingFaceEmbedding
+
+ model = HuggingFaceEmbedding(
+     model_name="llamaindex/vdr-2b-v1",
+     device="mps",  # or "cuda" / "cpu", depending on your hardware
+     trust_remote_code=True,
+ )
+
+ embeddings = model.get_image_embedding("image.png")
+ ```
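For retrieval you will also want query-side embeddings. Here is a small follow-up sketch, assuming the standard LlamaIndex `get_query_embedding` method; the query text and the cosine-similarity computation are illustrative additions, not part of the model card.

```python
import numpy as np

# Embed a text query and a page image with the same model (query text is illustrative)
query_emb = np.asarray(model.get_query_embedding("What was the Q3 revenue?"))
image_emb = np.asarray(model.get_image_embedding("image.png"))

# Cosine similarity between the query and the page
score = float(query_emb @ image_emb / (np.linalg.norm(query_emb) * np.linalg.norm(image_emb)))
print(score)
```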
+
+ </details>
+
+ <details>
+ <summary>
+ via HuggingFace Transformers
+ </summary>

```python
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

@@ -41,6 +78,7 @@ min_pixels = 1 * 28 * 28
# Load the embedding model and processor
model = Qwen2VLForConditionalGeneration.from_pretrained(
    'llamaindex/vdr-2b-v1',
+     # These are the recommended kwargs for the model, but change them as needed
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="cuda:0"

@@ -100,6 +138,7 @@ def encode_queries(queries: list[str], dimension: int) -> torch.Tensor:
```

**Encode documents**
+
```python
def round_by_factor(number: float, factor: int) -> int:
    return round(number / factor) * factor

@@ -162,6 +201,34 @@ def encode_documents(documents: list[Image.Image], dimension: int):
    return torch.nn.functional.normalize(embeddings[:, :dimension], p=2, dim=-1)
```
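Putting the two helpers together, here is a brief illustrative sketch of end-to-end scoring. The query strings, image files, and the 768-dimensional truncation are placeholders; since both helpers return L2-normalized embeddings, the dot product is the cosine similarity.

```python
# Illustrative end-to-end scoring with the encode_queries/encode_documents helpers above.
from PIL import Image

queries = ["What was the Q3 revenue?", "Where is the architecture diagram?"]
pages = [Image.open("page_1.png"), Image.open("page_2.png")]

query_embeddings = encode_queries(queries, dimension=768)   # (num_queries, 768)
page_embeddings = encode_documents(pages, dimension=768)    # (num_pages, 768)

# Embeddings are L2-normalized, so the dot product is the cosine similarity
scores = query_embeddings @ page_embeddings.T               # (num_queries, num_pages)
print(scores)
```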

+ </details>
+
+
+ <details>
+ <summary>
+ via SentenceTransformers
+ </summary>
+
+ ```python
+ import torch
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer(
+     model_name_or_path="llamaindex/vdr-2b-v1",
+     device="mps",  # or "cuda" / "cpu", depending on your hardware
+     trust_remote_code=True,
+     # These are the recommended kwargs for the model, but change them as needed
+     model_kwargs={
+         "torch_dtype": torch.bfloat16,
+         "device_map": "cuda:0",
+         "attn_implementation": "flash_attention_2"
+     },
+ )
+
+ embeddings = model.encode("image.png")
+ ```
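As a follow-up sketch (not from the model card), several pages can be encoded in a single call and compared with the `similarity` utility available in recent sentence-transformers releases; the file names are placeholders.

```python
# Encode a few page images and compute pairwise cosine similarities.
# Assumes a recent sentence-transformers release that provides model.similarity().
page_embeddings = model.encode(["page_1.png", "page_2.png", "page_3.png"])
scores = model.similarity(page_embeddings, page_embeddings)
print(scores)
```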
+
+ </details>
+
# Training

The model is based on [MrLight/dse-qwen2-2b-mrl-v1](https://huggingface.co/MrLight/dse-qwen2-2b-mrl-v1) and was trained on the English subset of the new [vdr-multilingual-train](https://huggingface.co/datasets/llamaindex/vdr-multilingual-train) dataset, which consists of 100k high-quality samples. It was trained for 1 epoch using the [DSE approach](https://arxiv.org/abs/2406.11251), with a batch size of 128 and hard-mined negatives.
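
For readers unfamiliar with that recipe, below is a minimal sketch of an InfoNCE-style contrastive loss with hard-mined negatives, in the spirit of the DSE setup described above. It is an illustration only, not the actual training code; the temperature and the number of negatives are arbitrary choices.

```python
# Illustrative InfoNCE-style contrastive loss with hard-mined negatives (not the authors' code).
import torch
import torch.nn.functional as F

def contrastive_loss(
    query_emb: torch.Tensor,    # (B, D) L2-normalized query embeddings
    pos_doc_emb: torch.Tensor,  # (B, D) embeddings of the matching pages
    neg_doc_emb: torch.Tensor,  # (B, K, D) embeddings of K hard-mined negative pages
    temperature: float = 0.02,
) -> torch.Tensor:
    pos_scores = (query_emb * pos_doc_emb).sum(dim=-1, keepdim=True)   # (B, 1)
    neg_scores = torch.einsum("bd,bkd->bk", query_emb, neg_doc_emb)    # (B, K)
    logits = torch.cat([pos_scores, neg_scores], dim=1) / temperature  # (B, 1 + K)
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)                             # positive is class 0
```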