jinaai/jina-embeddings-v4


jupyterjazz committed 27 days ago

Commit a8a6bf2 · 1 Parent(s): fae0273

docs: update vdr info

Signed-off-by: jupyterjazz <[email protected]>

Files changed (2)

README.md +4 -0
vidore_eval.md +0 -26

README.md CHANGED

@@ -308,6 +308,10 @@ code_embeddings = model.encode(
 </details>
+## Jina-VDR
+We’re releasing Jina VDR, a multilingual, multi-domain benchmark for visual document retrieval, alongside jina-embeddings-v4. The task collection can be viewed [here](https://huggingface.co/collections/jinaai/jinavdr-visual-document-retrieval-684831c022c53b21c313b449), and evaluation instructions can be found [here](https://github.com/jina-ai/jina-vdr).
 ## License
 This model is licensed to download and run under [CC BY-NC 4.0](https://creativecommons.org/licenses/by-nc/4.0/deed.en). It is available for commercial use via the [Jina Embeddings API](https://jina.ai/embeddings/), [AWS](https://longdogechallenge.com/), [Azure](https://longdogechallenge.com/), and [GCP](https://longdogechallenge.com/). To download for commercial use, please [contact us](https://jina.ai/contact-sales).

vidore_eval.md DELETED

@@ -1,26 +0,0 @@
-# How to run the Vidore Evaluation
-If you want to run the vidore evaluation on the jina-embeddings-v4 model (and on the Document Retrieval Benchmark curated by Jina AI), you need to install requirements in [this fork/branch](https://github.com/jina-ai/vidore-benchmark-fork/tree/feat-add-jina-embeddings) (these changes should be merged with the source code of Vidore soon).
-```
-pip install vidore-benchmark[jina-v4]
-```
-You can run the evaluation with the following command:
-```
-vidore-benchmark evaluate-retriever \
-    --model-class jev4 \
-    --model-name jinaai/jina-embeddings-v4 \
-    --collection-name jinaai/jinavdr-visual-document-retrieval-684831c022c53b21c313b449 \
-    --dataset-format qa \
-    --split test
-```
-## Evaluate Pure Text Retrieval Models on Refined Vidore Tasks
-The original Vidore datasets contain multiple text chunks per image so that text retrieval models can be evaluated on them.
-Those text chunks are extracted from the document pages using different tools like [Unstructured](https://github.com/Unstructured-IO/unstructured), OCR models, and LLMs.
-For evaluating text retrieval models on our filtered versions of the Vidore datasets, you can use the datasets in the collection `https://huggingface.co/collections/jinaai/jina-vdr-vidoreocr-tasks-6852cfc55ccf837e7fecfa1b`.
-It is also possible to evaluate jina-embeddings-v4 and other vision retrieval models on them. This, however, takes more time and should lead to the same evaluation results as running the versions of the datasets in the Jina VDR collection.
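
Note for readers of the removed instructions: the sketch below illustrates the kind of score a retrieval evaluation like this typically reports. It is not taken from the commit or from the vidore-benchmark code. It assumes a binary-relevance nDCG@k metric and cosine-similarity ranking, and the names `cosine`, `ndcg_at_k`, `doc_vecs`, and `relevant_ids` are hypothetical. The toy vectors stand in for embeddings that a model such as jina-embeddings-v4 would produce.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def ndcg_at_k(query_vec, doc_vecs, relevant_ids, k=5):
    """Binary-relevance nDCG@k for one query.

    doc_vecs: dict mapping a (hypothetical) document id to its embedding.
    relevant_ids: set of document ids judged relevant to the query.
    """
    # Rank documents by similarity to the query and keep the top k.
    ranked = sorted(
        doc_vecs,
        key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
        reverse=True,
    )[:k]
    # Discounted gain: a relevant hit at 0-based rank r contributes 1 / log2(r + 2).
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, doc_id in enumerate(ranked)
        if doc_id in relevant_ids
    )
    # Ideal DCG: all relevant documents placed at the top of the ranking.
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0


# Toy example: three "pages" embedded in 2-D, one query.
docs = {"page_a": [1.0, 0.0], "page_b": [0.0, 1.0], "page_c": [1.0, 1.0]}
query = [1.0, 0.0]
score = ndcg_at_k(query, docs, relevant_ids={"page_a"}, k=5)
```

A benchmark score would then be the mean of this per-query value over all queries in a split; how the actual tooling aggregates results is not stated in this diff.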