VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
VDocRAG is a new RAG framework that can directly understand diverse real-world documents purely from visual features. It was introduced in the paper VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents by Tanaka et al. and first released in this repository.
Key Enhancements of VDocRAG:
- New Pretraining Tasks: We propose novel self-supervised pre-training tasks (RCR and RCG) that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents.
- New Dataset: We introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats (a quick loading sketch follows below).
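Both resources are available on the Hugging Face Hub and can be inspected with the datasets library. This is a minimal sketch, assuming default configurations and a train split; the configuration and split names are assumptions, so check the dataset cards for the exact ones.

from datasets import load_dataset

# Assumed split names; see the dataset cards for the actual configurations.
qa_data = load_dataset("NTT-hil-insight/OpenDocVQA", split="train")
corpus = load_dataset("NTT-hil-insight/OpenDocVQA-Corpus", split="train")
print(qa_data)
print(corpus)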
Model Description
The model, NTT-hil-insight/VDocRetriever-Phi3-vision, was trained on the QA pairs in NTT-hil-insight/OpenDocVQA together with the corpus NTT-hil-insight/OpenDocVQA-Corpus, in order to train VDocRAG on top of a vision-language model (microsoft/Phi-3-vision-128k-instruct) for open-domain question answering. NTT-hil-insight/VDocRetriever-Phi3-vision is a bi-encoder model that encodes documents, represented in a unified image format, into dense vectors for document retrieval.
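As a rough intuition for how those dense vectors are produced: the pooling='eos' and normalize=True options used in the usage example below suggest that each query or document image is represented by the hidden state at its final (EOS) token, L2-normalized to unit length. The sketch below illustrates that pooling step only; the function eos_pool and the tensor shapes are illustrative assumptions, not the model's actual implementation.

import torch
import torch.nn.functional as F

def eos_pool(last_hidden_state, attention_mask):
    # last_hidden_state: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
    # Take the hidden state at the last non-padded position (assumed to be the EOS token),
    # then L2-normalize so that dot products equal cosine similarities.
    last_pos = attention_mask.sum(dim=1) - 1
    reps = last_hidden_state[torch.arange(last_hidden_state.size(0)), last_pos]
    return F.normalize(reps, p=2, dim=-1)

# Toy check with random states: 2 sequences, hidden size 8.
hidden = torch.randn(2, 6, 8)
mask = torch.tensor([[1, 1, 1, 1, 0, 0], [1, 1, 1, 1, 1, 1]])
print(eos_pool(hidden, mask).shape)  # torch.Size([2, 8])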
Usage
Here we show an easy way to download our model from the Hugging Face Hub and run it quickly. Make sure to install the dependencies and packages as described in VDocRAG/README.md. To run our full inference pipeline with a generator, please use our code.
from PIL import Image
import requests
from io import BytesIO
from torch.nn.functional import cosine_similarity
import torch
from transformers import AutoProcessor
from vdocrag.vdocretriever.modeling import VDocRetriever
processor = AutoProcessor.from_pretrained('microsoft/Phi-3-vision-128k-instruct', trust_remote_code=True)
model = VDocRetriever.load(
    'microsoft/Phi-3-vision-128k-instruct',
    lora_name_or_path='NTT-hil-insight/VDocRetriever-Phi3-vision',
    pooling='eos',
    normalize=True,
    trust_remote_code=True,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    use_cache=False,
).to('cuda:0')
# Process query inputs and get the embeddings
queries = [
    "Instruct: I’m looking for an image that answers the question.\nQuery: What is the total percentage of Palestinians residing at West Bank?</s>",
    "Instruct: I’m looking for an image that answers the question.\nQuery: How many international visitors came to Japan in 2017?</s>",
]
query_inputs = processor(queries, return_tensors="pt", padding="longest", max_length=256, truncation=True).to('cuda:0')
with torch.no_grad():
    model_output = model(query=query_inputs, use_cache=False)
    query_embeddings = model_output.q_reps
urls = [
    "https://huggingface.co/datasets/NTT-hil-insight/OpenDocVQA/resolve/main/image1.png",
    "https://huggingface.co/datasets/NTT-hil-insight/OpenDocVQA/resolve/main/image2.png",
]
doc_images = [Image.open(BytesIO(requests.get(url).content)).resize((1344, 1344)) for url in urls]
# Process images and get the embeddings
doc_prompt = "<|image_1|>\nWhat is shown in this image?</s>"
collated_list = [
    processor(doc_prompt, images=image, return_tensors="pt", padding="longest", max_length=4096, truncation=True).to('cuda:0')
    for image in doc_images
]
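# Collate the individually processed documents into one batch by stacking each tensor along dim 0.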
doc_inputs = {
    key: torch.stack([item[key][0] for item in collated_list], dim=0)
    for key in ['input_ids', 'attention_mask', 'pixel_values', 'image_sizes']
}
with torch.no_grad():
    model_output = model(document=doc_inputs, use_cache=False)
    doc_embeddings = model_output.p_reps
# Calculate cosine similarity
num_queries = query_embeddings.size(0)
num_passages = doc_embeddings.size(0)
for i in range(num_queries):
    query_embedding = query_embeddings[i].unsqueeze(0)
    similarities = cosine_similarity(query_embedding, doc_embeddings)
    print(f"Similarities for Query {i}: {similarities.cpu().float().numpy()}")

# >> Similarities for Query 0: [0.515625 0.38476562]
#    Similarities for Query 1: [0.37890625 0.5703125 ]
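Because the embeddings are L2-normalized (normalize=True), the dot product between a query vector and a document vector equals their cosine similarity, so retrieval reduces to a matrix multiplication followed by a top-k selection. The following is a minimal sketch continuing from the variables above; top_k is an illustrative parameter, not part of the released code.

# Rank the encoded documents for each query; with a real corpus, doc_embeddings
# would hold the vectors of every page in the collection.
top_k = min(2, doc_embeddings.size(0))                 # illustrative value
scores = query_embeddings @ doc_embeddings.T           # (num_queries, num_docs)
topk_scores, topk_indices = scores.topk(top_k, dim=1)
for i in range(num_queries):
    print(f"Query {i}: top-{top_k} docs {topk_indices[i].tolist()} "
          f"with scores {topk_scores[i].cpu().float().tolist()}")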
License
The models and weights of VDocRAG in this repo are released under the NTT License.
Citation
@inproceedings{tanaka2025vdocrag,
  author    = {Ryota Tanaka and
               Taichi Iki and
               Taku Hasegawa and
               Kyosuke Nishida and
               Kuniko Saito and
               Jun Suzuki},
  title     = {VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents},
  booktitle = {CVPR},
  year      = {2025}
}