VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents
VDocRAG is a new RAG framework that can directly understand diverse real-world documents purely from visual features. It was introduced in the paper VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents by Tanaka et al. and first released in this repository.
Key Enhancements of VDocRAG:
- New Pretraining Tasks: We propose novel self-supervised pre-training tasks (RCR and RCG) that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents (see the conceptual sketch after this list).
- New Dataset: We introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats.
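To make the compression-and-alignment idea in the first bullet concrete, here is a minimal conceptual sketch. It is not the authors' implementation: the mean pooling and the in-batch contrastive (InfoNCE-style) loss are stand-ins for whatever RCR/RCG actually use, and all names below are illustrative.

import torch
import torch.nn.functional as F

def compress(visual_states: torch.Tensor) -> torch.Tensor:
    # visual_states: (batch, num_visual_tokens, dim) hidden states from the VLM.
    # Mean pooling is a placeholder for the learned compression into dense tokens.
    return F.normalize(visual_states.mean(dim=1), dim=-1)

def alignment_loss(visual_states: torch.Tensor, text_embeds: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    # Align each compressed page representation with the embedding of the text
    # contained in the same page, contrasting against other pages in the batch.
    doc_vecs = compress(visual_states)                # (batch, dim)
    text_vecs = F.normalize(text_embeds, dim=-1)      # (batch, dim)
    logits = doc_vecs @ text_vecs.T / tau             # in-batch similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)            # matched pairs sit on the diagonal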
Model Description
The model NTT-hil-insight/VDocGenerator-Phi3-vision was trained on the QA pairs in NTT-hil-insight/OpenDocVQA together with the corpus NTT-hil-insight/OpenDocVQA-Corpus, as part of training VDocRAG with a vision-language model (microsoft/Phi-3-vision-128k-instruct) for open-domain question answering. It is an autoregressive model designed to generate answers based on retrieved document images.
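As a quick way to inspect the training data, you can load the QA pairs and the corpus with the datasets library. This is only a sketch: the default config and split names are assumptions, so check the dataset cards for the exact ones.

from datasets import load_dataset

# Config/split names are assumptions; see the dataset cards for the exact ones.
qa = load_dataset("NTT-hil-insight/OpenDocVQA")
corpus = load_dataset("NTT-hil-insight/OpenDocVQA-Corpus")
print(qa)      # QA pairs used to train VDocGenerator
print(corpus)  # document-image corpus used for retrieval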
Usage
Here we show a quick way to download our model from the Hugging Face Hub and run it. Make sure to install the dependencies and packages as described in VDocRAG/README.md. To run our full inference pipeline with a retrieval system, please use our code.
from PIL import Image
import requests
from io import BytesIO
import torch
from transformers import AutoProcessor
from vdocrag.vdocgenerator.modeling import VDocGenerator
# Load the base VLM and apply the VDocGenerator LoRA adapter
model = VDocGenerator.load('microsoft/Phi-3-vision-128k-instruct',
                           lora_name_or_path='NTT-hil-insight/VDocGenerator-Phi3-vision',
                           trust_remote_code=True,
                           attn_implementation="flash_attention_2",
                           torch_dtype=torch.bfloat16,
                           use_cache=False).to('cuda:0')
processor = AutoProcessor.from_pretrained('microsoft/Phi-3-vision-128k-instruct', trust_remote_code=True)
# Load the retrieved document page image(s); replace the URL with your own retrieved pages
url = "https://example.com/document_page.png"
doc_images = [Image.open(BytesIO(requests.get(url).content)).convert("RGB")]

# Process images with the prompt: one <|image_i|> placeholder per retrieved page, then the question
query = "How many international visitors came to Japan in 2017? \n Answer briefly."
image_tokens = "\n".join([f"<|image_{i+1}|>" for i in range(len(doc_images))])
messages = [{"role": "user", "content": f"{image_tokens}\n{query}"}]
prompt = processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
processed = processor(prompt, images=doc_images, return_tensors="pt").to('cuda:0')
# Generate the answer
generate_ids = model.generate(processed,
generation_args={
"max_new_tokens": 64,
"temperature": 0.0,
"do_sample": False,
"eos_token_id": processor.tokenizer.eos_token_id
})
generate_ids = generate_ids[:, processed['input_ids'].shape[1]:]
response = processor.batch_decode(generate_ids,
skip_special_tokens=True,
clean_up_tokenization_spaces=False)[0].strip()
print("Model prediction: {0}".format(response))
# >> Model prediction: 28.69m
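The snippet above assumes doc_images already holds the retrieved pages. For completeness, here is a generic sketch of the page-selection step, assuming you already have a query embedding and per-page embeddings from a retriever such as NTT-hil-insight/VDocRetriever-Phi3-vision; the embedding tensors and corpus_pages list below are hypothetical placeholders, not part of the vdocrag API.

import torch
from torch.nn.functional import cosine_similarity

# Hypothetical placeholders: one query embedding and one embedding per corpus page,
# produced by whichever retriever you use; shapes are (dim,) and (num_pages, dim).
query_embedding = torch.randn(1024)
page_embeddings = torch.randn(100, 1024)

scores = cosine_similarity(query_embedding.unsqueeze(0), page_embeddings)  # (num_pages,)
topk = torch.topk(scores, k=3).indices.tolist()
# doc_images = [corpus_pages[i] for i in topk]  # pass the top-k page images to VDocGenerator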
License
The models and weights of VDocRAG in this repo are released under the NTT License.
Citation
@inproceedings{tanaka2025vdocrag,
author = {Ryota Tanaka and
Taichi Iki and
Taku Hasegawa and
Kyosuke Nishida and
Kuniko Saito and
Jun Suzuki},
title = {VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents},
booktitle = {CVPR},
year = {2025}
}