VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents

VDocRAG is a new RAG framework that can directly understand diverse real-world documents purely from visual features. It was introduced in the paper VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents by Tanaka et al. and first released in this repository.

Key Enhancements of VDocRAG:

  • New Pretraining Tasks: We propose novel self-supervised pre-training tasks (RCR and RCG) that adapt large vision-language models for retrieval by compressing visual information into dense token representations while aligning them with textual content in documents.
  • New Dataset: We introduce OpenDocVQA, the first unified collection of open-domain document visual question answering datasets, encompassing diverse document types and formats.

Model Description

The model, NTT-hil-insight/VDocRetriever-Phi3-vision-pretrained, is a VDocRetriever checkpoint built on the vision-language model microsoft/Phi-3-vision-128k-instruct. It was pretrained only on the OCR-image pairs in NTT-hil-insight/VDocRetriever-Pretrain-DocStruct.

NTT-hil-insight/VDocRetriever-Phi3-vision is a bi-encoder model that encodes documents, represented in a unified image format, into dense vectors for document retrieval.
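
To illustrate what retrieval with a bi-encoder looks like once queries and document images have been encoded, here is a minimal sketch of the ranking step. It does not load or call VDocRetriever: the vectors are toy placeholders, the names `cosine_similarity`, `rank_documents`, `doc_vecs`, and `query_vec` are hypothetical, and the use of cosine similarity as the scoring function is an assumption rather than something this model card states.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as matching nothing
    return dot / (norm_a * norm_b)


def rank_documents(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k (doc_id, score) pairs, highest score first."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy placeholder embeddings (assumption): in practice these would come from
# the document-image encoder and the query encoder, and would be much larger.
doc_vecs = {
    "page_invoice": [0.9, 0.1, 0.0],
    "page_chart": [0.1, 0.8, 0.3],
    "page_table": [0.2, 0.2, 0.9],
}
query_vec = [0.0, 0.9, 0.2]

results = rank_documents(query_vec, doc_vecs, top_k=2)
for doc_id, score in results:
    print(f"{doc_id}: {score:.3f}")
```

Because the two sides are encoded independently, document vectors can be computed once and stored, so only the query needs encoding at search time.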

License

The models and weights of VDocRAG in this repo are released under the NTT License.

Citation

@inproceedings{tanaka2025vdocrag,
  author    = {Ryota Tanaka and
               Taichi Iki and
               Taku Hasegawa and
               Kyosuke Nishida and
               Kuniko Saito and
               Jun Suzuki},
  title     = {VDocRAG: Retrieval-Augmented Generation over Visually-Rich Documents},
  booktitle = {CVPR},
  year      = {2025}
}