# XDoc
## Introduction
XDoc is a unified pre-trained model that deals with different document formats in a single model. With only 36.7% parameters, XDoc achieves comparable or better performance on downstream tasks, which is cost-effective for real-world deployment.
[XDoc: Unified Pre-training for Cross-Format Document Understanding](https://arxiv.org/abs/2210.02849)
Jingye Chen, Tengchao Lv, Lei Cui, Cha Zhang, Furu Wei, [EMNLP 2022](#)
The overview of our framework is as follows:
## Download
### Pre-trained Model
| Model | Download |
| -------- | -------- |
| xdoc-pretrain-roberta-1M | [xdoc-base](https://huggingface.co/microsoft/xdoc-base) |
### Fine-tuning Models
| Model | Download |
| -------- | -------- |
| xdoc-squad1.1 | [xdoc-squad1.1](https://huggingface.co/microsoft/xdoc-base-squad1.1) |
| xdoc-squad2.0 | [xdoc-squad2.0](https://huggingface.co/microsoft/xdoc-base-squad2.0) |
| xdoc-funsd | [xdoc-funsd](https://huggingface.co/microsoft/xdoc-base-funsd) |
| xdoc-websrc | [xdoc-websrc](https://huggingface.co/microsoft/xdoc-base-websrc) |
## Fine-tune
### SQuAD
The dataset will be **automatically downloaded**. Please refer to ```./fine_tuning/squad/```.
#### Installation
```
pip install -r requirements.txt
```
#### Train
To train XDoc on SQuADv1.1
```bash
CUDA_VISIBLE_DEVICES=0 python run_squad.py \
--model_name_or_path microsoft/xdoc-base \
--dataset_name squad \
--do_train \
--do_eval \
--per_device_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./v1_result \
--overwrite_output_dir
```
To train XDoc on SQuADv2.0
```bash
CUDA_VISIBLE_DEVICES=0 python run_squad.py \
--model_name_or_path microsoft/xdoc-base \
--dataset_name squad_v2 \
--do_train \
--do_eval \
--version_2_with_negative \
--per_device_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 4 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./v2_result \
--overwrite_output_dir
```
#### Test
To test XDoc on SQuADv1.1
```bash
CUDA_VISIBLE_DEVICES=0 python run_squad.py \
--model_name_or_path microsoft/xdoc-base-squad1.1 \
--dataset_name squad \
--do_eval \
--per_device_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 2 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./squadv1.1_result \
--overwrite_output_dir
```
To test XDoc on SQuADv2.0
```bash
CUDA_VISIBLE_DEVICES=0 python run_squad.py \
--model_name_or_path microsoft/xdoc-base-squad2.0 \
--dataset_name squad_v2 \
--do_eval \
--version_2_with_negative \
--per_device_train_batch_size 16 \
--learning_rate 3e-5 \
--num_train_epochs 4 \
--max_seq_length 384 \
--doc_stride 128 \
--output_dir ./squadv2.0_result \
--overwrite_output_dir
```
### FUNSD
The dataset will be **automatically downloaded**. Please refer to ```./fine_tuning/funsd/```.
#### Installation
```bash
pip install -r requirements.txt
```
Also, you need to install ```detectron2```. For example, if you use torch1.8 with cuda version 10.1, you can use the following command
```bash
pip install detectron2 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu101/torch1.8/index.html
```
#### Train
```bash
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 --master_port 5678 run_funsd.py \
--model_name_or_path microsoft/xdoc-base \
--output_dir camera_ready_funsd_1M \
--do_train \
--do_eval \
--max_steps 1000 \
--warmup_ratio 0.1 \
--fp16 \
--overwrite_output_dir \
--seed 42
```
#### Test
```
CUDA_VISIBLE_DEVICES=0 python -m torch.distributed.launch --nproc_per_node=1 --master_port 5678 run_funsd.py \
--model_name_or_path microsoft/xdoc-base-funsd \
--output_dir camera_ready_funsd_1M \
--do_eval \
--max_steps 1000 \
--warmup_ratio 0.1 \
--fp16 \
--overwrite_output_dir \
--seed 42
```
### WebSRC
The dataset will be **manually downloaded**. After downloading, please modify the argument ```--web_train_file```, ```--web_eval_file```, ```web_root_dir```, and ```root_dir``` in args.py.
#### Installation
```bash
pip install -r requirements.txt
```
#### Train
```bash
CUDA_VISIBLE_DEVICES=0 python run_docvqa.py --do_train True --do_eval True --model_name_or_path microsoft/xdoc-base
```
#### Test
```bash
CUDA_VISIBLE_DEVICES=0 python run_docvqa.py --do_train False --do_eval True --model_name_or_path microsoft/xdoc-base-websrc
```
## Result
* To verify the model accuracy, we select the GLUE benchmark and SQuAD to evaluate plain text understanding, FUNSD and DocVQA to evaluate doc-
ument understanding, and WebSRC for web text understanding. Experimental
results have demonstrated that XDoc achieves comparable or even better performance on these tasks.
| Model | MNLI-m | QNLI | SST2 | MRPC | SQUAD1.1/2.0 | FUNSD | DocVQA | WebSRC |
| :----------: | :------: | :----: | :----: | :----: | :------------: | :-----: | :------: | :------: |
| RoBERTa | **87.6** | **92.8** | 94.8 | 90.2 | **92.2**/83.4 | - | - | - |
| LayoutLM | - | - | - | - | - | 79.3 | 69.2 | - |
| MarkupLM | - | - | - | - | - | - | - | 74.5 |
| **XDoc(Ours)** | 86.8 | 92.3 | **95.3** | **91.1** | 92.0/**83.5** | **89.4** | **72.7** | **74.8** |
* With only 36.7% parameters, XDoc achieves comparable or even better performance on a variety of downstream tasks compared with the individual pre-trained models, which is cost effective for real-world deployment.
| Model | Word | 1D Position | Transformer | 2D Position | XPath | Adaptive | Total |
| :----------: | :----: | :-----------: | :-----------: | :-----------: | :-----: | :--------: | :-----: |
| RoBERTa | √ | √ | √ | - | - | - | 128M |
| LayoutLM | √ | √ | √ | √ | - | - | 131M |
| MarkupLM | √ | √ | √ | - | √ | - | 139M |
| **XDoc(Ours)** | √ | √ | √ | √ | √ | √ | 146M |
## Citation
If you find XDoc helpful, please cite us:
```
@article{chen2022xdoc,
title={XDoc: Unified Pre-training for Cross-Format Document Understanding},
author={Chen, Jingye and Lv, Tengchao and Cui, Lei and Zhang, Cha and Wei, Furu},
journal={arXiv preprint arXiv:2210.02849},
year={2022}
}
```
## License
This project is licensed under the license found in the LICENSE file in the root directory of this source tree.
Portions of the source code are based on the [transformers](https://github.com/huggingface/transformers).
[Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct)
## Contact
For help or issues using XDoc, please submit a GitHub issue.
For other communications, please contact Lei Cui (`lecu@microsoft.com`), Furu Wei (`fuwei@microsoft.com`).