# LayoutXLM (Document Foundation Model)
**Multimodal (text + layout/format + image) pre-training for multilingual [Document AI](https://www.microsoft.com/en-us/research/project/document-ai/)**
## Introduction
LayoutXLM is a multimodal pre-trained model for multilingual document understanding, which aims to bridge the language barriers for visually-rich document understanding. Experimental results show that it significantly outperforms the existing SOTA cross-lingual pre-trained models on the XFUND dataset.
[LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836)
Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei, arXiv Preprint 2021
## Models
| Model | Download |
| ----------------- | --------------------------------------------------------------- |
| `layoutxlm-base` | [huggingface](https://huggingface.co/microsoft/layoutxlm-base) |
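The checkpoint is compatible with the Hugging Face `transformers` interface. Below is a minimal loading sketch (not part of the original release), assuming `transformers` with `sentencepiece` installed; LayoutXLM shares the LayoutLMv2 architecture, whose visual backbone additionally requires `detectron2`:
```python
# Minimal loading sketch (assumes transformers + sentencepiece; the LayoutLMv2
# visual backbone behind LayoutXLM also needs detectron2 installed).
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutxlm-base")
model = AutoModel.from_pretrained("microsoft/layoutxlm-base")

# LayoutXLM consumes words plus one 0-1000-normalized bounding box per word.
words = ["Invoice", "No.", "12345"]
boxes = [[48, 84, 156, 98], [160, 84, 188, 98], [192, 84, 260, 98]]
encoding = tokenizer(words, boxes=boxes, return_tensors="pt")
print(encoding["input_ids"].shape, encoding["bbox"].shape)
```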
## Fine-tuning Example on [XFUND](https://github.com/doc-analysis/XFUND)
### Installation
Please refer to [layoutlmft](../layoutlmft/README.md) for installation instructions.
### Fine-tuning for Semantic Entity Recognition
```bash
cd layoutlmft
python -m torch.distributed.launch --nproc_per_node=4 examples/run_xfun_ser.py \
--model_name_or_path microsoft/layoutxlm-base \
--output_dir /tmp/test-ner \
--do_train \
--do_eval \
--lang zh \
--max_steps 1000 \
--warmup_ratio 0.1 \
--fp16
```
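The XFUND loaders in layoutlmft handle preprocessing, but if you prepare your own documents, note that the LayoutLM family expects each word box scaled to a 0-1000 integer grid regardless of page size. The following sketch of that convention uses a hypothetical `normalize_bbox` helper that is not part of layoutlmft:
```python
# Hedged sketch of the LayoutLM-family coordinate convention: pixel boxes are
# scaled into a 0-1000 integer grid, independent of the original page size.
def normalize_bbox(bbox, page_width, page_height):
    """Scale an (x0, y0, x1, y1) pixel box to the 0-1000 range."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

# Example: a word box on a 2480x3508 page (A4 scanned at 300 DPI).
print(normalize_bbox((240, 350, 610, 395), 2480, 3508))  # -> [96, 99, 245, 112]
```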
### Fine-tuning for Relation Extraction
```bash
cd layoutlmft
python -m torch.distributed.launch --nproc_per_node=4 examples/run_xfun_re.py \
--model_name_or_path microsoft/layoutxlm-base \
    --output_dir /tmp/test-re \
--do_train \
--do_eval \
--lang zh \
--max_steps 2500 \
--per_device_train_batch_size 2 \
--warmup_ratio 0.1 \
--fp16
```
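Once either run finishes, the checkpoint written to `--output_dir` can be reloaded for prediction. The sketch below is illustrative rather than the official inference path: it assumes the SER checkpoint from `/tmp/test-ner` above loads through the `transformers` token-classification interface, that `detectron2` is available for the visual backbone, and it substitutes a zero tensor for the real page image.
```python
# Illustrative SER inference sketch; /tmp/test-ner is the --output_dir from the
# SER command above, and the zero image is a placeholder for the page scan.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/layoutxlm-base")
model = AutoModelForTokenClassification.from_pretrained("/tmp/test-ner")
model.eval()

words = ["姓名", "张三"]                             # OCR words from the page
boxes = [[110, 82, 190, 104], [210, 82, 280, 104]]  # 0-1000-normalized boxes
encoding = tokenizer(words, boxes=boxes, return_tensors="pt")
image = torch.zeros(1, 3, 224, 224)                 # placeholder page image

with torch.no_grad():
    logits = model(image=image, **encoding).logits

pred_ids = logits.argmax(-1).squeeze(0).tolist()
print([model.config.id2label[i] for i in pred_ids])  # one BIO tag per token
```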
## Results on [XFUND](https://github.com/doc-analysis/XFUND)
### Language-specific Fine-tuning
| | Model | FUNSD | ZH | JA | ES | FR | IT | DE | PT | Avg. |
| --------------------------- | ------------------ | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| Semantic Entity Recognition | `xlm-roberta-base` | 0.6670 | 0.8774 | 0.7761 | 0.6105 | 0.6743 | 0.6687 | 0.6814 | 0.6818 | 0.7047 |
| | `infoxlm-base` | 0.6852 | 0.8868 | 0.7865 | 0.6230 | 0.7015 | 0.6751 | 0.7063 | 0.7008 | 0.7207 |
| | `layoutxlm-base` | **0.7940** | **0.8924** | **0.7921** | **0.7550** | **0.7902** | **0.8082** | **0.8222** | **0.7903** | **0.8056** |
| Relation Extraction | `xlm-roberta-base` | 0.2659 | 0.5105 | 0.5800 | 0.5295 | 0.4965 | 0.5305 | 0.5041 | 0.3982 | 0.4769 |
| | `infoxlm-base` | 0.2920 | 0.5214 | 0.6000 | 0.5516 | 0.4913 | 0.5281 | 0.5262 | 0.4170 | 0.4910 |
| | `layoutxlm-base` | **0.5483** | **0.7073** | **0.6963** | **0.6896** | **0.6353** | **0.6415** | **0.6551** | **0.5718** | **0.6432** |
### Zero-shot Transfer Learning
| | Model | FUNSD | ZH | JA | ES | FR | IT | DE | PT | Avg. |
| --- | ------------------ | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| SER | `xlm-roberta-base` | 0.6670 | 0.4144 | 0.3023 | 0.3055 | 0.3710 | 0.2767 | 0.3286 | 0.3936 | 0.3824 |
| | `infoxlm-base` | 0.6852 | 0.4408 | 0.3603 | 0.3102 | 0.4021 | 0.2880 | 0.3587 | 0.4502 | 0.4119 |
| | `layoutxlm-base` | **0.7940** | **0.6019** | **0.4715** | **0.4565** | **0.5757** | **0.4846** | **0.5252** | **0.5390** | **0.5561** |
| RE | `xlm-roberta-base` | 0.2659 | 0.1601 | 0.2611 | 0.2440 | 0.2240 | 0.2374 | 0.2288 | 0.1996 | 0.2276 |
| | `infoxlm-base` | 0.2920 | 0.2405 | 0.2851 | 0.2481 | 0.2454 | 0.2193 | 0.2027 | 0.2049 | 0.2423 |
| | `layoutxlm-base` | **0.5483** | **0.4494** | **0.4408** | **0.4708** | **0.4416** | **0.4090** | **0.3820** | **0.3685** | **0.4388** |
### Multitask Fine-tuning
| | Model | FUNSD | ZH | JA | ES | FR | IT | DE | PT | Avg. |
| --- | ------------------ | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| SER | `xlm-roberta-base` | 0.6633 | 0.8830 | 0.7786 | 0.6223 | 0.7035 | 0.6814 | 0.7146 | 0.6726 | 0.7149 |
| | `infoxlm-base` | 0.6538 | 0.8741 | 0.7855 | 0.5979 | 0.7057 | 0.6826 | 0.7055 | 0.6796 | 0.7106 |
| | `layoutxlm-base` | **0.7924** | **0.8973** | **0.7964** | **0.7798** | **0.8173** | **0.8210** | **0.8322** | **0.8241** | **0.8201** |
| RE | `xlm-roberta-base` | 0.3638 | 0.6797 | 0.6829 | 0.6828 | 0.6727 | 0.6937 | 0.6887 | 0.6082 | 0.6341 |
| | `infoxlm-base` | 0.3699 | 0.6493 | 0.6473 | 0.6828 | 0.6831 | 0.6690 | 0.6384 | 0.5763 | 0.6145 |
| | `layoutxlm-base` | **0.6671** | **0.8241** | **0.8142** | **0.8104** | **0.8221** | **0.8310** | **0.7854** | **0.7044** | **0.7823** |
## Citation
If you find LayoutXLM useful in your research, please cite the following paper:
```bibtex
@article{Xu2020LayoutXLMMP,
title = {LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding},
author = {Yiheng Xu and Tengchao Lv and Lei Cui and Guoxin Wang and Yijuan Lu and Dinei Florencio and Cha Zhang and Furu Wei},
year = {2021},
eprint = {2104.08836},
archivePrefix = {arXiv},
primaryClass = {cs.CL}
}
```
## License
The content of this project itself is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.
### Contact Information
For help or issues using LayoutXLM, please submit a GitHub issue.
For other communications related to LayoutXLM, please contact Lei Cui (`[email protected]`) or Furu Wei (`[email protected]`).