# LayoutXLM (Document Foundation Model)
**Multimodal (text + layout/format + image) pre-training for multilingual [Document AI](https://www.microsoft.com/en-us/research/project/document-ai/)**

## Introduction

LayoutXLM is a multimodal pre-trained model for multilingual document understanding, which aims to bridge language barriers for visually-rich document understanding. Experimental results show that it significantly outperforms existing state-of-the-art cross-lingual pre-trained models on the XFUND dataset.

[LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding](https://arxiv.org/abs/2104.08836)
Yiheng Xu, Tengchao Lv, Lei Cui, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Furu Wei, arXiv Preprint 2021

## Models
`layoutxlm-base` | [huggingface](https://huggingface.co/microsoft/layoutxlm-base)
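
LayoutXLM, like other LayoutLM-family models, takes word bounding boxes normalized to a 0–1000 coordinate grid alongside the token ids. A minimal sketch of that normalization (the helper name `normalize_bbox` is our own, not part of the released code):

```python
def normalize_bbox(bbox, page_width, page_height):
    """Scale an (x0, y0, x1, y1) pixel box to the 0-1000 grid
    that LayoutLM-family models expect as layout input."""
    x0, y0, x1, y1 = bbox
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

# A word spanning pixels (100, 50) to (300, 80) on a 600x800 page:
print(normalize_bbox((100, 50, 300, 80), 600, 800))  # [166, 62, 500, 100]
```

In practice this step is handled by the preprocessing in `layoutlmft`; the sketch only illustrates the coordinate convention.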

## Fine-tuning Example on [XFUND](https://github.com/doc-analysis/XFUND)

### Installation

Please refer to [layoutlmft](../layoutlmft/README.md).

### Fine-tuning for Semantic Entity Recognition

```bash
cd layoutlmft
python -m torch.distributed.launch --nproc_per_node=4 examples/run_xfun_ser.py \
        --model_name_or_path microsoft/layoutxlm-base \
        --output_dir /tmp/test-ner \
        --do_train \
        --do_eval \
        --lang zh \
        --max_steps 1000 \
        --warmup_ratio 0.1 \
        --fp16
```
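
The SER numbers reported below are entity-level F1 scores: a predicted entity counts only if both its label and its exact span boundaries match the gold annotation. A simplified illustration of that metric over BIO tag sequences (our own sketch, not the project's evaluation code):

```python
def extract_entities(tags):
    """Collect (label, start, end) spans from a BIO tag sequence."""
    entities, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if start is not None and (
            tag == "O" or tag.startswith("B-") or tag[2:] != label
        ):
            entities.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return entities

def entity_f1(true_tags, pred_tags):
    """Micro-averaged entity-level F1: label and boundaries must both match."""
    gold = set(extract_entities(true_tags))
    pred = set(extract_entities(pred_tags))
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, `entity_f1(["B-QUESTION", "I-QUESTION", "O"], ["B-QUESTION", "O", "O"])` is `0.0`, because the predicted span ends one token early even though the label is right.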

### Fine-tuning for Relation Extraction

```bash
cd layoutlmft
python -m torch.distributed.launch --nproc_per_node=4 examples/run_xfun_re.py \
        --model_name_or_path microsoft/layoutxlm-base \
        --output_dir /tmp/test-ner \
        --do_train \
        --do_eval \
        --lang zh \
        --max_steps 2500 \
        --per_device_train_batch_size 2 \
        --warmup_ratio 0.1 \
        --fp16
```
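
On XFUND, relation extraction is framed as deciding, for each (question, answer) entity pair, whether a key-value link exists between them. A minimal sketch of candidate-pair enumeration (our own simplification, not the released training code):

```python
from itertools import product

def candidate_pairs(entities):
    """Enumerate all (question_id, answer_id) pairs to score for a link.
    `entities` is a list of (entity_id, label) tuples from the SER stage."""
    questions = [eid for eid, label in entities if label == "QUESTION"]
    answers = [eid for eid, label in entities if label == "ANSWER"]
    return list(product(questions, answers))

ents = [(0, "QUESTION"), (1, "ANSWER"), (2, "QUESTION"), (3, "ANSWER")]
print(candidate_pairs(ents))  # [(0, 1), (0, 3), (2, 1), (2, 3)]
```

Each candidate pair is then classified by the model; pairs involving other labels (e.g. headers) are not considered.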

## Results on [XFUND](https://github.com/doc-analysis/XFUND)

### Language-specific Fine-tuning

|                             | Model              | FUNSD      | ZH         | JA         | ES         | FR         | IT         | DE         | PT         | Avg.       |
| --------------------------- | ------------------ | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| Semantic Entity Recognition | `xlm-roberta-base` | 0.667      | 0.8774     | 0.7761     | 0.6105     | 0.6743     | 0.6687     | 0.6814     | 0.6818     | 0.7047     |
|                             | `infoxlm-base`     | 0.6852     | 0.8868     | 0.7865     | 0.6230     | 0.7015     | 0.6751     | 0.7063     | 0.7008     | 0.7207     |
|                             | `layoutxlm-base`   | **0.794**  | **0.8924** | **0.7921** | **0.7550** | **0.7902** | **0.8082** | **0.8222** | **0.7903** | **0.8056** |
| Relation Extraction         | `xlm-roberta-base` | 0.2659     | 0.5105     | 0.5800     | 0.5295     | 0.4965     | 0.5305     | 0.5041     | 0.3982     | 0.4769     |
|                             | `infoxlm-base`     | 0.2920     | 0.5214     | 0.6000     | 0.5516     | 0.4913     | 0.5281     | 0.5262     | 0.4170     | 0.4910     |
|                             | `layoutxlm-base`   | **0.5483** | **0.7073** | **0.6963** | **0.6896** | **0.6353** | **0.6415** | **0.6551** | **0.5718** | **0.6432** |
### Zero-shot Transfer Learning

|     | Model              | FUNSD      | ZH         | JA         | ES         | FR         | IT         | DE         | PT         | Avg.       |
| --- | ------------------ | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| SER | `xlm-roberta-base` | 0.667      | 0.4144     | 0.3023     | 0.3055     | 0.371      | 0.2767     | 0.3286     | 0.3936     | 0.3824     |
|     | `infoxlm-base`     | 0.6852     | 0.4408     | 0.3603     | 0.3102     | 0.4021     | 0.2880     | 0.3587     | 0.4502     | 0.4119     |
|     | `layoutxlm-base`   | **0.794**  | **0.6019** | **0.4715** | **0.4565** | **0.5757** | **0.4846** | **0.5252** | **0.539**  | **0.5561** |
| RE  | `xlm-roberta-base` | 0.2659     | 0.1601     | 0.2611     | 0.2440     | 0.2240     | 0.2374     | 0.2288     | 0.1996     | 0.2276     |
|     | `infoxlm-base`     | 0.2920     | 0.2405     | 0.2851     | 0.2481     | 0.2454     | 0.2193     | 0.2027     | 0.2049     | 0.2423     |
|     | `layoutxlm-base`   | **0.5483** | **0.4494** | **0.4408** | **0.4708** | **0.4416** | **0.4090** | **0.3820** | **0.3685** | **0.4388** |

### Multitask Fine-tuning



|     | Model              | FUNSD      | ZH         | JA         | ES         | FR         | IT         | DE         | PT         | Avg.       |
| --- | ------------------ | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- | ---------- |
| SER | `xlm-roberta-base` | 0.6633     | 0.883      | 0.7786     | 0.6223     | 0.7035     | 0.6814     | 0.7146     | 0.6726     | 0.7149     |
|     | `infoxlm-base`     | 0.6538     | 0.8741     | 0.7855     | 0.5979     | 0.7057     | 0.6826     | 0.7055     | 0.6796     | 0.7106     |
|     | `layoutxlm-base`   | **0.7924** | **0.8973** | **0.7964** | **0.7798** | **0.8173** | **0.821**  | **0.8322** | **0.8241** | **0.8201** |
| RE  | `xlm-roberta-base` | 0.3638     | 0.6797     | 0.6829     | 0.6828     | 0.6727     | 0.6937     | 0.6887     | 0.6082     | 0.6341     |
|     | `infoxlm-base`     | 0.3699     | 0.6493     | 0.6473     | 0.6828     | 0.6831     | 0.6690     | 0.6384     | 0.5763     | 0.6145     |
|     | `layoutxlm-base`   | **0.6671** | **0.8241** | **0.8142** | **0.8104** | **0.8221** | **0.8310** | **0.7854** | **0.7044** | **0.7823** |

## Citation

If you find LayoutXLM useful in your research, please cite the following paper:

```latex
@article{Xu2020LayoutXLMMP,
  title         = {LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding},
  author        = {Yiheng Xu and Tengchao Lv and Lei Cui and Guoxin Wang and Yijuan Lu and Dinei Florencio and Cha Zhang and Furu Wei},
  year          = {2021},
  eprint        = {2104.08836},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL}
}
```

## License

The content of this project itself is licensed under the [Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/) license.

### Contact Information

For help or issues using LayoutXLM, please submit a GitHub issue.

For other communications related to LayoutXLM, please contact Lei Cui (`[email protected]`) or Furu Wei (`[email protected]`).