LayoutLMv2

Multimodal (text + layout/format + image) pre-training for document AI

The documentation of this model in the Transformers library can be found here.

Introduction

LayoutLMv2 is an improved version of LayoutLM with new pre-training tasks to model the interaction among text, layout, and image in a single multi-modal framework. It outperforms strong baselines and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including , including FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.834 → 0.852), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672).

LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding Yang Xu, Yiheng Xu, Tengchao Lv, Lei Cui, Furu Wei, Guoxin Wang, Yijuan Lu, Dinei Florencio, Cha Zhang, Wanxiang Che, Min Zhang, Lidong Zhou, ACL 2021

Downloads last month: 604,382

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for microsoft/layoutlmv2-base-uncased

Finetunes

82 models

Spaces using microsoft/layoutlmv2-base-uncased 17

Collection including microsoft/layoutlmv2-base-uncased

LayoutLM

Collection

The LayoutLM series are Transformer encoders useful for document AI tasks such as invoice parsing, document image classification and DocVQA. • 6 items • Updated May 1, 2025 • 20

Paper for microsoft/layoutlmv2-base-uncased

LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding

Paper • 2012.14740 • Published Dec 29, 2020 • 3