LayoutLMv2: Multi-modal pre-training for visually-rich document understanding

P Xu, X Zhu, DA Clifton - IEEE Transactions on Pattern Analysis …, 2023 - ieeexplore.ieee.org

Transformer is a promising neural network learner, and has achieved great success in
various machine learning tasks. Thanks to the recent prevalence of multimodal applications …

Cited by 361 · Related articles · All 9 versions

Pix2Struct: Screenshot parsing as pretraining for visual language understanding

K Lee, M Joshi, IR Turc, H Hu, F Liu… - International …, 2023 - proceedings.mlr.press

Visually-situated language is ubiquitous—sources range from textbooks with diagrams to
web pages with images and tables, to mobile apps with buttons and forms. Perhaps due to …

Cited by 140 · Related articles · All 7 versions

LayoutLMv3: Pre-training for Document AI with unified text and image masking

Y Huang, T Lv, L Cui, Y Lu, F Wei - Proceedings of the 30th ACM …, 2022 - dl.acm.org

Self-supervised pre-training techniques have achieved remarkable progress in Document
AI. Most multimodal pre-trained models use a masked language modeling objective to learn …

Cited by 312 · Related articles · All 3 versions

LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models

P Xu, W Shao, K Zhang, P Gao, S Liu, M Lei… - arXiv preprint arXiv …, 2023 - arxiv.org

Large Vision-Language Models (LVLMs) have recently played a dominant role in
multimodal vision-language learning. Despite the great success, it lacks a holistic evaluation …

Cited by 101 · Related articles · All 3 versions

OCR-free document understanding transformer

G Kim, T Hong, M Yim, JY Nam, J Park, J Yim… - … on Computer Vision, 2022 - Springer

Understanding document images (e.g., invoices) is a core but challenging task since it
requires complex functions such as reading text and a holistic understanding of the …

Cited by 200 · Related articles · All 6 versions

DocFormer: End-to-end transformer for document understanding

S Appalaraju, B Jasani, BU Kota… - Proceedings of the …, 2021 - openaccess.thecvf.com

We present DocFormer, a multi-modal transformer-based architecture for the task of Visual
Document Understanding (VDU). VDU is a challenging problem which aims to understand …

Cited by 240 · Related articles · All 6 versions

Unifying vision, text, and layout for universal document processing

Z Tang, Z Yang, G Wang, Y Fang… - Proceedings of the …, 2023 - openaccess.thecvf.com

We propose Universal Document Processing (UDOP), a foundation Document AI
model which unifies text, image, and layout modalities together with varied task formats …

Cited by 62 · Related articles · All 6 versions

DiT: Self-supervised pre-training for document image transformer

J Li, Y Xu, T Lv, L Cui, C Zhang, F Wei - Proceedings of the 30th ACM …, 2022 - dl.acm.org

Image Transformer has recently achieved significant progress for natural image
understanding, either using supervised (ViT, DeiT, etc.) or self-supervised (BEiT, MAE, etc.) …

Cited by 120 · Related articles · All 4 versions

A deep-learning system bridging molecule structure and biomedical text with comprehension comparable to human professionals

Z Zeng, Y Yao, Z Liu, M Sun - Nature communications, 2022 - nature.com

To accelerate the biomedical research process, deep-learning systems are developed to
automatically acquire knowledge about molecule entities by reading large-scale biomedical …

Cited by 89 · Related articles · All 10 versions

LiLT: A simple yet effective language-independent layout transformer for structured document understanding

J Wang, L Jin, K Ding - arXiv preprint arXiv:2202.13669, 2022 - arxiv.org

Structured document understanding has attracted considerable attention and made
significant progress recently, owing to its crucial role in intelligent document processing …

Cited by 107 · Related articles · All 5 versions