DocPedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding

H Feng, Q Liu, H Liu, J Tang, W Zhou, H Li… - Science China …, 2024 - Springer
In this work, we present DocPedia, a novel large multimodal model (LMM) for versatile OCR-free document understanding, capable of parsing images up to 2560×2560 resolution …

Kosmos-2.5: A multimodal literate model

T Lv, Y Huang, J Chen, Y Zhao, Y Jia, L Cui… - arXiv preprint arXiv …, 2023 - arxiv.org
The automatic reading of text-intensive images represents a significant advancement toward achieving Artificial General Intelligence (AGI). In this paper, we present KOSMOS-2.5, a …

Transformers and language models in form understanding: A comprehensive review of scanned document analysis

A Abdallah, D Eberharter, Z Pfister, A Jatowt - arXiv preprint arXiv …, 2024 - arxiv.org
This paper presents a comprehensive survey of research works on the topic of form
understanding in the context of scanned documents. We delve into recent advancements …

Exploring OCR capabilities of GPT-4V(ision): A quantitative and in-depth evaluation

Y Shi, D Peng, W Liao, Z Lin, X Chen, C Liu… - arXiv preprint arXiv …, 2023 - arxiv.org
This paper presents a comprehensive evaluation of the Optical Character Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model …

TextMonkey: An OCR-free large multimodal model for understanding document

Y Liu, B Yang, Q Liu, Z Li, Z Ma, S Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks,
including document question answering (DocVQA) and scene text analysis. Our approach …

LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding

C Luo, Y Shen, Z Zhu, Q Zheng… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recently, leveraging large language models (LLMs) or multimodal large language models (MLLMs) for document understanding has been proven very promising. However, previous …

OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition

J Wan, S Song, W Yu, Y Liu, W Cheng… - Proceedings of the …, 2024 - openaccess.thecvf.com
Recently, visually-situated text parsing (VsTP) has experienced notable advancements, driven by the increasing demand for automated document understanding and the …

TPS++: Attention-enhanced thin-plate spline for scene text recognition

T Zheng, Z Chen, J Bai, H Xie, YG Jiang - arXiv preprint arXiv:2305.05322, 2023 - arxiv.org
Text irregularities pose significant challenges to scene text recognizers. Thin-Plate Spline
(TPS)-based rectification is widely regarded as an effective means to deal with them …

Deep Learning based Visually Rich Document Content Understanding: A Survey

Y Ding, J Lee, SC Han - arXiv preprint arXiv:2408.01287, 2024 - arxiv.org
Visually Rich Documents (VRDs) are essential in academia, finance, medical fields, and
marketing due to their multimodal information content. Traditional methods for extracting …

GridFormer: Towards accurate table structure recognition via grid prediction

P Lyu, W Ma, H Wang, Y Yu, C Zhang, K Yao… - Proceedings of the 31st …, 2023 - dl.acm.org
All tables can be represented as grids. Based on this observation, we propose GridFormer, a
novel approach for interpreting unconstrained table structures by predicting the vertex and …