DocPedia: Unleashing the power of large multimodal model in the frequency domain for versatile document understanding
In this work, we present DocPedia, a novel large multimodal model (LMM) for versatile OCR-
free document understanding, capable of parsing images up to 2560× 2560 resolution …
Kosmos-2.5: A multimodal literate model
The automatic reading of text-intensive images represents a significant advancement toward
achieving Artificial General Intelligence (AGI). In this paper we present KOSMOS-2.5, a …
Transformers and language models in form understanding: A comprehensive review of scanned document analysis
This paper presents a comprehensive survey of research works on the topic of form
understanding in the context of scanned documents. We delve into recent advancements …
Exploring OCR capabilities of GPT-4V(ision): A quantitative and in-depth evaluation
This paper presents a comprehensive evaluation of the Optical Character Recognition
(OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model …
TextMonkey: An OCR-free large multimodal model for understanding document
Y Liu, B Yang, Q Liu, Z Li, Z Ma, S Zhang… - arXiv preprint arXiv …, 2024 - arxiv.org
We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks,
including document question answering (DocVQA) and scene text analysis. Our approach …
LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding
Recently leveraging large language models (LLMs) or multimodal large language models
(MLLMs) for document understanding has been proven very promising. However previous …
OmniParser: A Unified Framework for Text Spotting, Key Information Extraction and Table Recognition
Recently visually-situated text parsing (VsTP) has experienced notable advancements
driven by the increasing demand for automated document understanding and the …
TPS++: Attention-enhanced thin-plate spline for scene text recognition
Text irregularities pose significant challenges to scene text recognizers. Thin-Plate Spline
(TPS)-based rectification is widely regarded as an effective means to deal with them …
Deep Learning based Visually Rich Document Content Understanding: A Survey
Visually Rich Documents (VRDs) are essential in academia, finance, medical fields, and
marketing due to their multimodal information content. Traditional methods for extracting …
GridFormer: Towards accurate table structure recognition via grid prediction
All tables can be represented as grids. Based on this observation, we propose GridFormer, a
novel approach for interpreting unconstrained table structures by predicting the vertex and …