ICDAR2019 robust reading challenge on arbitrary-shaped text - RRC-ArT

Text recognition in the wild: A survey

X Chen, L Jin, Y Zhu, C Luo, T Wang - ACM Computing Surveys (CSUR), 2021 - dl.acm.org

The history of text can be traced back over thousands of years. Rich and precise semantic
information carried by text is important in a wide range of vision-based application …

Cited by 253 · Related articles · All 5 versions

How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer

In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

Cited by 311 · Related articles · All 2 versions

DeepSeek-VL: towards real-world vision-language understanding

H Lu, W Liu, B Zhang, B Wang, K Dong, B Liu… - arXiv preprint arXiv …, 2024 - arxiv.org

We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-
world vision and language understanding applications. Our approach is structured around …

Cited by 176 · Related articles · All 4 versions

Text detection, recognition, and script identification in natural scene images: A Review

V Naosekpam, N Sahu - International Journal of Multimedia Information …, 2022 - Springer

Text in natural scene images plays a vital role in scene understanding. It contains a rich and
abundant amount of valuable semantic information useful in many applications such as …

Cited by 32 · Related articles

Scene text recognition with permuted autoregressive sequence models

D Bautista, R Atienza - European conference on computer vision, 2022 - Springer

Context-aware STR methods typically use internal autoregressive (AR) language models
(LM). Inherent limitations of AR models motivated two-stage methods which employ an …

Cited by 198 · Related articles · All 6 versions

InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output

P Zhang, X Dong, Y Zang, Y Cao, R Qian… - arXiv preprint arXiv …, 2024 - arxiv.org

We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …

Cited by 61 · Related articles · All 3 versions

SwinTextSpotter: Scene text spotting via better synergy between text detection and text recognition

M Huang, Y Liu, Z Peng, C Liu, D Lin… - Proceedings of the …, 2022 - openaccess.thecvf.com

End-to-end scene text spotting has attracted great attention in recent years due to the
success of excavating the intrinsic synergy of the scene text detection and recognition …

Cited by 137 · Related articles · All 6 versions

ABCNet: Real-time scene text spotting with adaptive Bezier-curve network

Y Liu, H Chen, C Shen, T He, L Jin… - Proceedings of the …, 2020 - openaccess.thecvf.com

Scene text detection and recognition has received increasing research attention. Existing
methods can be roughly categorized into two groups: character-based and segmentation …

Cited by 451 · Related articles · All 8 versions

Turning a CLIP model into a scene text detector

W Yu, Y Liu, W Hua, D Jiang… - Proceedings of the …, 2023 - openaccess.thecvf.com

The recent large-scale Contrastive Language-Image Pretraining (CLIP) model has shown
great potential in various downstream tasks via leveraging the pretrained vision and …

Cited by 71 · Related articles · All 7 versions

Towards end-to-end unified scene text detection and layout analysis

S Long, S Qin, D Panteleev… - Proceedings of the …, 2022 - openaccess.thecvf.com

Scene text detection and document layout analysis have long been treated as two separate
tasks in different image domains. In this paper, we bring them together and introduce the …

Cited by 98 · Related articles · All 8 versions