ICDAR 2019 competition on large-scale street view text with partial labeling-RRC-LSVT

J Kaur, W Singh - Multimedia Tools and Applications, 2022 - Springer

Object detection is one of the most fundamental and challenging tasks to locate objects in
images and videos. Over the past, it has gained much attention to do more research on …

被引用次数：140 相关文章所有 8 个版本

[PDF] arxiv.org

Text recognition in the wild: A survey

X Chen, L Jin, Y Zhu, C Luo, T Wang - ACM Computing Surveys (CSUR), 2021 - dl.acm.org

The history of text can be traced back over thousands of years. Rich and precise semantic
information carried by text is important in a wide range of vision-based application …

被引用次数：253 相关文章所有 5 个版本

[PDF] arxiv.org

How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - Science China …, 2024 - Springer

In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

被引用次数：311 相关文章所有 2 个版本

[PDF] arxiv.org

Deepseek-vl: towards real-world vision-language understanding

H Lu, W Liu, B Zhang, B Wang, K Dong, B Liu… - arXiv preprint arXiv …, 2024 - arxiv.org

We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-
world vision and language understanding applications. Our approach is structured around …

被引用次数：176 相关文章所有 4 个版本

[PDF] thecvf.com

Evalcrafter: Benchmarking and evaluating large video generation models

Y Liu, X Cun, X Liu, X Wang, Y Zhang… - Proceedings of the …, 2024 - openaccess.thecvf.com

The vision and language generative models have been overgrown in recent years. For
video generation various open-sourced models and public-available services have been …

被引用次数：86 相关文章所有 3 个版本

[PDF] arxiv.org

Scene text recognition with permuted autoregressive sequence models

D Bautista, R Atienza - European conference on computer vision, 2022 - Springer

Context-aware STR methods typically use internal autoregressive (AR) language models
(LM). Inherent limitations of AR models motivated two-stage methods which employ an …

被引用次数：198 相关文章所有 6 个版本

[PDF] arxiv.org

Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output

P Zhang, X Dong, Y Zang, Y Cao, R Qian… - arXiv preprint arXiv …, 2024 - arxiv.org

We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …

被引用次数：61 相关文章所有 3 个版本

[PDF] arxiv.org

Textdiffuser-2: Unleashing the power of language models for text rendering

J Chen, Y Huang, T Lv, L Cui, Q Chen, F Wei - European Conference on …, 2025 - Springer

The diffusion model has been proven a powerful generative model in recent years, yet it
remains a challenge in generating visual text. Although existing work has endeavored to …

被引用次数：43 相关文章所有 2 个版本

[PDF] thecvf.com

Hrs-bench: Holistic, reliable and scalable benchmark for text-to-image models

EM Bakr, P Sun, X Shen, FF Khan… - Proceedings of the …, 2023 - openaccess.thecvf.com

Designing robust text-to-image (T2I) models have been extensively explored in recent years,
especially with the emergence of diffusion models, which achieves state-of-the-art results on …

被引用次数：57 相关文章所有 7 个版本

[PDF] thecvf.com

Swintextspotter: Scene text spotting via better synergy between text detection and text recognition

M Huang, Y Liu, Z Peng, C Liu, D Lin… - proceedings of the …, 2022 - openaccess.thecvf.com

End-to-end scene text spotting has attracted great attention in recent years due to the
success of excavating the intrinsic synergy of the scene text detection and recognition …

被引用次数：137 相关文章所有 6 个版本