Tools, techniques, datasets and application areas for object detection in an image: a review
J Kaur, W Singh - Multimedia Tools and Applications, 2022 - Springer
Object detection is one of the most fundamental and challenging tasks to locate objects in
images and videos. Over the past, it has gained much attention to do more research on …
Text recognition in the wild: A survey
The history of text can be traced back over thousands of years. Rich and precise semantic
information carried by text is important in a wide range of vision-based application …
How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites
In this paper, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …
DeepSeek-VL: towards real-world vision-language understanding
We present DeepSeek-VL, an open-source Vision-Language (VL) Model designed for real-
world vision and language understanding applications. Our approach is structured around …
EvalCrafter: Benchmarking and evaluating large video generation models
The vision and language generative models have been overgrown in recent years. For
video generation various open-sourced models and public-available services have been …
Scene text recognition with permuted autoregressive sequence models
D Bautista, R Atienza - European conference on computer vision, 2022 - Springer
Context-aware STR methods typically use internal autoregressive (AR) language models
(LM). Inherent limitations of AR models motivated two-stage methods which employ an …
InternLM-XComposer-2.5: A versatile large vision language model supporting long-contextual input and output
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …
TextDiffuser-2: Unleashing the power of language models for text rendering
The diffusion model has been proven a powerful generative model in recent years, yet it
remains a challenge in generating visual text. Although existing work has endeavored to …
HRS-Bench: Holistic, reliable and scalable benchmark for text-to-image models
Designing robust text-to-image (T2I) models have been extensively explored in recent years,
especially with the emergence of diffusion models, which achieves state-of-the-art results on …
SwinTextSpotter: Scene text spotting via better synergy between text detection and text recognition
End-to-end scene text spotting has attracted great attention in recent years due to the
success of excavating the intrinsic synergy of the scene text detection and recognition …