How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites

Z Chen, W Wang, H Tian, S Ye, Z Gao, E Cui… - arXiv preprint arXiv …, 2024 - arxiv.org
In this report, we introduce InternVL 1.5, an open-source multimodal large language model
(MLLM) to bridge the capability gap between open-source and proprietary commercial …

Towards robust tampered text detection in document image: New dataset and new solution

C Qu, C Liu, Y Liu, X Chen, D Peng… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recently, tampered text detection in document images has attracted increasing attention
due to its essential role in information security. However, detecting visually consistent …

Modeling entities as semantic points for visual information extraction in the wild

Z Yang, R Long, P Wang, S Song… - Proceedings of the …, 2023 - openaccess.thecvf.com
Recently, Visual Information Extraction (VIE) has been becoming increasingly
important in both academia and industry, due to the wide range of real-world applications …

UAV localization in low-altitude GNSS-denied environments based on POI and store signage text matching in UAV images

Y Liu, J Bai, G Wang, X Wu, F Sun, Z Guo, H Geng - Drones, 2023 - mdpi.com
Localization is the most important basic information for unmanned aerial vehicles (UAV)
during their missions. Currently, most UAVs use GNSS to calculate their own position …

Bridging the Gap Between End-to-End and Two-Step Text Spotting

M Huang, H Li, Y Liu, X Bai… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Modularity plays a crucial role in the development and maintenance of complex systems.
While end-to-end text spotting efficiently mitigates the issues of error accumulation and sub …

Context perception parallel decoder for scene text recognition

Y Du, Z Chen, C Jia, X Yin, C Li, Y Du… - arXiv preprint arXiv …, 2023 - arxiv.org
Scene text recognition (STR) methods have struggled to attain high accuracy and fast
inference speed. Autoregressive (AR)-based STR model uses the previously recognized …

Toward real text manipulation detection: New dataset and new solution

D Luo, Y Liu, R Yang, X Liu, J Zeng, Y Zhou, X Bai - Pattern Recognition, 2025 - Elsevier
With the surge in realistic text tampering, detecting fraudulent text in images has gained
prominence for maintaining information security. However, the high costs associated with …

ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting

C Duan, P Fu, S Guo, Q Jiang… - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
In recent years, text-image joint pre-training techniques have shown promising results in
various tasks. However, in Optical Character Recognition (OCR) tasks, aligning text instances …

A Comprehensive Framework for Industrial Sticker Information Recognition Using Advanced OCR and Object Detection Techniques

G Monteiro, L Camelo, G Aquino, RA Fernandes… - Applied Sciences, 2023 - mdpi.com
Recent advancements in Artificial Intelligence (AI), deep learning (DL), and computer vision
have revolutionized various industrial processes through image classification and object …

ESGNet: A multimodal network model incorporating entity semantic graphs for information extraction from Chinese resumes

S Luo, J Yu - Information Processing & Management, 2024 - Elsevier
Corporations require screening critical information from numerous resumes with different
formats and content for managerial decision-making. However, traditional manual screening …