Visuals to text: A comprehensive review on automatic image captioning

Y Ming, N Hu, C Fan, F Feng… - IEEE/CAA Journal of …, 2022 - researchportal.port.ac.uk
Image captioning refers to the automatic generation of descriptive texts according to the visual
content of images. It is a technique integrating multiple disciplines including the computer …
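
As a quick orientation to the task the survey covers, below is a minimal encoder-decoder captioning sketch in PyTorch; the tiny CNN encoder, GRU decoder, vocabulary size, and token ids are illustrative placeholders rather than any model discussed in the review.

```python
# A minimal, untrained encoder-decoder captioner sketch (assumed architecture,
# not from the survey): a CNN image encoder conditions a GRU decoder that
# emits tokens greedily. Vocabulary and dimensions are illustrative.
import torch
import torch.nn as nn

class ToyCaptioner(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=128, hidden_dim=256):
        super().__init__()
        # Image encoder: a tiny conv stack pooled to a single feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    @torch.no_grad()
    def caption(self, image, bos_id=1, eos_id=2, max_len=20):
        h = self.encoder(image).unsqueeze(0)            # (1, B, H) initial GRU state
        token = torch.full((image.size(0), 1), bos_id, dtype=torch.long)
        tokens = []
        for _ in range(max_len):
            emb = self.embed(token)                     # (B, 1, E)
            out, h = self.gru(emb, h)
            token = self.out(out[:, -1]).argmax(-1, keepdim=True)  # greedy next token
            tokens.append(token)
            if (token == eos_id).all():
                break
        return torch.cat(tokens, dim=1)                 # (B, <=max_len) token ids

model = ToyCaptioner()
ids = model.caption(torch.randn(1, 3, 224, 224))
print(ids.shape)
```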

Making the most of text semantics to improve biomedical vision–language processing

B Boecking, N Usuyama, S Bannur, DC Castro… - European conference on …, 2022 - Springer
Multi-modal data abounds in biomedicine, such as radiology images and reports.
Interpreting this data at scale is essential for improving clinical care and accelerating clinical …

Open-vocabulary object detection using captions

A Zareian, KD Rosa, DH Hu… - Proceedings of the …, 2021 - openaccess.thecvf.com
Despite the remarkable accuracy of deep neural networks in object detection, they are costly
to train and scale due to supervision requirements. Particularly, learning more object …
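
The open-vocabulary idea of replacing fixed classifier weights with text embeddings of class names can be illustrated with a small sketch; the random region features, embedding dimension, and temperature below are stand-ins, not the paper's actual components.

```python
# Minimal sketch of the zero-shot classification head assumed by
# caption-based open-vocabulary detectors: region features are scored against
# text embeddings of arbitrary class names, so novel classes only require new
# name embeddings. All tensors here are random stand-ins.
import torch
import torch.nn.functional as F

def classify_regions(region_feats, class_embeds, temperature=0.07):
    """region_feats: (R, D) proposal features; class_embeds: (C, D) text
    embeddings of class names. Returns (R, C) probabilities."""
    r = F.normalize(region_feats, dim=-1)
    c = F.normalize(class_embeds, dim=-1)
    logits = r @ c.t() / temperature        # cosine similarity, temperature-scaled
    return logits.softmax(dim=-1)

regions = torch.randn(10, 512)              # e.g. RoI-pooled proposal features
classes = torch.randn(5, 512)               # embeddings for 5 class names (can include unseen ones)
probs = classify_regions(regions, classes)
print(probs.argmax(dim=-1))                 # predicted class index per region
```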

ScanRefer: 3D object localization in RGB-D scans using natural language

DZ Chen, AX Chang, M Nießner - European conference on computer …, 2020 - Springer
We introduce the task of 3D object localization in RGB-D scans using natural language
descriptions. As input, we assume a point cloud of a scanned 3D scene along with a free …

Counterfactual contrastive learning for weakly-supervised vision-language grounding

Z Zhang, Z Zhao, Z Lin, X He - Advances in Neural …, 2020 - proceedings.neurips.cc
Weakly-supervised vision-language grounding aims to localize a target moment in a video
or a specific region in an image according to the given sentence query, where only video …
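
One way to picture the counterfactual-contrastive idea is a margin loss that contrasts a query-proposal alignment score with the score obtained after masking the most-attended proposals; the sketch below is a simplification under that assumption, not the paper's exact objective.

```python
# Hedged sketch of the counterfactual-negative idea in weakly-supervised
# grounding (a simplification, not the paper's formulation): masking the
# proposals the model attends to most should hurt the query-proposal alignment
# score, so the damaged ("counterfactual") sample is used as a hard negative
# in a margin loss.
import torch
import torch.nn.functional as F

def alignment_score(query, proposals):
    """query: (B, D); proposals: (B, N, D). Attention-pool proposals with the
    query, then score the pooled feature against the query."""
    attn = torch.softmax((proposals @ query.unsqueeze(-1)).squeeze(-1), dim=-1)  # (B, N)
    pooled = (attn.unsqueeze(-1) * proposals).sum(dim=1)                         # (B, D)
    return F.cosine_similarity(pooled, query, dim=-1), attn

def counterfactual_contrastive_loss(query, proposals, k=2, margin=0.2):
    score_pos, attn = alignment_score(query, proposals)
    # Build counterfactual negatives by zeroing the k most-attended proposals.
    topk = attn.topk(k, dim=-1).indices                                          # (B, k)
    mask = torch.ones_like(attn).scatter(1, topk, 0.0).unsqueeze(-1)             # (B, N, 1)
    score_neg, _ = alignment_score(query, proposals * mask)
    return F.relu(margin - (score_pos - score_neg)).mean()

loss = counterfactual_contrastive_loss(torch.randn(4, 256), torch.randn(4, 20, 256))
print(loss.item())
```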

More grounded image captioning by distilling image-text matching model

Y Zhou, M Wang, D Liu, Z Hu… - Proceedings of the …, 2020 - openaccess.thecvf.com
Visual attention not only improves the performance of image captioners, but also serves as a
visual interpretation to qualitatively measure the caption rationality and model transparency …
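
The distillation idea can be sketched as a KL term that pulls the captioner's per-word attention over regions toward alignment scores from a pretrained image-text matching teacher; the tensor shapes and loss form below are assumptions for illustration, not the paper's exact objective.

```python
# A minimal sketch of attention distillation for more grounded captioning
# (generic form, not the paper's exact objective): the captioner's per-word
# attention over image regions is pushed toward region-word alignment scores
# produced by a pretrained image-text matching teacher, via a KL term added to
# the usual captioning loss.
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attn, teacher_align, eps=1e-8):
    """student_attn: (B, T, N) captioner attention over N regions per word.
    teacher_align: (B, T, N) region-word alignment scores from the matcher."""
    teacher = torch.softmax(teacher_align, dim=-1)
    student_log = torch.log(student_attn + eps)
    return F.kl_div(student_log, teacher, reduction="batchmean")

student = torch.softmax(torch.randn(2, 12, 36), dim=-1)   # attention per generated word
teacher = torch.randn(2, 12, 36)                          # teacher alignment logits
print(attention_distillation_loss(student, teacher).item())
```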

Referring image segmentation using text supervision

F Liu, Y Liu, Y Kong, K Xu, L Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Existing Referring Image Segmentation (RIS) methods typically require expensive
pixel-level or box-level annotations for supervision. In this paper, we observe that the …

What does BERT with vision look at?

LH Li, M Yatskar, D Yin, CJ Hsieh… - Proceedings of the 58th …, 2020 - aclanthology.org
Pre-trained visually grounded language models such as ViLBERT, LXMERT, and UNITER
have achieved significant performance improvement on vision-and-language tasks but what …
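
The kind of probing such an analysis relies on can be sketched as plain tensor bookkeeping: given one layer's attention over a sequence laid out as text tokens followed by region tokens, read off which region each word attends to most; the layout and sizes below are assumed, not tied to any particular pretrained checkpoint.

```python
# Sketch of attention probing for visual grounding (generic tensor
# manipulation, not a specific model's API): slice out the text-to-region part
# of a layer's attention, average over heads, and report the most-attended
# region and the attention entropy per word.
import torch

def probe_attention(attn, num_text, num_regions):
    """attn: (heads, L, L) attention over a sequence laid out as
    [text tokens, region tokens]. Returns per-word argmax region and entropy."""
    text_to_region = attn[:, :num_text, num_text:num_text + num_regions]  # (H, T, N)
    avg = text_to_region.mean(dim=0)                                      # average heads
    avg = avg / avg.sum(dim=-1, keepdim=True).clamp_min(1e-8)             # renormalize
    entropy = -(avg * avg.clamp_min(1e-8).log()).sum(dim=-1)
    return avg.argmax(dim=-1), entropy

attn = torch.rand(12, 56, 56).softmax(dim=-1)   # e.g. 20 word + 36 region tokens
best_region, ent = probe_attention(attn, num_text=20, num_regions=36)
print(best_region, ent)
```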

Pseudo-Q: Generating pseudo language queries for visual grounding

H Jiang, Y Lin, D Han, S Song… - Proceedings of the …, 2022 - openaccess.thecvf.com
Visual grounding, i.e., localizing objects in images according to natural language queries, is
an important topic in visual language understanding. The most effective approaches for this …
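
The general recipe of turning detector outputs into free training queries can be sketched with simple templates; the template strings, the spatial_phrase helper, and the detection dictionary below are hypothetical stand-ins, not the paper's actual generation module.

```python
# Hedged sketch of template-based pseudo query generation (illustrative of the
# general idea, not the paper's exact pipeline): object labels, attributes and
# coarse spatial positions from an off-the-shelf detector are slotted into
# language templates to create query-region training pairs for free.
import random

TEMPLATES = [
    "{attr} {label}",
    "the {label} {position}",
    "the {attr} {label} {position}",
]

def spatial_phrase(box, image_width):
    """Coarse horizontal phrase from an (x1, y1, x2, y2) box."""
    cx = (box[0] + box[2]) / 2
    if cx < image_width / 3:
        return "on the left"
    if cx > 2 * image_width / 3:
        return "on the right"
    return "in the middle"

def pseudo_query(detection, image_width=640):
    """detection: dict with a class label, optional attribute, and a box."""
    filled = random.choice(TEMPLATES).format(
        label=detection["label"],
        attr=detection.get("attribute", ""),
        position=spatial_phrase(detection["box"], image_width),
    )
    return " ".join(filled.split())   # collapse double spaces from empty slots

det = {"label": "dog", "attribute": "brown", "box": (40, 80, 200, 300)}
print(pseudo_query(det))   # e.g. "the brown dog on the left"
```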

Contrastive learning for weakly supervised phrase grounding

T Gupta, A Vahdat, G Chechik, X Yang, J Kautz… - … on Computer Vision, 2020 - Springer
Phrase grounding, the problem of associating image regions to caption words, is a crucial
component of vision-language tasks. We show that phrase grounding can be learned by …
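
A generic version of learning grounding from a contrastive objective looks like the InfoNCE sketch below, where word-region attention is learned as a by-product of matching captions to their images within a batch; this is a simplified recipe under assumed tensor shapes, not the paper's exact formulation.

```python
# Hedged sketch of weakly supervised phrase grounding via an InfoNCE-style
# objective: each caption word attends over region features, the attended
# similarities are pooled into a caption-image compatibility, and matching
# pairs in a batch are contrasted against mismatched ones. The learned
# word-region attention is what gets read out as grounding.
import torch
import torch.nn.functional as F

def compatibility(words, regions):
    """words: (B, T, D), regions: (B, N, D) -> (B, B) caption-image scores and
    (B, B, T, N) word-region attention."""
    # Pairwise word-region similarities for every caption/image combination.
    sim = torch.einsum("btd,ind->bitn", words, regions)       # (B_cap, B_img, T, N)
    attn = sim.softmax(dim=-1)
    word_scores = (attn * sim).sum(dim=-1)                    # (B, B, T) attended score per word
    return word_scores.mean(dim=-1), attn                     # pool over words

def infonce_loss(words, regions, temperature=0.1):
    scores, _ = compatibility(words, regions)                 # (B, B)
    targets = torch.arange(scores.size(0))                    # matching pair on the diagonal
    return F.cross_entropy(scores / temperature, targets)

loss = infonce_loss(torch.randn(8, 12, 256), torch.randn(8, 36, 256))
print(loss.item())
```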