Visuals to text: A comprehensive review on automatic image captioning

Y Ming, N Hu, C Fan, F Feng… - IEEE/CAA Journal of …, 2022 - researchportal.port.ac.uk
Image captioning refers to the automatic generation of descriptive texts according to the visual
content of images. It is a technique integrating multiple disciplines including the computer …
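
As a quick orientation to the task the survey covers, below is a minimal encoder-decoder captioning sketch in PyTorch; the tiny CNN encoder, GRU decoder, vocabulary size, and token ids are illustrative placeholders rather than any model discussed in the review.

```python
# A minimal, untrained encoder-decoder captioner sketch (assumed architecture,
# not from the survey): a CNN image encoder conditions a GRU decoder that
# emits tokens greedily. Vocabulary and dimensions are illustrative.
import torch
import torch.nn as nn

class ToyCaptioner(nn.Module):
    def __init__(self, vocab_size=1000, embed_dim=128, hidden_dim=256):
        super().__init__()
        # Image encoder: a tiny conv stack pooled to a single feature vector.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden_dim),
        )
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    @torch.no_grad()
    def caption(self, image, bos_id=1, eos_id=2, max_len=20):
        h = self.encoder(image).unsqueeze(0)            # (1, B, H) initial GRU state
        token = torch.full((image.size(0), 1), bos_id, dtype=torch.long)
        tokens = []
        for _ in range(max_len):
            emb = self.embed(token)                     # (B, 1, E)
            out, h = self.gru(emb, h)
            token = self.out(out[:, -1]).argmax(-1, keepdim=True)  # greedy next token
            tokens.append(token)
            if (token == eos_id).all():
                break
        return torch.cat(tokens, dim=1)                 # (B, <=max_len) token ids

model = ToyCaptioner()
ids = model.caption(torch.randn(1, 3, 224, 224))
print(ids.shape)
```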

Making the most of text semantics to improve biomedical vision–language processing

B Boecking, N Usuyama, S Bannur, DC Castro… - European conference on …, 2022 - Springer
Multi-modal data abounds in biomedicine, such as radiology images and reports.
Interpreting this data at scale is essential for improving clinical care and accelerating clinical …

Open-vocabulary object detection using captions

A Zareian, KD Rosa, DH Hu… - Proceedings of the …, 2021 - openaccess.thecvf.com
Despite the remarkable accuracy of deep neural networks in object detection, they are costly
to train and scale due to supervision requirements. Particularly, learning more object …
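
The open-vocabulary idea of replacing fixed classifier weights with text embeddings of class names can be illustrated with a small sketch; the random region features, embedding dimension, and temperature below are stand-ins, not the paper's actual components.

```python
# Minimal sketch of the zero-shot classification head assumed by
# caption-based open-vocabulary detectors: region features are scored against
# text embeddings of arbitrary class names, so novel classes only require new
# name embeddings. All tensors here are random stand-ins.
import torch
import torch.nn.functional as F

def classify_regions(region_feats, class_embeds, temperature=0.07):
    """region_feats: (R, D) proposal features; class_embeds: (C, D) text
    embeddings of class names. Returns (R, C) probabilities."""
    r = F.normalize(region_feats, dim=-1)
    c = F.normalize(class_embeds, dim=-1)
    logits = r @ c.t() / temperature        # cosine similarity, temperature-scaled
    return logits.softmax(dim=-1)

regions = torch.randn(10, 512)              # e.g. RoI-pooled proposal features
classes = torch.randn(5, 512)               # embeddings for 5 class names (can include unseen ones)
probs = classify_regions(regions, classes)
print(probs.argmax(dim=-1))                 # predicted class index per region
```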

ScanRefer: 3D object localization in RGB-D scans using natural language

DZ Chen, AX Chang, M Nießner - European conference on computer …, 2020 - Springer
We introduce the task of 3D object localization in RGB-D scans using natural language
descriptions. As input, we assume a point cloud of a scanned 3D scene along with a free …

Counterfactual contrastive learning for weakly-supervised vision-language grounding

Z Zhang, Z Zhao, Z Lin, X He - Advances in Neural …, 2020 - proceedings.neurips.cc
Weakly-supervised vision-language grounding aims to localize a target moment in a video
or a specific region in an image according to the given sentence query, where only video …
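
One way to picture the counterfactual-contrastive idea is a margin loss that contrasts a query-proposal alignment score with the score obtained after masking the most-attended proposals; the sketch below is a simplification under that assumption, not the paper's exact objective.

```python
# Hedged sketch of the counterfactual-negative idea in weakly-supervised
# grounding (a simplification, not the paper's formulation): masking the
# proposals the model attends to most should hurt the query-proposal alignment
# score, so the damaged ("counterfactual") sample is used as a hard negative
# in a margin loss.
import torch
import torch.nn.functional as F

def alignment_score(query, proposals):
    """query: (B, D); proposals: (B, N, D). Attention-pool proposals with the
    query, then score the pooled feature against the query."""
    attn = torch.softmax((proposals @ query.unsqueeze(-1)).squeeze(-1), dim=-1)  # (B, N)
    pooled = (attn.unsqueeze(-1) * proposals).sum(dim=1)                         # (B, D)
    return F.cosine_similarity(pooled, query, dim=-1), attn

def counterfactual_contrastive_loss(query, proposals, k=2, margin=0.2):
    score_pos, attn = alignment_score(query, proposals)
    # Build counterfactual negatives by zeroing the k most-attended proposals.
    topk = attn.topk(k, dim=-1).indices                                          # (B, k)
    mask = torch.ones_like(attn).scatter(1, topk, 0.0).unsqueeze(-1)             # (B, N, 1)
    score_neg, _ = alignment_score(query, proposals * mask)
    return F.relu(margin - (score_pos - score_neg)).mean()

loss = counterfactual_contrastive_loss(torch.randn(4, 256), torch.randn(4, 20, 256))
print(loss.item())
```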

More grounded image captioning by distilling image-text matching model

Y Zhou, M Wang, D Liu, Z Hu… - Proceedings of the …, 2020 - openaccess.thecvf.com
Visual attention not only improves the performance of image captioners, but also serves as a
visual interpretation to qualitatively measure the caption rationality and model transparency …
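
The distillation idea can be sketched as a KL term that pulls the captioner's per-word attention over regions toward alignment scores from a pretrained image-text matching teacher; the tensor shapes and loss form below are assumptions for illustration, not the paper's exact objective.

```python
# A minimal sketch of attention distillation for more grounded captioning
# (generic form, not the paper's exact objective): the captioner's per-word
# attention over image regions is pushed toward region-word alignment scores
# produced by a pretrained image-text matching teacher, via a KL term added to
# the usual captioning loss.
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attn, teacher_align, eps=1e-8):
    """student_attn: (B, T, N) captioner attention over N regions per word.
    teacher_align: (B, T, N) region-word alignment scores from the matcher."""
    teacher = torch.softmax(teacher_align, dim=-1)
    student_log = torch.log(student_attn + eps)
    return F.kl_div(student_log, teacher, reduction="batchmean")

student = torch.softmax(torch.randn(2, 12, 36), dim=-1)   # attention per generated word
teacher = torch.randn(2, 12, 36)                          # teacher alignment logits
print(attention_distillation_loss(student, teacher).item())
```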

Referring image segmentation using text supervision

F Liu, Y Liu, Y Kong, K Xu, L Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Existing Referring Image Segmentation (RIS) methods typically require expensive
pixel-level or box-level annotations for supervision. In this paper, we observe that the …

What does BERT with vision look at?

LH Li, M Yatskar, D Yin, CJ Hsieh… - Proceedings of the 58th …, 2020 - aclanthology.org
Pre-trained visually grounded language models such as ViLBERT, LXMERT, and UNITER
have achieved significant performance improvement on vision-and-language tasks but what …
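
The kind of probing such an analysis relies on can be sketched as plain tensor bookkeeping: given one layer's attention over a sequence laid out as text tokens followed by region tokens, read off which region each word attends to most; the layout and sizes below are assumed, not tied to any particular pretrained checkpoint.

```python
# Sketch of attention probing for visual grounding (generic tensor
# manipulation, not a specific model's API): slice out the text-to-region part
# of a layer's attention, average over heads, and report the most-attended
# region and the attention entropy per word.
import torch

def probe_attention(attn, num_text, num_regions):
    """attn: (heads, L, L) attention over a sequence laid out as
    [text tokens, region tokens]. Returns per-word argmax region and entropy."""
    text_to_region = attn[:, :num_text, num_text:num_text + num_regions]  # (H, T, N)
    avg = text_to_region.mean(dim=0)                                      # average heads
    avg = avg / avg.sum(dim=-1, keepdim=True).clamp_min(1e-8)             # renormalize
    entropy = -(avg * avg.clamp_min(1e-8).log()).sum(dim=-1)
    return avg.argmax(dim=-1), entropy

attn = torch.rand(12, 56, 56).softmax(dim=-1)   # e.g. 20 word + 36 region tokens
best_region, ent = probe_attention(attn, num_text=20, num_regions=36)
print(best_region, ent)
```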

Pseudo-Q: Generating pseudo language queries for visual grounding

H Jiang, Y Lin, D Han, S Song… - Proceedings of the …, 2022 - openaccess.thecvf.com
Visual grounding, i.e., localizing objects in images according to natural language queries, is
an important topic in visual language understanding. The most effective approaches for this …
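
The general recipe of turning detector outputs into free training queries can be sketched with simple templates; the template strings, the spatial_phrase helper, and the detection dictionary below are hypothetical stand-ins, not the paper's actual generation module.

```python
# Hedged sketch of template-based pseudo query generation (illustrative of the
# general idea, not the paper's exact pipeline): object labels, attributes and
# coarse spatial positions from an off-the-shelf detector are slotted into
# language templates to create query-region training pairs for free.
import random

TEMPLATES = [
    "{attr} {label}",
    "the {label} {position}",
    "the {attr} {label} {position}",
]

def spatial_phrase(box, image_width):
    """Coarse horizontal phrase from an (x1, y1, x2, y2) box."""
    cx = (box[0] + box[2]) / 2
    if cx < image_width / 3:
        return "on the left"
    if cx > 2 * image_width / 3:
        return "on the right"
    return "in the middle"

def pseudo_query(detection, image_width=640):
    """detection: dict with a class label, optional attribute, and a box."""
    filled = random.choice(TEMPLATES).format(
        label=detection["label"],
        attr=detection.get("attribute", ""),
        position=spatial_phrase(detection["box"], image_width),
    )
    return " ".join(filled.split())   # collapse double spaces from empty slots

det = {"label": "dog", "attribute": "brown", "box": (40, 80, 200, 300)}
print(pseudo_query(det))   # e.g. "the brown dog on the left"
```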

Contrastive learning for weakly supervised phrase grounding

T Gupta, A Vahdat, G Chechik, X Yang, J Kautz… - … on Computer Vision, 2020 - Springer
Phrase grounding, the problem of associating image regions to caption words, is a crucial
component of vision-language tasks. We show that phrase grounding can be learned by …
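
A generic version of learning grounding from a contrastive objective looks like the InfoNCE sketch below, where word-region attention is learned as a by-product of matching captions to their images within a batch; this is a simplified recipe under assumed tensor shapes, not the paper's exact formulation.

```python
# Hedged sketch of weakly supervised phrase grounding via an InfoNCE-style
# objective: each caption word attends over region features, the attended
# similarities are pooled into a caption-image compatibility, and matching
# pairs in a batch are contrasted against mismatched ones. The learned
# word-region attention is what gets read out as grounding.
import torch
import torch.nn.functional as F

def compatibility(words, regions):
    """words: (B, T, D), regions: (B, N, D) -> (B, B) caption-image scores and
    (B, B, T, N) word-region attention."""
    # Pairwise word-region similarities for every caption/image combination.
    sim = torch.einsum("btd,ind->bitn", words, regions)       # (B_cap, B_img, T, N)
    attn = sim.softmax(dim=-1)
    word_scores = (attn * sim).sum(dim=-1)                    # (B, B, T) attended score per word
    return word_scores.mean(dim=-1), attn                     # pool over words

def infonce_loss(words, regions, temperature=0.1):
    scores, _ = compatibility(words, regions)                 # (B, B)
    targets = torch.arange(scores.size(0))                    # matching pair on the diagonal
    return F.cross_entropy(scores / temperature, targets)

loss = infonce_loss(torch.randn(8, 12, 256), torch.randn(8, 36, 256))
print(loss.item())
```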