Referring image segmentation using text supervision

F Liu, Y Liu, Y Kong, K Xu, L Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Existing Referring Image Segmentation (RIS) methods typically require expensive
pixel-level or box-level annotations for supervision. In this paper, we observe that the …

Improved Visual Grounding through Self-Consistent Explanations

R He, P Cascante-Bonilla, Z Yang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-and-language models trained to match images with text can be combined with visual
explanation methods to point to the locations of specific objects in an image. Our work …

Improving visual grounding by encouraging consistent gradient-based explanations

Z Yang, K Kafle, F Dernoncourt… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose a margin-based loss for tuning joint vision-language models so that their
gradient-based explanations are consistent with region-level annotations provided by …

Weakly supervised referring image segmentation with intra-chunk and inter-chunk consistency

J Lee, S Lee, J Nam, S Yu, J Do… - Proceedings of the …, 2023 - openaccess.thecvf.com
Referring image segmentation (RIS) aims to localize the object in an image referred to by a
natural language expression. Most previous studies learn RIS with a large-scale dataset …

Visual cluster grounding for image captioning

W Jiang, M Zhu, Y Fang, G Shi… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Attention mechanisms have been extensively adopted in vision and language tasks such as
image captioning. They encourage a captioning model to dynamically ground appropriate …

Box-based refinement for weakly supervised and unsupervised localization tasks

E Gomel, T Shaharabany, L Wolf - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
It has been established that training a box-based detector network can enhance the
localization performance of weakly supervised and unsupervised methods. Moreover, we …

What is where by looking: Weakly-supervised open-world phrase-grounding without text inputs

T Shaharabany, Y Tewel, L Wolf - Advances in Neural …, 2022 - proceedings.neurips.cc
Given an input image, and nothing else, our method returns the bounding boxes of objects
in the image and phrases that describe the objects. This is achieved within an open world …

Similarity maps for self-training weakly-supervised phrase grounding

T Shaharabany, L Wolf - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
A phrase grounding model receives an input image and a text phrase and outputs a suitable
localization map. We present an effective way to refine a phrase grounding model by …

Weakly supervised grounding for VQA in vision-language transformers

AU Khan, H Kuehne, C Gan, NDV Lobo… - European Conference on …, 2022 - Springer
Transformers for vision-language representation learning have attracted significant interest
and have shown strong performance on visual question answering (VQA) and grounding …

Investigating Compositional Challenges in Vision-Language Models for Visual Grounding

Y Zeng, Y Huang, J Zhang, Z Jie… - Proceedings of the …, 2024 - openaccess.thecvf.com
Pre-trained vision-language models (VLMs) have achieved high performance on various
downstream tasks and have been widely used for visual grounding in a weakly …