Referring image segmentation using text supervision
Abstract Existing Referring Image Segmentation (RIS) methods typically require expensive
pixel-level or box-level annotations for supervision. In this paper, we observe that the …
Improved Visual Grounding through Self-Consistent Explanations
Vision-and-language models trained to match images with text can be combined with visual
explanation methods to point to the locations of specific objects in an image. Our work …
Improving visual grounding by encouraging consistent gradient-based explanations
We propose a margin-based loss for tuning joint vision-language models so that their
gradient-based explanations are consistent with region-level annotations provided by …
Weakly supervised referring image segmentation with intra-chunk and inter-chunk consistency
Referring image segmentation (RIS) aims to localize the object in an image referred to by a
natural language expression. Most previous studies learn RIS with a large-scale dataset …
Visual cluster grounding for image captioning
Attention mechanisms have been extensively adopted in vision and language tasks such as
image captioning. They encourage a captioning model to dynamically ground appropriate …
Box-based refinement for weakly supervised and unsupervised localization tasks
It has been established that training a box-based detector network can enhance the
localization performance of weakly supervised and unsupervised methods. Moreover, we …
What is where by looking: Weakly-supervised open-world phrase-grounding without text inputs
Given an input image, and nothing else, our method returns the bounding boxes of objects
in the image and phrases that describe the objects. This is achieved within an open world …
Similarity maps for self-training weakly-supervised phrase grounding
T Shaharabany, L Wolf - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
A phrase grounding model receives an input image and a text phrase and outputs a suitable
localization map. We present an effective way to refine a phrase grounding model by …
Weakly supervised grounding for VQA in vision-language transformers
Transformers for visual-language representation learning have attracted considerable interest
and have shown strong performance on visual question answering (VQA) and grounding …
Investigating Compositional Challenges in Vision-Language Models for Visual Grounding
Pre-trained vision-language models (VLMs) have achieved high performance on various
downstream tasks and have been widely used for visual grounding in a weakly …