Referring image segmentation using text supervision

F Liu, Y Liu, Y Kong, K Xu, L Zhang… - Proceedings of the …, 2023 - openaccess.thecvf.com
Existing Referring Image Segmentation (RIS) methods typically require expensive
pixel-level or box-level annotations for supervision. In this paper, we observe that the …

Improved Visual Grounding through Self-Consistent Explanations

R He, P Cascante-Bonilla, Z Yang… - Proceedings of the …, 2024 - openaccess.thecvf.com
Vision-and-language models trained to match images with text can be combined with visual
explanation methods to point to the locations of specific objects in an image. Our work …

Improving visual grounding by encouraging consistent gradient-based explanations

Z Yang, K Kafle, F Dernoncourt… - Proceedings of the …, 2023 - openaccess.thecvf.com
We propose a margin-based loss for tuning joint vision-language models so that their
gradient-based explanations are consistent with region-level annotations provided by …

Weakly supervised referring image segmentation with intra-chunk and inter-chunk consistency

J Lee, S Lee, J Nam, S Yu, J Do… - Proceedings of the …, 2023 - openaccess.thecvf.com
Referring image segmentation (RIS) aims to localize the object in an image referred to by a
natural language expression. Most previous studies learn RIS with a large-scale dataset …

Visual cluster grounding for image captioning

W Jiang, M Zhu, Y Fang, G Shi… - IEEE Transactions on …, 2022 - ieeexplore.ieee.org
Attention mechanisms have been extensively adopted in vision and language tasks such as
image captioning. They encourage a captioning model to dynamically ground appropriate …

Box-based refinement for weakly supervised and unsupervised localization tasks

E Gomel, T Shaharabany, L Wolf - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
It has been established that training a box-based detector network can enhance the
localization performance of weakly supervised and unsupervised methods. Moreover, we …

What is where by looking: Weakly-supervised open-world phrase-grounding without text inputs

T Shaharabany, Y Tewel, L Wolf - Advances in Neural …, 2022 - proceedings.neurips.cc
Given an input image, and nothing else, our method returns the bounding boxes of objects
in the image and phrases that describe the objects. This is achieved within an open world …

Similarity maps for self-training weakly-supervised phrase grounding

T Shaharabany, L Wolf - … of the IEEE/CVF Conference on …, 2023 - openaccess.thecvf.com
A phrase grounding model receives an input image and a text phrase and outputs a suitable
localization map. We present an effective way to refine a phrase grounding model by …

Weakly supervised grounding for VQA in vision-language transformers

AU Khan, H Kuehne, C Gan, NDV Lobo… - European Conference on …, 2022 - Springer
Transformers for vision-language representation learning have attracted significant interest
and have shown strong performance on visual question answering (VQA) and grounding …

Investigating Compositional Challenges in Vision-Language Models for Visual Grounding

Y Zeng, Y Huang, J Zhang, Z Jie… - Proceedings of the …, 2024 - openaccess.thecvf.com
Pre-trained vision-language models (VLMs) have achieved high performance on various
downstream tasks and have been widely used for visual grounding in a weakly …