Zero-shot referring expression comprehension via structural similarity between images and captions

Z Han, F Zhu, Q Lao, H Jiang - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Zero-shot referring expression comprehension aims at localizing bounding boxes in an
image corresponding to provided textual prompts, which requires: (i) a fine-grained …

Is BERT blind? Exploring the effect of vision-and-language pretraining on visual language understanding

M Alper, M Fiman… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Most humans use visual imagination to understand and reason about language, but models
such as BERT reason about language using knowledge acquired during text-only …

JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups

S Jahangard, Z Cai, S Wen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Understanding human social behaviour is crucial in computer vision and robotics.
Micro-level observations like individual actions fall short, necessitating a comprehensive approach …

Weakly supervised face naming with symmetry-enhanced contrastive loss

T Qu, T Tuytelaars, MF Moens - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
We revisit the weakly supervised cross-modal face-name alignment task; that is, given an
image and a caption, we label the faces in the image with the names occurring in the …

Learning Human-Human Interactions in Images from Weak Textual Supervision

M Alper, H Averbuch-Elor - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Interactions between humans are diverse and context-dependent, but previous works have
treated them as categorical, disregarding the heavy tail of possible interactions. We propose …

Find someone who: Visual commonsense understanding in human-centric grounding

H You, R Sun, Z Wang, KW Chang… - arXiv preprint arXiv …, 2022 - arxiv.org
From a visual scene containing multiple people, a human is able to distinguish each individual
given the context descriptions about what happened before, their mental/physical states or …

What's in a Decade? Transforming Faces Through Time

EM Chen, J Sun, A Khandelwal… - Computer Graphics …, 2023 - Wiley Online Library
How can one visually characterize photographs of people over time? In this work, we
describe the Faces Through Time dataset, which contains over a thousand portrait images …

Semi-supervised multimodal coreference resolution in image narrations

A Goel, B Fernando, F Keller, H Bilen - arXiv preprint arXiv:2310.13619, 2023 - arxiv.org
In this paper, we study multimodal coreference resolution, specifically where a longer
descriptive text, i.e., a narration is paired with an image. This poses significant challenges …

To Find Waldo You Need Contextual Cues: Debiasing Who's Waldo

Y Luo, P Banerjee, T Gokhale, Y Yang… - arXiv preprint arXiv …, 2022 - arxiv.org
We present a debiased dataset for the Person-centric Visual Grounding (PCVG) task first
proposed by Cui et al. (2021) in the Who's Waldo dataset. Given an image and a caption …

Who are you referring to? Coreference resolution in image narrations

A Goel, B Fernando, F Keller… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Coreference resolution aims to identify words and phrases which refer to the same entity in a
text, a core task in natural language processing. In this paper, we extend this task to …