Zero-shot referring expression comprehension via structural similarity between images and captions
Zero-shot referring expression comprehension aims at localizing bounding boxes in an
image corresponding to provided textual prompts, which requires: (i) a fine-grained …
Is BERT blind? Exploring the effect of vision-and-language pretraining on visual language understanding
M Alper, M Fiman… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Most humans use visual imagination to understand and reason about language, but models
such as BERT reason about language using knowledge acquired during text-only …
JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups
S Jahangard, Z Cai, S Wen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Understanding human social behaviour is crucial in computer vision and robotics. Micro-level
observations like individual actions fall short, necessitating a comprehensive approach …
Weakly supervised face naming with symmetry-enhanced contrastive loss
We revisit the weakly supervised cross-modal face-name alignment task; that is, given an
image and a caption, we label the faces in the image with the names occurring in the …
Learning Human-Human Interactions in Images from Weak Textual Supervision
M Alper, H Averbuch-Elor - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Interactions between humans are diverse and context-dependent, but previous works have
treated them as categorical, disregarding the heavy tail of possible interactions. We propose …
Find someone who: Visual commonsense understanding in human-centric grounding
From a visual scene containing multiple people, humans are able to distinguish each individual
given the context descriptions about what happened before, their mental/physical states, or …
What's in a Decade? Transforming Faces Through Time
How can one visually characterize photographs of people over time? In this work, we
describe the Faces Through Time dataset, which contains over a thousand portrait images …
Semi-supervised multimodal coreference resolution in image narrations
In this paper, we study multimodal coreference resolution, specifically where a longer
descriptive text, i.e., a narration, is paired with an image. This poses significant challenges …
To Find Waldo You Need Contextual Cues: Debiasing Who's Waldo
We present a debiased dataset for the Person-centric Visual Grounding (PCVG) task, first
proposed by Cui et al. (2021) in the Who's Waldo dataset. Given an image and a caption …
Who are you referring to? Coreference resolution in image narrations
Coreference resolution aims to identify words and phrases which refer to the same entity in a
text, a core task in natural language processing. In this paper, we extend this task to …