Zero-shot referring expression comprehension via structural similarity between images and captions
Zero-shot referring expression comprehension aims at localizing bounding boxes in an
image corresponding to provided textual prompts, which requires: (i) a fine-grained …
Is BERT blind? Exploring the effect of vision-and-language pretraining on visual language understanding
M Alper, M Fiman… - Proceedings of the IEEE …, 2023 - openaccess.thecvf.com
Most humans use visual imagination to understand and reason about language, but models
such as BERT reason about language using knowledge acquired during text-only …
JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups
S Jahangard, Z Cai, S Wen… - Proceedings of the …, 2024 - openaccess.thecvf.com
Understanding human social behaviour is crucial in computer vision and robotics. Micro-level
observations like individual actions fall short, necessitating a comprehensive approach …
Weakly supervised face naming with symmetry-enhanced contrastive loss
We revisit the weakly supervised cross-modal face-name alignment task; that is, given an
image and a caption, we label the faces in the image with the names occurring in the …
Learning Human-Human Interactions in Images from Weak Textual Supervision
M Alper, H Averbuch-Elor - Proceedings of the IEEE/CVF …, 2023 - openaccess.thecvf.com
Interactions between humans are diverse and context-dependent, but previous works have
treated them as categorical, disregarding the heavy tail of possible interactions. We propose …
Find someone who: Visual commonsense understanding in human-centric grounding
From a visual scene containing multiple people, humans are able to distinguish each individual
given the context descriptions about what happened before, their mental/physical states, or …
What's in a Decade? Transforming Faces Through Time
How can one visually characterize photographs of people over time? In this work, we
describe the Faces Through Time dataset, which contains over a thousand portrait images …
Semi-supervised multimodal coreference resolution in image narrations
In this paper, we study multimodal coreference resolution, specifically where a longer
descriptive text, i.e., a narration, is paired with an image. This poses significant challenges …
To Find Waldo You Need Contextual Cues: Debiasing Who's Waldo
We present a debiased dataset for the Person-centric Visual Grounding (PCVG) task, first
proposed by Cui et al. (2021) in the Who's Waldo dataset. Given an image and a caption …
Who are you referring to? Coreference resolution in image narrations
Coreference resolution aims to identify words and phrases which refer to the same entity in a
text, a core task in natural language processing. In this paper, we extend this task to …