Vision transformers need registers

T Darcet, M Oquab, J Mairal, P Bojanowski - arXiv preprint arXiv …, 2023 - arxiv.org
Transformers have recently emerged as a powerful tool for learning visual representations.
In this paper, we identify and characterize artifacts in feature maps of both supervised and …

Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language

M Hamilton, A Zisserman… - Proceedings of the …, 2024 - openaccess.thecvf.com
We present DenseAV, a novel dual encoder grounding architecture that learns
high-resolution, semantically meaningful, and audio-visual aligned features solely through …

On Train-Test Class Overlap and Detection for Image Retrieval

CH Song, J Yoon, T Hwang, S Choi… - Proceedings of the …, 2024 - openaccess.thecvf.com
How important is it for training and evaluation sets to not have class overlap in image
retrieval? We revisit Google Landmarks v2 clean, the most popular training set, by identifying …

ULTRON: Unifying Local Transformer and Convolution for Large-scale Image Retrieval

M Kweon, J Park - Proceedings of the Asian Conference on …, 2024 - openaccess.thecvf.com
In large-scale image retrieval, the primary goal is to extract discriminative features and
embed them into global image representations. Previous methods based on CNNs …

Occlusion-Aware Seamless Segmentation

Y Cao, J Zhang, H Shi, K Peng, Y Zhang… - arXiv preprint arXiv …, 2024 - Springer
Panoramic images can broaden the Field of View (FoV), occlusion-aware prediction can
deepen the understanding of the scene, and domain adaptation can transfer across viewing …

A research for sound event localization and detection based on local–global adaptive fusion and temporal importance network

D Shi, M Guo, M Ma - Multimedia Systems, 2024 - Springer
Sound event localization and detection systems can provide intelligent sound processing
and analysis functions for various application devices. However, existing deep learning …