Self-supervised learning of audio-visual objects from video

PH Le-Khac, G Healy, AF Smeaton - IEEE Access, 2020 - ieeexplore.ieee.org

Contrastive Learning has recently received interest due to its success in self-supervised
representation learning in the computer vision domain. However, the origins of Contrastive …

Cited by 686 · Related articles · All 10 versions

Self-supervised learning for videos: A survey

MC Schiappa, YS Rawat, M Shah - ACM Computing Surveys, 2023 - dl.acm.org

The remarkable success of deep learning in various domains relies on the availability of
large-scale annotated datasets. However, obtaining annotations is expensive and requires …

Cited by 105 · Related articles · All 4 versions

Ego4d: Around the world in 3,000 hours of egocentric video

K Grauman, A Westbury, E Byrne… - Proceedings of the …, 2022 - openaccess.thecvf.com

We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …

Cited by 685 · Related articles · All 13 versions

Concealed object detection

DP Fan, GP Ji, MM Cheng… - IEEE transactions on …, 2021 - ieeexplore.ieee.org

We present the first systematic study on concealed object detection (COD), which aims to
identify objects that are visually embedded in their background. The high intrinsic similarities …

Cited by 336 · Related articles · All 11 versions

Keeping your eye on the ball: Trajectory attention in video transformers

M Patrick, D Campbell, Y Asano… - Advances in neural …, 2021 - proceedings.neurips.cc

In video transformers, the time dimension is often treated in the same way as the two spatial
dimensions. However, in a scene where objects or the camera may move, a physical point …

Cited by 242 · Related articles · All 13 versions

Audio–visual segmentation

J Zhou, J Wang, J Zhang, W Sun, J Zhang… - … on Computer Vision, 2022 - Springer

We propose to explore a new problem called audio-visual segmentation (AVS), in which the
goal is to output a pixel-level map of the object(s) that produce sound at the time of the …

Cited by 101 · Related articles · All 5 versions

Vision transformers are parameter-efficient audio-visual learners

YB Lin, YL Sung, J Lei, M Bansal… - Proceedings of the …, 2023 - openaccess.thecvf.com

Vision transformers (ViTs) have achieved impressive results on various computer vision
tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …

Cited by 50 · Related articles · All 5 versions

Localizing objects with self-supervised transformers and no labels

O Siméoni, G Puy, HV Vo, S Roburin, S Gidaris… - arXiv preprint arXiv …, 2021 - arxiv.org

Localizing objects in image collections without supervision can help to avoid expensive
annotation campaigns. We propose a simple approach to this problem, that leverages the …

Cited by 160 · Related articles · All 8 versions

Localizing visual sounds the hard way

H Chen, W Xie, T Afouras, A Nagrani… - Proceedings of the …, 2021 - openaccess.thecvf.com

The objective of this work is to localize sound sources that are visible in a video without
using manual annotations. Our key technical contribution is to show that, by training the …

Cited by 177 · Related articles · All 7 versions

Is someone speaking? exploring long-term temporal features for audio-visual active speaker detection

R Tao, Z Pan, RK Das, X Qian, MZ Shou… - Proceedings of the 29th …, 2021 - dl.acm.org

Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or
more speakers. The successful ASD depends on accurate interpretation of short-term and …

Cited by 154 · Related articles · All 5 versions