Contrastive representation learning: A framework and review

PH Le-Khac, G Healy, AF Smeaton - IEEE Access, 2020 - ieeexplore.ieee.org
Contrastive Learning has recently received interest due to its success in self-supervised
representation learning in the computer vision domain. However, the origins of Contrastive …

Self-supervised learning for videos: A survey

MC Schiappa, YS Rawat, M Shah - ACM Computing Surveys, 2023 - dl.acm.org
The remarkable success of deep learning in various domains relies on the availability of
large-scale annotated datasets. However, obtaining annotations is expensive and requires …

Ego4D: Around the world in 3,000 hours of egocentric video

K Grauman, A Westbury, E Byrne… - Proceedings of the …, 2022 - openaccess.thecvf.com
We introduce Ego4D, a massive-scale egocentric video dataset and benchmark suite. It
offers 3,670 hours of daily-life activity video spanning hundreds of scenarios (household …

Concealed object detection

DP Fan, GP Ji, MM Cheng… - IEEE transactions on …, 2021 - ieeexplore.ieee.org
We present the first systematic study on concealed object detection (COD), which aims to
identify objects that are visually embedded in their background. The high intrinsic similarities …

Keeping your eye on the ball: Trajectory attention in video transformers

M Patrick, D Campbell, Y Asano… - Advances in neural …, 2021 - proceedings.neurips.cc
In video transformers, the time dimension is often treated in the same way as the two spatial
dimensions. However, in a scene where objects or the camera may move, a physical point …

Audio–visual segmentation

J Zhou, J Wang, J Zhang, W Sun, J Zhang… - … on Computer Vision, 2022 - Springer
We propose to explore a new problem called audio-visual segmentation (AVS), in which the
goal is to output a pixel-level map of the object(s) that produce sound at the time of the …

Vision transformers are parameter-efficient audio-visual learners

YB Lin, YL Sung, J Lei, M Bansal… - Proceedings of the …, 2023 - openaccess.thecvf.com
Vision transformers (ViTs) have achieved impressive results on various computer vision
tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained …

Localizing objects with self-supervised transformers and no labels

O Siméoni, G Puy, HV Vo, S Roburin, S Gidaris… - arXiv preprint arXiv …, 2021 - arxiv.org
Localizing objects in image collections without supervision can help to avoid expensive
annotation campaigns. We propose a simple approach to this problem, that leverages the …

Localizing visual sounds the hard way

H Chen, W Xie, T Afouras, A Nagrani… - Proceedings of the …, 2021 - openaccess.thecvf.com
The objective of this work is to localize sound sources that are visible in a video without
using manual annotations. Our key technical contribution is to show that, by training the …

Is someone speaking? Exploring long-term temporal features for audio-visual active speaker detection

R Tao, Z Pan, RK Das, X Qian, MZ Shou… - Proceedings of the 29th …, 2021 - dl.acm.org
Active speaker detection (ASD) seeks to detect who is speaking in a visual scene of one or
more speakers. The successful ASD depends on accurate interpretation of short-term and …