Audio self-supervised learning: A survey

S Liu, A Mallol-Ragolta, E Parada-Cabaleiro, K Qian… - Patterns, 2022 - cell.com
Similar to humans' cognitive ability to generalize knowledge and skills, self-supervised
learning (SSL) targets discovering general representations from large-scale data. This …

Self-supervised learning of audio-visual objects from video

T Afouras, A Owens, JS Chung, A Zisserman - Computer Vision–ECCV …, 2020 - Springer
Our objective is to transform a video into a set of discrete audio-visual objects using self-
supervised learning. To this end, we introduce a model that uses attention to localize and …

Audio-visual scene analysis with self-supervised multisensory features

A Owens, AA Efros - Proceedings of the European …, 2018 - openaccess.thecvf.com
The thud of a bouncing ball, the onset of speech as lips open--when visual and audio events
occur together, it suggests that there might be a common, underlying event that produced …

Learning problem-agnostic speech representations from multiple self-supervised tasks

S Pascual, M Ravanelli, J Serra, A Bonafonte… - arXiv preprint arXiv …, 2019 - arxiv.org
Learning good representations without supervision is still an open issue in machine
learning, and is particularly challenging for speech signals, which are often characterized by …

Sound to visual scene generation by audio-to-visual latent alignment

K Sung-Bin, A Senocak, H Ha… - Proceedings of the …, 2023 - openaccess.thecvf.com
How does audio describe the world around us? In this paper, we propose a method for
generating an image of a scene from sound. Our method addresses the challenges of …

Speech2Face: Learning the face behind a voice

TH Oh, T Dekel, C Kim, I Mosseri… - Proceedings of the …, 2019 - openaccess.thecvf.com
How much can we infer about a person's looks from the way they speak? In this paper, we
study the task of reconstructing a facial image of a person from a short audio recording of …

Distilling audio-visual knowledge by compositional contrastive learning

Y Chen, Y Xian, A Koepke, Y Shan… - Proceedings of the …, 2021 - openaccess.thecvf.com
Having access to multi-modal cues (e.g., vision and audio) empowers some cognitive tasks to
be done faster compared to learning from a single modality. In this work, we propose to …

Audio-visual generalised zero-shot learning with cross-modal attention and language

OB Mercea, L Riesch, A Koepke… - Proceedings of the …, 2022 - openaccess.thecvf.com
Learning to classify video data from classes not included in the training data, i.e., video-based
zero-shot learning, is challenging. We conjecture that the natural alignment between the …

ContIG: Self-supervised multimodal contrastive learning for medical imaging with genetics

A Taleb, M Kirchler, R Monti… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com
High annotation costs are a substantial bottleneck in applying modern deep learning
architectures to clinically relevant medical use cases, substantiating the need for novel …

Multimodal self-supervised learning for medical image analysis

A Taleb, C Lippert, T Klein, M Nabi - International conference on …, 2021 - Springer
Self-supervised learning approaches leverage unlabeled samples to acquire generic
knowledge about different concepts, hence allowing for annotation-efficient downstream …