Learning sight from sound: Ambient sound provides supervision for visual learning

S Liu, A Mallol-Ragolta, E Parada-Cabaleiro, K Qian… - Patterns, 2022 - cell.com

Similar to humans' cognitive ability to generalize knowledge and skills, self-supervised
learning (SSL) targets discovering general representations from large-scale data. This …

Cited by 119 - Related articles - All 12 versions

Self-supervised learning of audio-visual objects from video

T Afouras, A Owens, JS Chung, A Zisserman - Computer Vision–ECCV …, 2020 - Springer

Our objective is to transform a video into a set of discrete audio-visual objects using
self-supervised learning. To this end, we introduce a model that uses attention to localize and …

Cited by 286 - Related articles - All 8 versions

Audio-visual scene analysis with self-supervised multisensory features

A Owens, AA Efros - Proceedings of the European …, 2018 - openaccess.thecvf.com

The thud of a bouncing ball, the onset of speech as lips open--when visual and audio events
occur together, it suggests that there might be a common, underlying event that produced …

Cited by 894 - Related articles - All 8 versions

Learning problem-agnostic speech representations from multiple self-supervised tasks

S Pascual, M Ravanelli, J Serra, A Bonafonte… - arXiv preprint arXiv …, 2019 - arxiv.org

Learning good representations without supervision is still an open issue in machine
learning, and is particularly challenging for speech signals, which are often characterized by …

Cited by 277 - Related articles - All 10 versions

Sound to visual scene generation by audio-to-visual latent alignment

K Sung-Bin, A Senocak, H Ha… - Proceedings of the …, 2023 - openaccess.thecvf.com

How does audio describe the world around us? In this paper, we propose a method for
generating an image of a scene from sound. Our method addresses the challenges of …

Cited by 32 - Related articles - All 6 versions

Speech2Face: Learning the face behind a voice

TH Oh, T Dekel, C Kim, I Mosseri… - Proceedings of the …, 2019 - openaccess.thecvf.com

How much can we infer about a person's looks from the way they speak? In this paper, we
study the task of reconstructing a facial image of a person from a short audio recording of …

Cited by 213 - Related articles - All 10 versions

Distilling audio-visual knowledge by compositional contrastive learning

Y Chen, Y Xian, A Koepke, Y Shan… - Proceedings of the …, 2021 - openaccess.thecvf.com

Having access to multi-modal cues (e.g. vision and audio) empowers some cognitive tasks to
be done faster compared to learning from a single modality. In this work, we propose to …

Cited by 90 - Related articles - All 9 versions

Audio-visual generalised zero-shot learning with cross-modal attention and language

OB Mercea, L Riesch, A Koepke… - Proceedings of the …, 2022 - openaccess.thecvf.com

Learning to classify video data from classes not included in the training data, i.e. video-based
zero-shot learning, is challenging. We conjecture that the natural alignment between the …

Cited by 59 - Related articles - All 8 versions

ContIG: Self-supervised multimodal contrastive learning for medical imaging with genetics

A Taleb, M Kirchler, R Monti… - Proceedings of the IEEE …, 2022 - openaccess.thecvf.com

High annotation costs are a substantial bottleneck in applying modern deep learning
architectures to clinically relevant medical use cases, substantiating the need for novel …

Cited by 68 - Related articles - All 5 versions

Multimodal self-supervised learning for medical image analysis

A Taleb, C Lippert, T Klein, M Nabi - International conference on …, 2021 - Springer

Self-supervised learning approaches leverage unlabeled samples to acquire generic
knowledge about different concepts, hence allowing for annotation-efficient downstream …

Cited by 144 - Related articles - All 6 versions