[HTML][HTML] Audio self-supervised learning: A survey
Similar to humans' cognitive ability to generalize knowledge and skills, self-supervised
learning (SSL) targets discovering general representations from large-scale data. This …
learning (SSL) targets discovering general representations from large-scale data. This …
Self-supervised learning of audio-visual objects from video
Our objective is to transform a video into a set of discrete audio-visual objects using self-
supervised learning. To this end, we introduce a model that uses attention to localize and …
supervised learning. To this end, we introduce a model that uses attention to localize and …
Audio-visual scene analysis with self-supervised multisensory features
The thud of a bouncing ball, the onset of speech as lips open--when visual and audio events
occur together, it suggests that there might be a common, underlying event that produced …
occur together, it suggests that there might be a common, underlying event that produced …
Learning problem-agnostic speech representations from multiple self-supervised tasks
Learning good representations without supervision is still an open issue in machine
learning, and is particularly challenging for speech signals, which are often characterized by …
learning, and is particularly challenging for speech signals, which are often characterized by …
Sound to visual scene generation by audio-to-visual latent alignment
How does audio describe the world around us? In this paper, we propose a method for
generating an image of a scene from sound. Our method addresses the challenges of …
generating an image of a scene from sound. Our method addresses the challenges of …
Speech2face: Learning the face behind a voice
How much can we infer about a person's looks from the way they speak? In this paper, we
study the task of reconstructing a facial image of a person from a short audio recording of …
study the task of reconstructing a facial image of a person from a short audio recording of …
Distilling audio-visual knowledge by compositional contrastive learning
Having access to multi-modal cues (eg vision and audio) empowers some cognitive tasks to
be done faster compared to learning from a single modality. In this work, we propose to …
be done faster compared to learning from a single modality. In this work, we propose to …
Audio-visual generalised zero-shot learning with cross-modal attention and language
Learning to classify video data from classes not included in the training data, ie video-based
zero-shot learning, is challenging. We conjecture that the natural alignment between the …
zero-shot learning, is challenging. We conjecture that the natural alignment between the …
Contig: Self-supervised multimodal contrastive learning for medical imaging with genetics
High annotation costs are a substantial bottleneck in applying modern deep learning
architectures to clinically relevant medical use cases, substantiating the need for novel …
architectures to clinically relevant medical use cases, substantiating the need for novel …
Multimodal self-supervised learning for medical image analysis
Self-supervised learning approaches leverage unlabeled samples to acquire generic
knowledge about different concepts, hence allowing for annotation-efficient downstream …
knowledge about different concepts, hence allowing for annotation-efficient downstream …