Multimodal image synthesis and editing: A survey and taxonomy
As information exists in various modalities in real world, effective interaction and fusion
among multimodal information plays a key role for the creation and perception of multimodal …
among multimodal information plays a key role for the creation and perception of multimodal …
An overview of deep-learning-based audio-visual speech enhancement and separation
Speech enhancement and speech separation are two related tasks, whose purpose is to
extract either one or more target speech signals, respectively, from a mixture of sounds …
extract either one or more target speech signals, respectively, from a mixture of sounds …
A large-scale study on unsupervised spatiotemporal representation learning
We present a large-scale study on unsupervised spatiotemporal representation learning
from videos. With a unified perspective on four recent image-based frameworks, we study a …
from videos. With a unified perspective on four recent image-based frameworks, we study a …
Contrastive multiview coding
Humans view the world through many sensory channels, eg, the long-wavelength light
channel, viewed by the left eye, or the high-frequency vibrations channel, heard by the right …
channel, viewed by the left eye, or the high-frequency vibrations channel, heard by the right …
Vggsound: A large-scale audio-visual dataset
Our goal is to collect a large-scale audio-visual dataset with low label noise from videosin
the wild'using computer vision techniques. The resulting dataset can be used for training …
the wild'using computer vision techniques. The resulting dataset can be used for training …
Videobert: A joint model for video and language representation learning
Self-supervised learning has become increasingly important to leverage the abundance of
unlabeled data available on platforms like YouTube. Whereas most existing approaches …
unlabeled data available on platforms like YouTube. Whereas most existing approaches …
Space-time correspondence as a contrastive random walk
This paper proposes a simple self-supervised approach for learning a representation for
visual correspondence from raw video. We cast correspondence as prediction of links in a …
visual correspondence from raw video. We cast correspondence as prediction of links in a …
Self-supervised learning of audio-visual objects from video
Our objective is to transform a video into a set of discrete audio-visual objects using self-
supervised learning. To this end, we introduce a model that uses attention to localize and …
supervised learning. To this end, we introduce a model that uses attention to localize and …
[HTML][HTML] Machine learning in acoustics: Theory and applications
Acoustic data provide scientific and engineering insights in fields ranging from biology and
communications to ocean and Earth science. We survey the recent advances and …
communications to ocean and Earth science. We survey the recent advances and …
Audio-visual scene analysis with self-supervised multisensory features
The thud of a bouncing ball, the onset of speech as lips open--when visual and audio events
occur together, it suggests that there might be a common, underlying event that produced …
occur together, it suggests that there might be a common, underlying event that produced …